gwaybio closed this 6 years ago
In initial experiments with this kind of simulated data, it has become clear that the simulated features are too simplistic. The evals based on these signals basically boil down to asking whether the models learn simple distributions. An initial test (without "noise" features) shows that the compression algorithms learn these structures rather easily.
Figure: absolute value correlation matrix of compression features and injected signals
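For reference, the kind of check behind that correlation matrix can be sketched roughly as follows. This is a minimal illustration and not the code from the simulation script: PCA (via SVD) stands in for the compression algorithms, and the function names, signal strengths, and noise level are all assumptions made for the example.

```python
import numpy as np


def simulate_with_signals(n_samples=200, n_features=50, n_signals=2, seed=0):
    """Inject latent signals into disjoint feature blocks (hypothetical design)."""
    rng = np.random.default_rng(seed)
    signals = rng.normal(size=(n_samples, n_signals))
    loadings = np.zeros((n_signals, n_features))
    block = n_features // n_signals
    for k in range(n_signals):
        # Assumption: each signal gets a different strength so PCA can separate them.
        loadings[k, k * block:(k + 1) * block] = k + 1.0
    noise = 0.1 * rng.normal(size=(n_samples, n_features))
    return signals @ loadings + noise, signals


def pca_compress(data, n_components):
    """Return the top principal component scores (stand-in compression model)."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def abs_correlation(compressed, signals):
    """Absolute Pearson correlation, compressed features (rows) by signals (columns)."""
    n_z = compressed.shape[1]
    full = np.corrcoef(compressed, signals, rowvar=False)
    return np.abs(full[:n_z, n_z:])


data, signals = simulate_with_signals()
corr = abs_correlation(pca_compress(data, 2), signals)
print(np.round(corr, 2))
```

If every injected signal has a compressed feature with absolute correlation close to 1, the model has recovered it, which is the "too easy" behavior described above.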
To better assess model performance, we need to either 1) use more complicated simulations (like WGCNA or RUVcorr, though I don't think we need even more complicated simulators like BEERS or Polyester) or 2) identify or modify real datasets with known or artificially induced conditions to evaluate against.
We determined this with @huqiwen0313 (and Nicholas Lahens) today. We also decided that the primary eval should be LSA. The module-based structure of WGCNA is currently a promising lead. We can also assess the ability of the compression algorithms to identify these simulated modules.
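One way to frame the module-recovery eval is sketched below. It is a toy setup under stated assumptions, not WGCNA's simulator: each module is a set of genes sharing one latent factor, PCA again stands in for the compression algorithms, and all names and parameter values (`simulate_modules`, `module_purity`, module strengths, noise level) are hypothetical.

```python
import numpy as np


def simulate_modules(n_samples=300, genes_per_module=20, n_modules=3,
                     noise_sd=0.5, seed=0):
    """Simulate genes grouped into modules, each driven by one latent factor."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, n_modules))
    labels = np.repeat(np.arange(n_modules), genes_per_module)
    # Assumption: modules differ in strength so their components are separable.
    strengths = np.arange(1, n_modules + 1, dtype=float)
    data = latent[:, labels] * strengths[labels]
    data = data + noise_sd * rng.normal(size=data.shape)
    return data, labels


def pca_loadings(data, n_components):
    """Gene-by-component loading matrix from PCA (stand-in compression model)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T


def module_purity(loadings, labels):
    """Per module, the share of genes whose top-loading component agrees."""
    assigned = np.abs(loadings).argmax(axis=1)
    purity = {}
    for module in np.unique(labels):
        counts = np.bincount(assigned[labels == module],
                             minlength=loadings.shape[1])
        purity[int(module)] = counts.max() / counts.sum()
    return purity


data, labels = simulate_modules()
print(module_purity(pca_loadings(data, 3), labels))
```

A purity near 1 for every module means the compressed features line up with the simulated modules. Harder settings (overlapping modules, equal strengths, more noise) would be the interesting cases to compare algorithms on.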
Future updates will impact the recently added simulation script, but the proposed DataModels class should still be valid.
Introduced in #103: a simulation experiment is worth exploring. Currently, the simulation experiment samples the following signals (copied and pasted from #103):
As mentioned, some of the evals envisioned include:
Group 1: Yes Feature
Group 1: No Feature
Group 2: No Feature
Group 2: Yes Feature
As suggested by @jaclyn-taroni, we will use this issue to track important citations and discuss rationale for the simulations and evals.