greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

Simulation Experiments #104

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

A simulation experiment, introduced in #103, is worth exploring. Currently, the simulation samples the following signals (copied and pasted from #103):

  1. Groups (based on different normal distributions along given axes)
  2. Nonlinear functions (any transformation to a randomly sampled vector)
  3. Cell-types (based on different proportions of normal distributions across given axes)
  4. Presence/absence (random determination if feature is present)
  5. Random noise features
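
For discussion purposes, the five signal types above could be sketched roughly as follows. This is not the simulation script from #103: the function name `simulate_signals`, the column names, and every distribution parameter (means, the sine transform, the 50% presence rate) are assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
import math
import random


def simulate_signals(n_samples=200, seed=0):
    """Simulate one toy column per signal type; returns {name: list of floats}."""
    rng = random.Random(seed)

    # 1. Groups: first half of the samples is group 0, second half group 1,
    #    each drawn from a normal distribution with a different mean.
    group = [0 if i < n_samples // 2 else 1 for i in range(n_samples)]
    group_feature = [rng.gauss(3.0 * g, 1.0) for g in group]

    # 2. Nonlinear function: sine of a uniformly sampled vector.
    base = [rng.uniform(-math.pi, math.pi) for _ in range(n_samples)]
    nonlinear = [math.sin(x) for x in base]

    # 3. Cell types: a per-sample proportion mixing two normal distributions.
    proportion = [rng.random() for _ in range(n_samples)]
    cell_type = [
        p * rng.gauss(0.0, 1.0) + (1.0 - p) * rng.gauss(5.0, 1.0)
        for p in proportion
    ]

    # 4. Presence/absence: the feature is switched on at random.
    presence = [float(rng.random() < 0.5) for _ in range(n_samples)]

    # 5. Random noise, unrelated to any of the signals above.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    return {
        "group": [float(g) for g in group],
        "group_feature": group_feature,
        "nonlinear_base": base,
        "nonlinear": nonlinear,
        "cell_type_proportion": proportion,
        "cell_type": cell_type,
        "presence": presence,
        "noise": noise,
    }
```

Keeping the generating variables (`group`, `nonlinear_base`, `cell_type_proportion`) next to the observed columns makes it possible to compare learned features against the injected signals later.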

As mentioned, some of the evals envisioned include:

  1. Ability to distinguish groups
  2. Ability to identify nonlinear but continuous patterns
  3. Robustness to noise injection
  4. Latent space arithmetic (e.g. Group 1: Yes Feature - Group 1: No Feature + Group 2: No Feature = Group 2: Yes Feature?)
  5. Generative ability of models
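
To make the latent space arithmetic eval (item 4) concrete, here is a minimal sketch that works on group centroids in a latent space. The helper names (`centroid`, `latent_arithmetic`, `distance`, `toy_codes`) and the 2-D toy codes are hypothetical; in practice the vectors would come from an encoder, and the arithmetic is only expected to work if the model encodes group and feature roughly additively.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length latent vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def latent_arithmetic(g1_yes, g1_no, g2_no):
    """Predict the Group 2 'yes' centroid as G1:yes - G1:no + G2:no."""
    a, b, c = centroid(g1_yes), centroid(g1_no), centroid(g2_no)
    return [x - y + z for x, y, z in zip(a, b, c)]


def distance(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def toy_codes(group_shift, feature_shift, n, rng, noise=0.1):
    """Hypothetical 2-D codes: axis 0 encodes the group, axis 1 the feature."""
    return [
        [group_shift + rng.gauss(0.0, noise), feature_shift + rng.gauss(0.0, noise)]
        for _ in range(n)
    ]


rng = random.Random(0)
g1_no = toy_codes(0.0, 0.0, 50, rng)
g1_yes = toy_codes(0.0, 2.0, 50, rng)
g2_no = toy_codes(3.0, 0.0, 50, rng)
g2_yes = toy_codes(3.0, 2.0, 50, rng)  # held out: only used for scoring

predicted = latent_arithmetic(g1_yes, g1_no, g2_no)
lsa_error = distance(predicted, centroid(g2_yes))
# Baseline: how far off we would be if we ignored the feature entirely.
baseline = distance(centroid(g2_no), centroid(g2_yes))
```

Comparing `lsa_error` against `baseline` gives a simple score: the arithmetic is only informative if it lands closer to the held-out centroid than the unshifted Group 2 centroid does.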

As suggested by @jaclyn-taroni, we will use this issue to track important citations and discuss rationale for the simulations and evals.

gwaybio commented 6 years ago

In initial experiments with this kind of simulated data, it has become clear that the simulated features are too simplistic. The evals based on these signals basically boil down to asking whether the models learn simple distributions. An initial test (without "noise" features) shows that the compression algorithms learn these structures rather easily.

[Image: test_heatmap_zeronoise_covariates]

☝️ absolute value correlation matrix of compression features + injected signals
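
A matrix like the one pictured can be computed along these lines. This is a sketch rather than the code behind the figure: `pearson` and `abs_correlation_matrix` are hypothetical helpers, columns are plain lists of floats, and treating a constant column as having zero correlation is a choice made here to avoid division by zero.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def abs_correlation_matrix(columns):
    """Nested dict of |r| for every pair of named columns.

    `columns` maps a name (compression feature or injected signal)
    to its values across samples.
    """
    return {
        a: {b: abs(pearson(col_a, col_b)) for b, col_b in columns.items()}
        for a, col_a in columns.items()
    }
```

Passing the compression features and the injected signals in a single dict, for example `abs_correlation_matrix({**features, **signals})`, gives the combined square matrix. The absolute value is taken because the sign of a learned feature is arbitrary.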

In order to better assess model performance, we need to either 1) use more complicated simulations (like WGCNA or RUVcorr; I don't think we need simulators as complex as BEERS or Polyester) or 2) identify or modify real datasets with known or artificially induced conditions to evaluate the models against.

We settled on this in a discussion with @huqiwen0313 and Nicholas Lahens today. We also decided that the primary eval should be latent space arithmetic (LSA). The module-based structure of WGCNA simulations is currently a promising lead, since we can also assess the ability of the compression algorithms to identify these simulated modules.
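
As a rough illustration of what module-based structure means here, the sketch below generates genes that share a latent per-module profile. It is loosely inspired by the module idea and is not WGCNA's own simulator; the function names, the loading range, and the noise level are all assumptions.

```python
import math
import random


def simulate_modules(n_samples=100, n_modules=3, genes_per_module=5,
                     noise=0.5, seed=0):
    """Return (expression, labels); expression[g] holds one gene across samples."""
    rng = random.Random(seed)
    expression, labels = [], []
    for module in range(n_modules):
        # One latent profile per module, shared by all genes in that module.
        eigengene = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        for _ in range(genes_per_module):
            loading = rng.uniform(0.7, 1.0)
            expression.append(
                [loading * e + rng.gauss(0.0, noise) for e in eigengene]
            )
            labels.append(module)
    return expression, labels


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_correlation(expression, labels, same_module):
    """Average |r| over gene pairs that are (or are not) in the same module."""
    values = [
        abs(pearson(expression[i], expression[j]))
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
        if (labels[i] == labels[j]) == same_module
    ]
    return sum(values) / len(values)
```

Within-module correlation should be clearly higher than between-module correlation in the simulated data. Running the same comparison on features learned by a compression algorithm is one possible way to ask whether it recovers the modules.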

Future updates will impact the recently added simulation script, but the proposed DataModels class should still be valid.
