greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

add scripts to simulate data #103

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

The main function is getSimulatedExpression(). It will sample various types of signals including:

  1. Groups (based on different normal distributions across given axes)
  2. Nonlinear functions (any transformation to a randomly sampled vector)
  3. Cell-types (based on different proportions of normal distributions across given axes)
  4. Presence/absence (random determination if feature is present)
  5. Random noise features
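As a rough illustration of how these five signal types could sit side by side in one matrix, here is a minimal Python sketch. It is not the code from this PR: `simulate_expression`, its parameters, the two-group setup, and every mean and variance below are assumptions chosen only for illustration.

```python
import random


def simulate_expression(n_samples, n_group, n_nonlinear, n_celltype,
                        n_presence, n_noise, seed=0):
    """Hypothetical sketch: build an n_samples x p matrix, one column block per signal type."""
    rng = random.Random(seed)
    # Assumption: two groups, assigned by alternating sample index.
    groups = [i % 2 for i in range(n_samples)]
    # One hidden value per sample that the nonlinear features transform.
    latent = [rng.uniform(-2.0, 2.0) for _ in range(n_samples)]
    # Per-sample proportion of a hypothetical "cell type A".
    proportions = [rng.random() for _ in range(n_samples)]

    matrix = []
    for i in range(n_samples):
        row = []
        # 1. Group features: the group decides the mean of the normal.
        group_mean = 3.0 if groups[i] == 1 else 0.0
        row += [rng.gauss(group_mean, 1.0) for _ in range(n_group)]
        # 2. Nonlinear features: square the hidden value and add small noise.
        row += [latent[i] ** 2 + rng.gauss(0.0, 0.1) for _ in range(n_nonlinear)]
        # 3. Cell-type features: mix two normals by the sample's proportion.
        p = proportions[i]
        row += [p * rng.gauss(5.0, 1.0) + (1.0 - p) * rng.gauss(1.0, 1.0)
                for _ in range(n_celltype)]
        # 4. Presence/absence features: a random 0/1 draw.
        row += [float(rng.random() < 0.5) for _ in range(n_presence)]
        # 5. Random noise features: standard normal, unrelated to anything else.
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        matrix.append(row)
    return matrix, groups
```

For example, `simulate_expression(100, 5, 5, 5, 5, 10)` would return a 100 x 30 matrix plus the group label of each sample, which is the kind of ground truth the evaluations need.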

I am not trying to comprehensively profile all possible signals present in gene expression data. Just a couple that will help with the evaluations I have in mind.

Some of the evals include:

  1. Distinguish groups
  2. Identify nonlinear but continuous patterns
  3. Robust to noise injection
  4. Latent space arithmetic (e.g. Group 1: Yes Feature - Group 1: No Feature + Group 2: No Feature = Group 2: Yes Feature?)
  5. Generative ability of VAE
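The latent space arithmetic check in item 4 can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two-dimensional latent means are made up, and they are constructed so that the feature shifts both groups by the same offset, which is exactly the property the evaluation would test for in a trained model. `latent_arithmetic` is a hypothetical helper, not a function from the repository.

```python
import math


def latent_arithmetic(a_yes, a_no, b_no):
    """Predict the 'yes' vector of group B as a_yes - a_no + b_no, elementwise."""
    return [y - n + b for y, n, b in zip(a_yes, a_no, b_no)]


# Toy latent means (assumed values), built so both groups share one feature offset.
feature_shift = [0.5, 0.5]
group1_no = [0.0, 1.0]
group2_no = [2.0, -1.0]
group1_yes = [g + f for g, f in zip(group1_no, feature_shift)]
group2_yes = [g + f for g, f in zip(group2_no, feature_shift)]

# Group 1: Yes Feature - Group 1: No Feature + Group 2: No Feature
predicted = latent_arithmetic(group1_yes, group1_no, group2_no)

# Euclidean distance between the prediction and the actual Group 2: Yes Feature mean.
error = math.dist(predicted, group2_yes)
```

With real encodings the offset will not be shared exactly, so `error` (or a similar distance) would serve as the score to compare across methods.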

I intend to follow up with later pull requests comparing Tybalt, ADAGE, PCA, ICA, NMF, and a conditional VAE in these evals.

jaclyn-taroni commented 6 years ago

Do you have any citations for any of these approaches @gwaygenomics? I'm thinking it might be nice to keep track of them here.

gwaybio commented 6 years ago

> Do you have any citations for any of these approaches @gwaygenomics? I'm thinking it might be nice to keep track of them here.

Do you mean citations for the types of signals, or for the evals?

Many of the simulation studies for the types of signals (that I could easily find) were trying much more complicated things than what I have here. Do you know of any papers that try something similar?

I agree keeping track of these citations is gonna be important.

jaclyn-taroni commented 6 years ago

> Do you mean citations for the types of signals, or for the evals?

I meant for the types of signals/simulation approaches. I'm interested in the rationale behind these approaches, which may be beyond the scope of this PR. If there are relevant citations, however, that might be a nice concise way of keeping track of this.

> Do you know of any papers that try something similar?

The CellCODE paper comes to mind for the cell type proportion bit, as does the xCell paper.

Also unclear to me (at first glance anyway) if getSimulatedExpression() does all of those types of signals at once or not.

jaclyn-taroni commented 6 years ago

You might also check out WGCNA -- I think there's functionality to simulate gene expression data with co-expression module structure which you may find useful.

gwaybio commented 6 years ago

> I meant for the types of signals/simulation approaches. I'm interested in the rationale behind these approaches, which may be beyond the scope of this PR.

Got it. Yes, this makes sense. I think rationale is entirely within scope. I started issue #104. I think that is a better place for the discussion than a PR (where code is the primary focus).

> The CellCODE paper comes to mind for the cell type proportion bit, as does the xCell paper.

Cool! Yeah from my quick glance these are focused on simulating cell-type proportion and use real data (from purified cell lines and assuming an additive model). I also assume an additive model here but don't use real cell types. I think using real data to simulate proportion could be important for the CZI grant, but I am not yet sure it will be important for next steps with Tybalt. If we decide it is, it will belong in a separate PR.
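The additive model mentioned here can be written as a small Python sketch: each sample's expression is the proportion-weighted sum of per-cell-type profiles. Everything named below (`random_proportions`, `mix_expression`, the shapes of the inputs) is hypothetical and meant only to make the assumption concrete. The profiles are simulated placeholders, not real purified cell-type data.

```python
import random


def random_proportions(n_samples, n_types, seed=0):
    """Draw one proportion vector per sample; each row is non-negative and sums to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Normalized exponential draws are uniformly distributed over the simplex.
        draws = [rng.expovariate(1.0) for _ in range(n_types)]
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows


def mix_expression(proportions, profiles):
    """Additive model: expression[i][j] = sum over k of proportions[i][k] * profiles[k][j]."""
    n_features = len(profiles[0])
    return [
        [sum(p_k * profile[j] for p_k, profile in zip(sample, profiles))
         for j in range(n_features)]
        for sample in proportions
    ]
```

Swapping the simulated `profiles` for measured profiles from purified cell types would leave `mix_expression` unchanged, which is one reason to keep the mixing step separate from how the profiles are produced.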

> Also unclear to me (at first glance anyway) if getSimulatedExpression() does all of those types of signals at once or not.

Based on the given input (including how many samples and how many features of each of the proposed 5 signals), the function is able to output all the intended signals at once. They form different features in the n x p output matrix.

> You might also check out WGCNA -- I think there's functionality to simulate gene expression data with co-expression module structure which you may find useful.

I have checked that out before - it looks slightly more complicated than what I have. For our purposes here, I thought it would actually be easier to write my own (that I have more control over). It will be important to cite, however.

gwaybio commented 6 years ago

Thanks for the comments, @jaclyn-taroni.

> I think I need a little more information/documentation before I can critically evaluate this @gwaygenomics -- want to make sure I'm following

I bolstered the documentation. Does it make more sense now?