greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License
162 stars, 62 forks

Adding Simulation Evaluation Framework #112

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

Related to #103

This completely replaces the old method of simulating data; the simulation now uses WGCNA's simulateDatExpr.

I've also added four evaluations:

  1. Mean ranks of module genes assigned to compressed features (modules have a decreasing number of ground-truth genes)
  2. Mean rank of the list in 1 (excluding the noise module ranks)
  3. Euclidean distance reconstruction cost following latent space arithmetic (LSA): compute C_hat = A - B + D, then dist(C, decoder(C_hat)), where A, B, and D are the mean latent space encodings of three groups of samples. Groups B and D lack gene module 2; groups A and C have it.
  4. LSA isolation of the compressed features that detect module 2 in the samples above. In effect, this tests whether subtracting B from A isolates the "essence" of gene module 2.
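To make evaluations 1, 3, and 4 concrete, here is a minimal NumPy sketch. It is not the code in this PR: the toy linear `encode`/`decode` pair, the module layout, and every function name are assumptions for illustration, and comparing the decoded C_hat against the mean expression profile of group C is one possible reading of dist(C, decoder(C_hat)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names hypothetical): 15 genes in three modules of 5 genes each.
n_genes = 15
modules = {"m1": slice(0, 5), "m2": slice(5, 10), "m3": slice(10, 15)}

# Stand-in for a trained model: one latent feature per module, unit-norm weights.
weights = np.zeros((n_genes, 3))
for k, genes in enumerate(modules.values()):
    weights[genes, k] = 1 / np.sqrt(5)


def encode(x):
    return x @ weights


def decode(z):
    return z @ weights.T


def make_group(active, n_samples=50):
    """Simulate samples with low noise plus signal in the active modules."""
    x = rng.normal(0.0, 0.1, size=(n_samples, n_genes))
    for name in active:
        x[:, modules[name]] += 1.0
    return x


# Groups A and C carry module 2 ("m2"); groups B and D lack it.
A = make_group(["m1", "m2"])
B = make_group(["m1"])
C = make_group(["m3", "m2"])
D = make_group(["m3"])


def mean_module_rank(w, genes):
    """Eval 1 (assumed reading): rank genes per feature by |weight| (1 = largest)
    and return the best mean rank of the module's genes across features."""
    ranks = (-np.abs(w)).argsort(axis=0).argsort(axis=0) + 1
    return float(ranks[genes].mean(axis=0).min())


def lsa_reconstruction_cost(a, b, c, d):
    """Eval 3: Euclidean distance between mean(C) and decoder(A - B + D)."""
    c_hat = encode(a).mean(axis=0) - encode(b).mean(axis=0) + encode(d).mean(axis=0)
    return float(np.linalg.norm(c.mean(axis=0) - decode(c_hat)))


def lsa_isolated_feature(a, b):
    """Eval 4: latent feature with the largest |mean(A) - mean(B)|."""
    diff = encode(a).mean(axis=0) - encode(b).mean(axis=0)
    return int(np.argmax(np.abs(diff)))


rank = mean_module_rank(weights, modules["m2"])
cost = lsa_reconstruction_cost(A, B, C, D)
feature = lsa_isolated_feature(A, B)
print(rank, cost, feature)
```

In this toy case the subtraction A - B cancels module 1 and leaves module 2, so the isolated feature is the one whose weights sit on the module 2 genes, and the reconstruction cost is small because A - B + D recovers the module 2 plus module 3 signal that defines group C.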

I have also added a verbose argument to the models class which will control how training metrics are output.

b2f2bba partially addresses #13

This is only the framework; results will come in a future PR!

gwaybio commented 6 years ago

Also tagging @huqiwen0313 for possible thoughts on simulation evals

gwaybio commented 6 years ago

Thanks @jaclyn-taroni and @danich1 !

I have updated the commits based on your comments. However, it is not yet ready for re-review: I believe I found a bug in the simulated data script that I will need to test when I get back. I will let you know when it's ready again!

gwaybio commented 6 years ago

Alright! I think I have addressed the previous error in fa194fb3baba2cd5ec856b627c927ab97ad57456 (specifically in lines 108-109). Before, I was simulating only 3 sample sets (by row). This would produce signal, but not exactly what I had intended.

Previously, the sample sets were used to generate the eigen_samples matrix. Generating an eigen matrix with 5 gene modules but only 3 "sample_sets" resulted in this sampling:

| | Gene Module 1 | GM2 | GM3 | GM4 | GM5 |
|---|---|---|---|---|---|
| Sample Set | A | B | C | A | B |
| | C | A | B | C | A |
| | B | C | A | B | C |

Instead, with 5 "sample_sets":

| | Gene Module 1 | GM2 | GM3 | GM4 | GM5 |
|---|---|---|---|---|---|
| Sample Set | A | B | C | D | E |
| | A | B | C | D | E |
| | A | B | C | D | E |

Here, the rows of eigen_samples correspond to samples and the columns correspond to gene modules.
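The two layouts can be reproduced with a small sketch. This is not the simulation script itself: it assumes the pattern comes from recycling the sample-set labels row by row across the matrix, and `label_layout` is a hypothetical helper that works on labels only, not on the simulated eigen values.

```python
import numpy as np


def label_layout(sample_sets, n_rows, n_modules):
    """Recycle sample-set labels row by row over an (n_rows, n_modules) grid."""
    # np.resize repeats the input as often as needed to fill the requested size.
    cycled = np.resize(np.array(sample_sets), n_rows * n_modules)
    return cycled.reshape(n_rows, n_modules)


# 3 sample sets over 5 gene modules: labels drift across columns from row to row.
three = label_layout(list("ABC"), n_rows=3, n_modules=5)

# 5 sample sets over 5 gene modules: each module column keeps a single label.
five = label_layout(list("ABCDE"), n_rows=3, n_modules=5)

print(three)
print(five)
```

With three labels, no column is tied to a single sample set, which matches the first table. With five labels, every gene module column maps to exactly one sample set, which is the intended design.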

Additionally, I added a couple of figures in 828333a that describe an example of the simulated data.

@danich1 @jaclyn-taroni - Ready for re-review! Thanks! (The results of the sweep will be added in a future pull request)

jaclyn-taroni commented 6 years ago

@gwaygenomics is there a row and column of missing values in figures/example_simulated_data.png or is there just something wonky with the graphics that would be remedied by changing the size of the plot?

gwaybio commented 6 years ago

> is there a row and column of missing values in figures/example_simulated_data.png or is there just something wonky with the graphics that would be remedied by changing the size of the plot?

Just wonky 💩 - I will fix before merging

gwaybio commented 6 years ago

Same comment as before, looks like @danich1 agrees

> I approve. Just make sure you add an exception statement in the reconstruct_group function.

My bad, I must have just missed that one. Fixed in aa754c2
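The actual change in aa754c2 is not shown in this thread. Purely as an illustration of the kind of guard requested in review, a sketch might look like the following. The signature, the dictionary of latent means, and the choice of ValueError are all assumptions, not the repository's code.

```python
import numpy as np


def reconstruct_group(latent_means, group_a, group_b, group_d):
    """Hypothetical sketch: return mean(A) - mean(B) + mean(D) in latent space.

    latent_means is assumed to map group names to mean latent encodings.
    """
    for name in (group_a, group_b, group_d):
        if name not in latent_means:
            # Fail early with a clear message instead of a bare KeyError later.
            raise ValueError(f"Unknown sample group: {name!r}")
    return latent_means[group_a] - latent_means[group_b] + latent_means[group_d]


means = {
    "A": np.array([1.0, 1.0, 0.0]),
    "B": np.array([1.0, 0.0, 0.0]),
    "D": np.array([0.0, 0.0, 1.0]),
}
c_hat = reconstruct_group(means, "A", "B", "D")

try:
    reconstruct_group(means, "A", "B", "Z")
    raised = False
except ValueError:
    raised = True
```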