greenelab / shared-latent-space

Shared Latent Space VAE's

Evaluation #21

Open chrsunwil opened 6 years ago

chrsunwil commented 6 years ago

How do we evaluate whether the model is learning the common biology between the two domains?

gwaybio commented 6 years ago

Now seems like a good time to start performing some evaluations. One (potentially) semi-quick evaluation that will help us decide next steps comes to mind.

  1. Randomly select x samples from your testing set that are wild-type (y = 0) for TP53 (Entrez ID: 7157). (For our purposes, let's say x = 100, but this can be modified.)
  2. Push these samples through the shared VAE into expression space and save the results. Ask the question: what is the MSE of these samples? Save the distribution of MSE results.
  3. Now, "induce" a TP53 mutation in these same samples. (Set feature column 7157 to 1 in these samples.)
  4. Push the induced TP53 mutants through the shared VAE into expression space and save the results. Again, ask the question: what is the MSE of these samples?
  5. Compare the two distributions of MSE values with a paired t-test. This script may be helpful.
  6. Identify the differentially expressed genes between the two RNAseq values (induced reconstructed vs. reconstructed). This script may be helpful.
  7. Run a global differential expression analysis (same as the above script) in all samples with 7157 (TP53) wildtype vs. control.
  8. Create a scatterplot where the points are genes, the x axis is the true observed differential expression, and the y axis is the induced differential expression, and output the Pearson correlation.

I think that this procedure will demonstrate how well the shared latent space is capturing shared biology between the two domains. I think it would be useful to code the scripts in such a way that the same procedure can be run with genes other than 7157.
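
Steps 1 through 5 could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `translate` is a hypothetical callable standing in for the shared VAE's mutation-to-expression pass, the mutation and expression data are assumed to be NumPy arrays with samples as rows, and `gene_col` is the column index of the gene of interest (for example, the column holding 7157). None of these names come from the repository.

```python
import numpy as np
from scipy import stats


def induce_status(mutations, gene_col, value=1):
    """Return a copy of the mutation matrix with one gene column set to `value`."""
    induced = mutations.copy()
    induced[:, gene_col] = value
    return induced


def per_sample_mse(predicted, observed):
    """Mean squared error of each sample (row) across all genes (columns)."""
    return np.mean((predicted - observed) ** 2, axis=1)


def compare_induced_mse(translate, mutations, expression, gene_col,
                        n_samples=100, seed=0):
    """Sample wild-type rows, induce the mutation, and compare per-sample MSEs.

    `translate` is a hypothetical stand-in for the model: it maps a mutation
    matrix to predicted expression with the same number of rows.
    """
    rng = np.random.default_rng(seed)
    wild_type = np.flatnonzero(mutations[:, gene_col] == 0)
    picked = rng.choice(wild_type, size=min(n_samples, wild_type.size),
                        replace=False)

    original = mutations[picked]
    induced = induce_status(original, gene_col, value=1)

    mse_original = per_sample_mse(translate(original), expression[picked])
    mse_induced = per_sample_mse(translate(induced), expression[picked])

    # Paired test: the same samples appear in both conditions.
    t_stat, p_value = stats.ttest_rel(mse_original, mse_induced)
    return mse_original, mse_induced, t_stat, p_value
```

Because the gene column and the induced value are parameters, the same helper could in principle be reused for other genes, or for the reverse direction by selecting mutant rows and passing `value=0`.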

gwaybio commented 6 years ago

Another note: let's do this procedure with x training samples as well. It may also be good to induce TP53 wild-type status (go from 7157 = 1 to 7157 = 0) and repeat the procedure.

I think coding the analysis to behave on any input gene in either direction will be important.

chrsunwil commented 6 years ago

7. Run a global differential expression analysis (same as the above script) in all samples with 7157 (TP53) wildtype vs. control.

Is control just all of the examples?

8. Create a scatterplot where the points are genes and the x axis is true observed differential expression and the y axis is induced differential expression - and output the Pearson correlation.

So the y axis is the result from step 6 and the x axis is the result from step 7?

gwaybio commented 6 years ago

Is control just all of the examples?

Yeah, let's do that. This may help to visualize.

So the y axis is the result from step 6 and the x axis is the result from step 7?

Yes
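
The comparison in step 8 could look something like the sketch below. It makes a big simplifying assumption: "differential expression" is reduced to a per-gene difference in mean expression, which is not what the linked script does. The function names and the `group_mask` argument are hypothetical; `group_mask` marks the samples compared against the control, with the control being all samples as agreed above.

```python
import numpy as np


def mean_difference(group_a, group_b):
    """Per-gene difference in mean expression (rows are samples, columns are genes)."""
    return group_a.mean(axis=0) - group_b.mean(axis=0)


def induced_vs_observed(recon, recon_induced, expression, group_mask):
    """Return the x values, y values, and Pearson r for the gene scatterplot.

    recon, recon_induced: model output for the same samples before and after
        inducing the mutation (step 6, y axis).
    expression, group_mask: observed expression for all samples and a boolean
        mask selecting the group compared against all samples (step 7, x axis).
    """
    induced_de = mean_difference(recon_induced, recon)
    observed_de = mean_difference(expression[group_mask], expression)
    pearson_r = np.corrcoef(observed_de, induced_de)[0, 1]
    return observed_de, induced_de, pearson_r
```

The two returned vectors have one entry per gene, so they can be passed directly to a plotting call such as `matplotlib.pyplot.scatter(observed_de, induced_de)`.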

chrsunwil commented 6 years ago

Do you have any suggestions for a Python equivalent to lmFit, as in

fit <- lmFit(t(rnaseq_df[, 2:ncol(rnaseq_df)]), ras_design)

I was looking at https://lmfit.github.io/lmfit-py/model.html

I suppose I'm still a little confused about:

Identify the differentially expressed genes between the two RNAseq values

How do I actually calculate the differentially expressed genes? I was having some trouble following your linked script.
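
For context, limma's `lmFit` fits an ordinary linear model to each gene separately, whereas lmfit-py is a non-linear curve-fitting library, so the two are not equivalent. One possible way to approximate the gene-wise fit in Python is plain least squares with NumPy, sketched below. The function name is hypothetical, the design matrix is assumed to be full rank, and the arrays are assumed to have samples as rows (the R call above transposes so that genes are rows). This sketch does not include limma's empirical Bayes moderation, so the t-statistics will differ from what `eBayes` would report.

```python
import numpy as np


def fit_genewise_linear_models(expression, design):
    """Fit one ordinary least-squares model per gene.

    expression: samples x genes array.
    design: samples x coefficients array (e.g. an intercept column plus a
        0/1 mutation-status column); assumed to be full rank.
    Returns (coefficients, t_statistics), each coefficients x genes.
    """
    # lstsq solves all genes at once when the right-hand side is a matrix.
    coef, _, _, _ = np.linalg.lstsq(design, expression, rcond=None)

    residuals = expression - design @ coef
    dof = design.shape[0] - np.linalg.matrix_rank(design)
    sigma2 = (residuals ** 2).sum(axis=0) / dof          # per-gene residual variance

    # Standard error of each coefficient for each gene.
    unscaled = np.diag(np.linalg.inv(design.T @ design))
    std_err = np.sqrt(np.outer(unscaled, sigma2))
    return coef, coef / std_err
```

With a design of `[1, status]`, the second row of the returned coefficients is the estimated expression difference per gene between the two status groups, and genes could be ranked by the matching row of t-statistics.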

chrsunwil commented 6 years ago

I talked with Yoson, Nandita, and Casey, and I think I now know what I should do.