greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

ERROR: VAE Model reconstruct the gene expression data #153

Closed: Sithara85 closed this issue 2 years ago

Sithara85 commented 3 years ago

Hi Gregory,

I am very happy to see how thoroughly you have detailed the steps for the gene expression VAE model. I am working on a gene expression prediction model to classify dementia. I have started learning how VAEs and other machine learning models are applied to omics prediction, and I am also new to TensorFlow/Keras.

I successfully ran your code on your gene expression data after disabling the eager execution to make your program work in TensorFlow 2. But when we use our gene expression data (which is log2 CPM normalized data), I am getting all the reconstructed values as 1.0, so my gene_mean and gene_summary remain the same. I checked the distribution of your data (it looks like gene expression values in the range of 0-1).

Could you let me know if you can think of any issue with my input data shape?

Input:

Dimension: (3045, 10956)

Data:

XXbac-BPG248L24.12 TTN RP11-290D2.6 JSRP1 RP11-115D19.1 HCG4P5 AC114271.2 RP3-394A18.1 ABALON KB-1208A12.3 ... LNPK NBPF15 ATP8B4 AC005522.7 CHID1 ARFRP1 NAPB CTB-133G6.2 SPATA24 POU2F2
12.988314 12.83923 12.576047 12.136305 11.978625 12.494600 12.583452 12.415211 10.807796 11.77091 ... 3.774915 3.350120 3.774915 2.310416 1.684627 2.745442 3.350120 2.745442 2.310416 3.350120
12.641275 12.56024 12.252506 12.576744 12.883745 12.327777 11.751295 11.776363 12.345054 11.64166 ... 2.240370 2.953526 3.618031

input_rnaseq_reconstruct.head(2):

XXbac-BPG248L24.12 TTN RP11-290D2.6 JSRP1 RP11-115D19.1 HCG4P5 AC114271.2 RP3-394A18.1 ABALON KB-1208A12.3 ... LNPK NBPF15 ATP8B4 AC005522.7 CHID1 ARFRP1 NAPB CTB-133G6.2 SPATA24 POU2F2
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0  

gene_summary:

gene mean gene abs(sum)
11.867037 11.867037
11.774692 11.774692
11.622958 11.622958
11.489614 11.489614
11.474622 11.474622
Sithara85 commented 3 years ago

I just tried a trick on the data. Since all the gene expression values in Gregory's dataset range from 0-1, I multiplied my log2 CPM data by 0.01 so that all values fall within 0-1. Now I have better results to visualize, so I am wondering where the input tensors are set to range from 0-1.
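
A possible explanation for why the 0.01 rescaling helps, sketched under an assumption that is not confirmed anywhere in this thread: that the decoder ends in a sigmoid and is trained with a binary cross-entropy reconstruction loss. Under that assumption, a target above 1 makes the loss keep falling as the output approaches 1, so every reconstructed value saturates. The helper names `bce` and `best_output` below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def bce(target, p):
    """Binary cross-entropy for one value; only a proper loss when 0 <= target <= 1."""
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))


# Candidate sigmoid outputs strictly inside (0, 1).
grid = np.linspace(0.001, 0.999, 999)


def best_output(target):
    """Grid value that minimizes the loss for this target."""
    return float(grid[np.argmin(bce(target, grid))])


raw = 12.0           # roughly the size of the log2 CPM values in the input above
scaled = raw * 0.01  # the same value after the 0.01 rescaling

print("best output for raw target:   ", best_output(raw))
print("best output for scaled target:", best_output(scaled))
```

For the raw target the minimum sits at the top of the grid, which would match the all-1.0 reconstructions; for the scaled target it sits near the target itself.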

Sithara85 commented 3 years ago

New input_rnaseq_reconstruct:

XXbac-BPG248L24.12 TTN RP11-290D2.6 JSRP1 RP11-115D19.1 HCG4P5 AC114271.2 RP3-394A18.1 ABALON KB-1208A12.3 ... LNPK NBPF15 ATP8B4 AC005522.7 CHID1 ARFRP1 NAPB CTB-133G6.2 SPATA24 POU2F2
0.114204 0.115338 0.114985 0.106858 0.102436 0.109176 0.109893 0.109055 0.09779 0.107059 ... 0.022572 0.020735 0.022362 0.020162 0.021934 0.019642 0.021585 0.020372 0.019506 0.020419
0.106944 0.105709 0.105965 0.104501 0.104928 0.102605 0.097524 0.097808 0.10184 0.098146 ... 0.021987 0.020497 0.019832 0.018447 0.018818 0.019570 0.018344 0.020552 0.018087 0.020798

gene_summary:

gene mean gene abs(sum)
0.004425 0.027955
0.004736 0.023202
0.004456 0.022163
0.004741 0.020509
0.004347 0.017307
gwaybio commented 3 years ago

Thanks @Sithara85 - a couple things:

> after disabling the eager execution to make your program work in Tensorflow 2.

Can you elaborate what your solution was? Perhaps others will see this and will be interested in knowing exactly what you had to change.

> But when we use our gene expression data (which is log 2 cpm normalized data), I am getting all the reconstructed values as 1.0

The input data need to be normalized further to be in the range of 0-1. See process-data.ipynb for specific details.
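
As a rough illustration of what scaling to 0-1 can look like, here is a minimal sketch that rescales each gene (column) to the 0-1 range across samples (rows). It is not copied from process-data.ipynb; the function name `min_max_scale_genes`, the toy values, and the gene labels are made up for the example, and the choice to map constant genes to 0 is an assumption.

```python
import pandas as pd


def min_max_scale_genes(expr: pd.DataFrame) -> pd.DataFrame:
    """Rescale every gene (column) to [0, 1] across samples (rows)."""
    col_min = expr.min(axis=0)
    col_range = expr.max(axis=0) - col_min
    # A constant gene has zero range; divide by 1 instead so it maps to 0.
    col_range = col_range.replace(0, 1.0)
    return (expr - col_min) / col_range


# Toy log2 CPM matrix: 3 samples x 3 hypothetical genes.
toy = pd.DataFrame(
    {
        "GENE_A": [12.0, 10.0, 11.0],
        "GENE_B": [3.0, 2.0, 4.0],
        "GENE_C": [5.0, 5.0, 5.0],  # constant gene
    },
    index=["s1", "s2", "s3"],
)

scaled = min_max_scale_genes(toy)
print(scaled)
```

Unlike multiplying the whole matrix by a constant such as 0.01, per-gene scaling stretches each gene over the full 0-1 range, so lowly and highly expressed genes contribute on the same scale.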

> Could you let me know if you can think of any issue with my input data shape. Dimension: (3045, 10956)

I recommend reducing the number of gene features you're using. In process-data.ipynb, you will also see that we reduced gene dimensions by selecting the top 5,000 most variably expressed genes.
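
A minimal sketch of that kind of feature selection, assuming a samples x genes DataFrame. The variability measure used here is plain variance, which is an assumption made for the example; check process-data.ipynb for the measure actually used. The function name `top_variable_genes` and the toy data are hypothetical.

```python
import pandas as pd


def top_variable_genes(expr: pd.DataFrame, n_genes: int) -> pd.DataFrame:
    """Keep the n_genes columns with the highest variance across samples."""
    variability = expr.var(axis=0)
    keep = variability.sort_values(ascending=False).index[:n_genes]
    return expr.loc[:, keep]


# Toy matrix: 4 samples x 4 hypothetical genes with clearly different spread.
toy_expr = pd.DataFrame(
    {
        "GENE_A": [0.0, 10.0, 0.0, 10.0],  # most variable
        "GENE_B": [5.0, 5.0, 5.0, 5.0],    # constant
        "GENE_C": [1.0, 2.0, 1.0, 2.0],
        "GENE_D": [0.0, 4.0, 0.0, 4.0],    # second most variable
    }
)

subset = top_variable_genes(toy_expr, n_genes=2)
print(list(subset.columns))
```

On a real matrix such as the (3045, 10956) input above, the call would be `top_variable_genes(df, n_genes=5000)`, applied before the 0-1 scaling step.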

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.