ERROR: VAE Model reconstruct the gene expression data

Sithara85 commented 3 years ago

Hi Gregory,

Thank you for documenting the steps of your VAE-based gene expression model in such detail. I am working on a gene expression prediction model to classify dementia, and I have just started learning how VAEs and other machine learning models are applied to omics prediction. I am also new to TensorFlow/Keras.

I successfully ran your code on your gene expression data after disabling eager execution so that it works in TensorFlow 2. But when I use our own gene expression data (log2 CPM normalized), all the reconstructed values come out as 1.0, so my gene_mean and gene_summary remain the same. I also looked at the distribution of your data (the gene expression values appear to be in the range 0-1).

Could you let me know if you can think of any issue with my input data shape.

Input:

Dimension: (3045, 10956)

Data:

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
12.988314	12.83923	12.576047	12.136305	11.978625	12.494600	12.583452	12.415211	10.807796	11.77091	...	3.774915	3.350120	3.774915	2.310416	1.684627	2.745442	3.350120	2.745442	2.310416	3.350120
12.641275	12.56024	12.252506	12.576744	12.883745	12.327777	11.751295	11.776363	12.345054	11.64166	...	2.240370	2.953526	3.618031

input_rnaseq_reconstruct.head(2):

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0

gene_summary:

gene mean	gene abs(sum)
11.867037	11.867037
11.774692	11.774692
11.622958	11.622958
11.489614	11.489614
11.474622	11.474622
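
The two columns above are identical, which is what one would expect if gene_summary is built from the difference between the input and its reconstruction: when every reconstructed value is 1.0 and every input value is larger than 1, all differences are positive, so the mean difference equals the mean absolute difference. The sketch below illustrates this with toy values. The way gene_summary is computed here (per-gene mean of the difference, and per-gene sum of absolute differences divided by the number of samples) is an assumption about the notebook, and the toy DataFrames are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input: 3 samples x 2 genes, log2 CPM values well above 1.
rnaseq_df = pd.DataFrame(
    {"TTN": [12.8, 12.6, 12.1], "JSRP1": [12.1, 12.6, 11.9]}
)

# A reconstruction stuck at 1.0 everywhere, as in the output above.
reconstruct_df = pd.DataFrame(
    1.0, index=rnaseq_df.index, columns=rnaseq_df.columns
)

# Assumed definition of the summary columns (not confirmed by this thread).
diff = rnaseq_df - reconstruct_df
gene_mean = diff.mean(axis=0)
gene_abssum = diff.abs().sum(axis=0) / rnaseq_df.shape[0]

gene_summary = pd.DataFrame(
    {"gene mean": gene_mean, "gene abs(sum)": gene_abssum}
)
print(gene_summary)
```

Under these assumptions both columns come out equal to the per-gene mean of the input minus 1, so identical columns are a symptom of the saturated reconstruction rather than of the input shape.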

Sithara85 commented 3 years ago

I just tried a trick on the data. Since Gregory's dataset has all gene expression values in the range 0-1, I multiplied my log2 CPM data by 0.01 so that all values fall in the range 0-1. Now I get better results to visualize, so I am wondering where the input tensors are set to range from 0-1.
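
A likely explanation for the 0-1 requirement, assuming the decoder's final layer uses a sigmoid activation (an assumption, not something stated in this thread): a sigmoid can only output values between 0 and 1, so a target such as 12.988314 is unreachable and the closest the network can get is 1.0. The NumPy sketch below shows the bound; the helper names `sigmoid` and `logit` are illustrative only.

```python
import numpy as np

def sigmoid(x):
    """Logistic function; its output never exceeds 1."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Inverse of the sigmoid, defined only for 0 < p < 1."""
    return np.log(p / (1.0 - p))

# However large the pre-activation gets, the output saturates at 1.
pre_activations = np.array([-5.0, 0.0, 5.0, 50.0])
outputs = sigmoid(pre_activations)
print(outputs)

raw_target = 12.988314               # a log2 CPM value from the input above
scaled_target = raw_target * 0.01    # the 0.01 rescaling tried in this comment

# The raw target is above every achievable output.
print(outputs.max() < raw_target)

# The scaled target lies inside (0, 1), so some pre-activation reproduces it.
recovered = sigmoid(logit(scaled_target))
print(recovered)
```

This would explain why multiplying by 0.01 helps: it does not change the model, it only moves the targets into the interval the output layer can reach.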

Sithara85 commented 3 years ago

New input_rnaseq_reconstruct:

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
0.114204	0.115338	0.114985	0.106858	0.102436	0.109176	0.109893	0.109055	0.09779	0.107059	...	0.022572	0.020735	0.022362	0.020162	0.021934	0.019642	0.021585	0.020372	0.019506	0.020419
0.106944	0.105709	0.105965	0.104501	0.104928	0.102605	0.097524	0.097808	0.10184	0.098146	...	0.021987	0.020497	0.019832	0.018447	0.018818	0.019570	0.018344	0.020552	0.018087	0.020798

New gene_summary:

gene mean	gene abs(sum)
0.004425	0.027955
0.004736	0.023202
0.004456	0.022163
0.004741	0.020509
0.004347	0.017307

gwaybio commented 3 years ago

Thanks @Sithara85 - a couple things:

> after disabling the eager execution to make your program work in Tensorflow 2.

Can you elaborate what your solution was? Perhaps others will see this and will be interested in knowing exactly what you had to change.

> But when we use our gene expression data (which is log 2 cpm normalized data), I am getting all the reconstructed values as 1.0

The input data need to be normalized further to be in the range of 0-1. See process-data.ipynb for specific details.
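
For readers who want a starting point, here is a minimal sketch of one common way to bring each gene into the range 0-1: per-gene min-max scaling with pandas. It is not necessarily the exact procedure in process-data.ipynb, and the function name `minmax_scale_genes` and the toy values are hypothetical.

```python
import pandas as pd

def minmax_scale_genes(df):
    """Scale every column (gene) to [0, 1] across samples."""
    gene_min = df.min(axis=0)
    gene_range = df.max(axis=0) - gene_min
    # Constant genes have zero range; divide by 1 so they become 0, not NaN.
    gene_range = gene_range.replace(0, 1)
    return (df - gene_min) / gene_range

# Hypothetical toy input: 3 samples x 3 genes of log2 CPM values.
log2cpm_df = pd.DataFrame(
    {
        "TTN": [12.8, 12.6, 11.9],
        "LNPK": [3.8, 2.2, 3.0],
        "FLAT": [5.0, 5.0, 5.0],   # a gene with no variation
    }
)

scaled_df = minmax_scale_genes(log2cpm_df)
print(scaled_df)
```

Unlike multiplying everything by a constant such as 0.01, scaling each gene separately uses the full 0-1 range for every gene, so lowly and highly expressed genes contribute on a comparable scale.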

> Could you let me know if you can think of any issue with my input data shape. Dimension: (3045, 10956)

I recommend reducing the number of gene features you're using. In process-data.ipynb, you will also see that we reduced gene dimensions by selecting the top 5,000 most variably expressed genes.
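
A minimal sketch of this kind of feature selection, assuming genes are ranked by median absolute deviation (MAD) across samples. The choice of MAD as the variability measure, the function name `top_variable_genes`, and the toy data are assumptions for illustration; check process-data.ipynb for the measure actually used.

```python
import pandas as pd

def top_variable_genes(df, n_genes):
    """Keep the n_genes columns with the largest median absolute deviation."""
    deviations = (df - df.median(axis=0)).abs()
    mad = deviations.median(axis=0)
    keep = mad.sort_values(ascending=False).index[:n_genes]
    return df.loc[:, keep]

# Hypothetical toy input: 5 samples x 4 genes with very different spreads.
expr_df = pd.DataFrame(
    {
        "A": [0, 0, 0, 0, 0],        # constant gene
        "B": [0, 1, 2, 3, 4],        # moderate spread
        "C": [0, 10, 20, 30, 40],    # large spread
        "D": [5, 5, 5, 5, 6],        # nearly constant gene
    }
)

subset_df = top_variable_genes(expr_df, n_genes=2)
print(list(subset_df.columns))
```

For a matrix like the one in this issue, the call would be `top_variable_genes(df, n_genes=5000)`, applied before scaling the values to 0-1.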

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

greenelab / tybalt

ERROR: VAE Model reconstruct the gene expression data #153
