rvinas / GTEx-imputation

Gene Expression Imputation with Generative Adversarial Imputation Nets
MIT License

Data in gtex_generator.py #8

Open decortja opened 3 years ago

decortja commented 3 years ago

Hi Ramon,

Ran into an issue running your code -- any insight would be greatly appreciated.

In gtex_generator.py, you read in a file called 'GTEX_data.csv' for the GTEx file, and another file of annotations called 'GTEx_Analysis_2017-06-05_v8_Annotations_SubjectPhenotypesDS.txt'. The latter is straight from the GTEx website, but where is GTEX_data.csv from? How is the data structured/what are its contents?

I ask because I tried to run your code yesterday using the GTEx TPM file and got the error below. The error occurs, of course, because the GTEx TPM file has no "tissue" column. Hence my confusion about the contents and structure of GTEX_data.csv.

Thanks again, Joe

  File "/jdecorte/GTEx-imputation-main/imputation.py", line 87, in <module>
    model, generator = train(config)
  File "/jdecorte/GTEx-imputation-main/imputation.py", line 19, in train
    generator = get_generator(config.dataset)(pathway=config.pathway,
  File "/jdecorte/GTEx-imputation-main/data/gtex_generator.py", line 38, in __init__
    x, gene_symbols, self.sample_ids, self.tissues = GTEx(file)
  File "/jdecorte/GTEx-imputation-main/data/gtex_generator.py", line 12, in GTEx
    tissues = df['tissue'].values
  File "/jdecorte/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/jdecorte/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'tissue'

decortja commented 3 years ago

Additionally, I could not find the PMI code explicitly reported anywhere. You mentioned in the paper that PMI worked particularly well for inductive imputing -- is PMI the inductive imputer you posted?

rvinas commented 3 years ago

Hi @decortja, thanks for reaching out.

The GTEX_data.csv file contains a pre-processed matrix of gene expression values, where rows correspond to samples and columns correspond to genes. You are right - we appended an extra column "tissue" with the tissue type of each sample.
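As a rough illustration of that layout (a sketch with toy values, not the actual pre-processing script), the snippet below turns a genes-by-samples matrix into the samples-by-genes shape and appends a "tissue" column. The gene names, sample IDs, and the `sample_tissue` lookup are all hypothetical; where the tissue labels come from in practice is an assumption left open here.

```python
import pandas as pd

# Toy expression matrix in a genes-by-samples layout (values are made up).
tpm = pd.DataFrame(
    {"GTEX-A-0001": [1.2, 0.0, 5.1], "GTEX-B-0002": [0.8, 2.3, 4.4]},
    index=["GENE1", "GENE2", "GENE3"],
)

# Hypothetical sample-to-tissue lookup; in practice this would have to be
# built from whatever sample annotation table is available.
sample_tissue = pd.Series({"GTEX-A-0001": "Lung", "GTEX-B-0002": "Liver"})

# Transpose so that rows are samples and columns are genes,
# then append the extra "tissue" column.
df = tpm.T.copy()
df["tissue"] = sample_tissue.reindex(df.index)

# This is the lookup that raised KeyError in the traceback above.
tissues = df["tissue"].values
print(df)
```

With a file shaped like this, `df['tissue']` resolves and the remaining columns can be treated as the expression matrix.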

For your information, we pre-processed the data as follows:

a. Genes were selected based on expression thresholds of >=0.1 TPM in >=20% of samples and >=6 reads (unnormalized) in >=20% of samples.
b. Read counts were normalized between samples using TMM (Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25 (2010)).
c. Expression values for each gene were inverse normal transformed.
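Steps (a) and (c) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script used for the paper: the function names are hypothetical, the inverse normal transform is assumed to be the rank-based variant with a Blom offset (c = 3/8), ties are broken by order of appearance, and the TMM step (b) is omitted because it is typically done with edgeR's calcNormFactors in R.

```python
import numpy as np
from statistics import NormalDist


def filter_genes(tpm, counts, min_tpm=0.1, min_reads=6, min_frac=0.2):
    """Return a boolean mask over genes (columns) that pass both thresholds
    in at least `min_frac` of the samples (rows)."""
    tpm_ok = (tpm >= min_tpm).mean(axis=0) >= min_frac
    reads_ok = (counts >= min_reads).mean(axis=0) >= min_frac
    return tpm_ok & reads_ok


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform of one gene across samples.
    Assumption: Blom offset c = 3/8; ties are broken by order of appearance."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Ordinal ranks 1..n (a stable sort keeps tied values in input order).
    ranks = np.empty(n)
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    # Map ranks to quantiles strictly inside (0, 1), then to normal scores.
    quantiles = (ranks - c) / (n - 2.0 * c + 1.0)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])


# Toy data: 5 samples (rows) x 2 genes (columns); values are made up.
tpm = np.array([[0.5, 0.0], [0.2, 0.0], [0.0, 0.0], [1.0, 0.05], [0.3, 0.0]])
counts = np.array([[10, 0], [8, 1], [0, 0], [20, 2], [7, 0]])

keep = filter_genes(tpm, counts)            # only the first gene passes
z = inverse_normal_transform(tpm[:, 0])     # normal scores for that gene
print(keep, z)
```

The transform is applied per gene, so each retained column ends up with approximately standard normal values regardless of its original scale.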

Regarding the second question - that is correct. We should probably refactor the code to make it clear that the inductive imputer in this repo corresponds to PMI in the manuscript.

I hope this is helpful.

decortja commented 3 years ago

Thanks! Very helpful. Closing but I have another issue with the PMI imputation for which I'll start a new thread.

decortja commented 3 years ago

Hey @rvinas! Commenting here because it relates to your normalization strategy above --

TMM: I wanted to verify that you filtered based on TPM >= 0.1 and reads >=6 before TMM-normalizing the libraries (not after)? Regardless, a bit confused on what values you're using after this -- To my knowledge, TMM just normalizes the library sizes (see here). Did you transform these into TPMs, CPMs, or RPKMs after TMM-library normalization? Or am I misinterpreting how TMM was used here?

INT: I assume you were inverse-normalizing the read counts and not TPMs? Do you know a good inverse-normalization tool in R or Python? I've found some code written up on Biostars, but figured I'd save the trouble if you knew off-hand. No worries if not.