jvivian / gene-outlier-detection

A Bayesian model for identifying gene expression outliers for individual single samples (N-of-1) when compared to a cohort of background datasets.

Expression Metrics and Concordance with Datasets #68

Closed eyzhao closed 4 years ago

eyzhao commented 4 years ago

Thank you for developing this useful package. I have two questions:

  1. The input metric appears to be TPM, both for the GTEx data and for the example input that you provide in the data folder. Is this correct?
  2. I ask because when I examine the GTEx expression matrix of gene TPMs from https://gtexportal.org/home/datasets, the values appear to be different. For example, WASH7P, one of the first genes in that matrix, is expressed as follows in the data downloaded from GTEx:
Sample Expression
GTEX-1117F-0226-SM-5GZZ7 8.764
GTEX-1117F-0426-SM-5EGHI 3.861
GTEX-1117F-0526-SM-5EGHJ 7.349

But in your GTEx matrix, it is:

Sample Expression
GTEX-1117F-0226-SM-5GZZ7 5.907
GTEX-1117F-0426-SM-5EGHI 5.140
GTEX-1117F-0526-SM-5EGHJ 5.946

Perhaps I am misunderstanding something, or the units are not TPM?

Thanks for your time!

jvivian commented 4 years ago

Hi @eyzhao,

Please accept my apologies; I never received a notification that this issue was opened, and I'm not sure why.

Can you link exactly which GTEx file you downloaded? My starting dataframe for expression didn't come directly from GTEx, but from the UCSC Toil recompute, which likely involved different preprocessing and alignment steps than what GTEx used (at least I didn't see their process on that page).

In the data folder, the GTEx and TCGA data should have values that correspond to: np.log2(TPM + 1). I started with the data frames available on Xena, which use log2(TPM + 0.001). I transformed those values back to TPM, confirmed they summed to ~1 million, then applied the np.log2(TPM + 1) transformation.
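The transformation described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used to build the data folder: the function names and the synthetic values are hypothetical, and only the two pseudocounts (0.001 and 1) come from the description above.

```python
import numpy as np

XENA_PSEUDOCOUNT = 0.001  # pseudocount in the Xena log2(TPM + 0.001) values


def xena_log2_to_tpm(xena_values):
    """Invert log2(TPM + 0.001) to recover TPM (hypothetical helper)."""
    values = np.asarray(xena_values, dtype=float)
    tpm = np.power(2.0, values) - XENA_PSEUDOCOUNT
    # Floating-point rounding can leave tiny negative values where TPM was 0.
    return np.clip(tpm, 0.0, None)


def tpm_to_log2_plus_one(tpm):
    """Apply the log2(TPM + 1) transformation (hypothetical helper)."""
    return np.log2(np.asarray(tpm, dtype=float) + 1.0)


if __name__ == "__main__":
    # Synthetic single-sample values for illustration only, not real GTEx data.
    tpm_true = np.array([0.0, 1.0, 9.0, 999990.0])
    xena = np.log2(tpm_true + XENA_PSEUDOCOUNT)

    tpm = xena_log2_to_tpm(xena)
    # Sanity check: TPM for one sample should sum to roughly 1 million.
    print("TPM sum:", tpm.sum())
    print("log2(TPM + 1):", tpm_to_log2_plus_one(tpm))
```

On a real expression matrix, the sum check would be applied per sample (per column or per row, depending on orientation) before the final transformation.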

Please let me know if that does not answer your questions, and in the future feel free to email me directly if you do not receive a sufficiently prompt response.