Script for converting data from plink to dosage format now works correctly.

sritchie73 commented 9 years ago

Imputes missing data (SNPs with missing data for some, but not all samples) to 2 * the minor allele frequency of that SNP as per discussion in pull request #11

sritchie73 commented 9 years ago

It looks like its performing pretty well now that the conversion from plink to the dosage format imputes missing genotype data (and now that I don't have the allele columns switched).

Is the heritability of each gene available somewhere? It would be nice to sanity check that the correlation between the predicted and actual expression is high for genes with high heritability.

Also I see a small number of genes with strong negative correlation between the predicted and actual gene expression. Is this something you've seen before?

hakyim commented 9 years ago

It looks like its performing pretty well now that the conversion from plink to the dosage format imputes missing genotype data (and now that I don't have the allele columns switched).

plink uses different convention than we (and others) do in the sense that the dosage refers to the number of the first allele. Do you mean that you took this into account?

Is the heritability of each gene available somewhere? It would be nice to sanity check that the correlation between the predicted and actual expression is high for genes with high heritability.

We do find that R2 and h2 are in sync, see Figure 3 in the PrediXcan paper. Red dots (R2) more or less follow black curve (h2). We'll make h2 available soon.

Also I see a small number of genes with strong negative correlation between the predicted and actual gene expression. Is this something you've seen before?

Which expression data are you seeing this negative correlation?

sritchie73 commented 9 years ago

It looks like its performing pretty well now that the conversion from plink to the dosage format imputes missing genotype data (and now that I don't have the allele columns switched).

plink uses different convention than we (and others) do in the sense that the dosage refers to the number of the first allele. Do you mean that you took this into account?

Yes: in the first pull request I had the columns the wrong way around. This has been fixed.

Is the heritability of each gene available somewhere? It would be nice to sanity check that the correlation between the predicted and actual expression is high for genes with high heritability.

We do find that R2 and h2 are in sync, see Figure 3 in the PrediXcan paper. Red dots (R2) more or less follow black curve (h2). We'll make h2 available soon.

Awesome. The R2 distribution I see is roughly the same shape as the h2 distribution in Figure 3.

Also I see a small number of genes with strong negative correlation between the predicted and actual gene expression. Is this something you've seen before?

Which expression data are you seeing this negative correlation?

I see a negative correlation (< -0.1) for 134 genes in the DILGOM cohort: a random sample (N = 518) of the working-age (24y – 75y) Finnish population in Helsinki.

hakyim commented 9 years ago

Is this whole blood expression?

On Thursday, October 29, 2015, Scott Ritchie notifications@github.com wrote:

It looks like its performing pretty well now that the conversion from plink to the dosage format imputes missing genotype data (and now that I don't have the allele columns switched).

plink uses different convention than we (and others) do in the sense that the dosage refers to the number of the first allele. Do you mean that you took this into account?

Yes: in the first pull request I had the columns the wrong way around. This has been fixed.

Is the heritability of each gene available somewhere? It would be nice to sanity check that the correlation between the predicted and actual expression is high for genes with high heritability.

We do find that R2 and h2 are in sync, see Figure 3 in the PrediXcan paper. Red dots (R2) more or less follow black curve (h2). We'll make h2 available soon.

Awesome. The R2 distribution I see is roughly the same shape as the h2 distribution in Figure 3.

Also I see a small number of genes with strong negative correlation between the predicted and actual gene expression. Is this something you've seen before?

Which expression data are you seeing this negative correlation?

I see a negative correlation (< -0.1) for 134 genes in the DILGOM cohort: a random sample (N = 518) of the working-age (24y – 75y) Finnish population in Helsinki.

— Reply to this email directly or view it on GitHub https://github.com/hakyimlab/PrediXcan/pull/12#issuecomment-152354450.

sritchie73 commented 9 years ago

Ah yes, forgot that part. It's whole blood.

sritchie73 commented 9 years ago

I've dug into this a bit, and there seems to be a mix of things happening here.

I've imputed with the GTEx Whole Blood model as well now. I also see a similar number of genes whose predicted expression is negatively correlated with the actual expression, but for (mostly) different genes.

Most of these genes have a weak correlation coefficient (between -0.1 and 0.1) with the actual expression for both prediction models, so are probably those with low-heritability and low predictive power.

There are 37 genes whose predicted expression are negatively correlated < -0.1 with the actual expression when using both the DGN and GTEx whole blood models. I'm guessing these are due to population effects (SNPs for which the Finnish minor allele is the major allele in both the GTEx and DGN populations?) or coding errors in the genotype data.

Interestingly, (but perhaps not surprisingly) it also seems the DGN model is confounded by SNPs associated with / genes differentially expressed in individuals with major depressive disorder (if I understand the cohort correctly). For example, the predicted expression of the MED4 gene from the DGN model has a Pearson correlation of -0.7 with the actual gene expression, but a correlation of 0.05 with the actual gene expression when predicted from the GTEx whole blood model. There were also 39 other genes where this might be happening, but were not in the predicted expression table when imputing from the GTEx model.

The inverse is also true for the GTEx model. The are 13 genes whose predicted expression negatively correlated < -0.1 with the actual expression, but when predicted with the DGN model are positively correlated > 0.1 with the actual expression. I have no explanation for this discordance.

So it seems safest to impute using both models, and taking genes whose predicted expression is highly correlated across both models.

hakyim commented 9 years ago

This is all very interesting.

It would help to compute the p value of the spearman correlation to weed out non significant and also outlier driven correlations.

Regarding the minor/major allele, they should not matter as long as the dosage/other alleles are well coded. LD differences between the cohorts could explain it. For example, if there is a causal variant that is positively correlated with the model SNP in one pop and negative in the other.

How did you compute prediction with both models? Did you just average?

A correlation of -0.7 seems quite extreme. Do you know that MED4 is differentially expressed? My understanding was that there was not much found in terms of diff expression in this dataset but I may be wrong. MED4 heritability was estimated as 0.15 in DGN.

I was preparing the h2 values from DGN to suggest to you to check the h2 of the genes you are finding negatively correlated: https://app.box.com/s/ew0v8pn76i9p91c7iv3rzstld3dfiecf

On Thu, Oct 29, 2015 at 8:04 PM, Scott Ritchie notifications@github.com wrote:

I've dug into this a bit, and there seems to be a mix of things happening here.

I've imputed with the GTEx Whole Blood model as well now. I also see a similar number of genes whose predicted expression is negatively correlated with the actual expression, but for (mostly) different genes.

Most of these genes have a weak correlation coefficient (between -0.1 and 0.1) with the actual expression for both prediction models, so are probably those with low-heritability and low predictive power.

There are 37 genes whose predicted expression are negatively correlated < -0.1 with the actual expression when using both the DGN and GTEx whole blood models. I'm guessing these are due to population effects (SNPs for which the Finnish minor allele is the major allele in both the GTEx and DGN populations?) or coding errors in the genotype data.

Interestingly, (but perhaps not surprisingly) it also seems the DGN model is confounded by SNPs associated with / genes differentially expressed in individuals with major depressive disorder (if I understand the cohort correctly). For example, the predicted expression of the MED4 gene from the DGN model has a Pearson correlation of -0.7 with the actual gene expression, but a correlation of 0.05 with the actual gene expression when predicted from the GTEx whole blood model. There were also 39 other genes where this might be happening, but were not in the predicted expression table when imputing from the GTEx model.

— Reply to this email directly or view it on GitHub https://github.com/hakyimlab/PrediXcan/pull/12#issuecomment-152371555.

hakyim commented 9 years ago

I should also note that we found a bug in our prediction models, thanks to Craig Glastonbury. We were scaling the dosages when generating the models but did not correct the weights to be used with unscaled dosages. We are correcting and posting new models shortly. But this is highly unlikely to create real negative correlations.

sritchie73 commented 9 years ago

I've generated some figures showing Spearman correlation between imputed and observed vs. h2 for both models, as well as model concordance (Spearman correlation between expression imputed using the DGN and GTEx models), and model concordance vs. h2.

Re: MED4. I'm not at all familiar with the literature on major depressive disorder. A cursory search indicates that it is not (nor are any cis-SNPs).

My reasoning was that by imputing with both models, genes with negative correlation in both models could be attributable to population structure or coding errors, while genes with negative correlation in one model but not the other could be attributed to the model itself capturing variation intrinsic to the training cohort.

Let me know when the new models are available and I'll run the imputation again.

hakyim commented 9 years ago

Hi Scott I meant to reply to this long ago...

This is very useful although I still don't know what to make of the sign flips. Some may be due to artifacts but are any of those real? Thank you for sending the plots. Maybe we can discuss via email? My email is haky at uchicago dot edu

Thanks Haky

JYE-Lee commented 7 years ago

Hi, I am new with PrediXcan and found this helpful script! I tried to convert my plink data into dosage format using command as below. "plink python convert_plink_to_dosages.py -b mydata -o test"

And an error message shows up like "IOError: [Errno 2] No such file or directory: 'test.traw'"

I understand that files I need are this script and plink files, but what is ".traw" file?

hakyim commented 7 years ago

Scott, could you take a look at this?

Thanks Haky

---------- Forwarded message ---------- From: JYE-Lee notifications@github.com Date: Fri, Jan 6, 2017 at 4:20 AM Subject: Re: [hakyimlab/PrediXcan] Script for converting data from plink to dosage format now works correctly. (#12) To: hakyimlab/PrediXcan PrediXcan@noreply.github.com Cc: Hae Kyung Im hakyim@gmail.com, State change < state_change@noreply.github.com>

Hi, I am new with PrediXcan and found this helpful script! I tried to convert my plink data into dosage format using command as below. "plink python convert_plink_to_dosages.py -b mydata -o test"

And an error message shows up like "IOError: [Errno 2] No such file or directory: 'test.traw'"

I understand that files I need are this script and plink files, but what is ".traw" file?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/hakyimlab/PrediXcan/pull/12#issuecomment-270874036, or mute the thread https://github.com/notifications/unsubscribe-auth/AC2oua2IY1nwvhKFGx6DJowZ5BHWfZT2ks5rPhVZgaJpZM4GXE52 .

ScottPDickinson commented 7 years ago

The test.traw file should be created as a temporary file while the script is running, no need to have one as an input. Please make sure you specify the path to your plink executable as an argument. You command should look something like:

python convert_plink_to_dosage.py -p /path/to/plink -b mydata -o test

If you're you not sure where you have plink installed, you can run which plink from the terminal to get the path. Hope this helps.

JYE-Lee commented 7 years ago

Thanks, it was an error due to plink version. Now it works!

hakyimlab / PrediXcan

Script for converting data from plink to dosage format now works correctly. #12