Closed by johannfaouzi 3 years ago
Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?
I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).
Thank you very much for the quick reply!
> Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?
Here is the histogram for the 39,235,157 SNPs of the Haplotype Reference Consortium (the y-scale is logarithmic because the histogram was ugly otherwise):
Here is the histogram with only the HapMap3 SNPs:
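For anyone who wants to produce the same kind of summary, here is a minimal sketch of how the INFO scores could be binned and put on a logarithmic count scale. It uses only the Python standard library. The function names `info_histogram` and `log10_counts` and the toy scores are hypothetical, made up for illustration, and are not the values from this cohort.

```python
import math


def info_histogram(info_scores, n_bins=10):
    """Count INFO scores in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for score in info_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"INFO score out of range: {score}")
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def log10_counts(counts):
    """Return log10 of each bin count; empty bins map to None."""
    return [math.log10(c) if c > 0 else None for c in counts]


# Hypothetical toy scores, for illustration only.
toy_scores = [0.05, 0.12, 0.18, 0.33, 0.47, 0.51, 0.76, 0.93, 0.99, 1.0]
n_bins = 5
counts = info_histogram(toy_scores, n_bins=n_bins)
for i, (count, log_count) in enumerate(zip(counts, log10_counts(counts))):
    low, high = i / n_bins, (i + 1) / n_bins
    shown = "n/a" if log_count is None else f"{log_count:.2f}"
    print(f"[{low:.1f}, {high:.1f}): count={count}, log10(count)={shown}")
```

The logarithmic scale matters here because, with tens of millions of variants, a few bins can dominate a linear axis and hide the rest of the distribution.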
> I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).
Sure, I would be very interested! My email address is johann.faouzi@gmail.com. Thank you very much!
Indeed, that's very bad. You can probably try a threshold of 0.3 to 0.5.
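To see what such a threshold costs in terms of variants, one could tabulate how many are retained at a few candidate values. This is only a sketch: the helper `retained_at_threshold` and the toy scores are hypothetical, and keeping variants with a score greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
def retained_at_threshold(info_scores, threshold):
    """Return (count, fraction) of variants with INFO >= threshold."""
    if not info_scores:
        raise ValueError("info_scores must not be empty")
    # Assumption: a variant is kept when its score is >= the threshold.
    kept = sum(1 for score in info_scores if score >= threshold)
    return kept, kept / len(info_scores)


# Hypothetical toy scores, for illustration only.
toy_scores = [0.05, 0.12, 0.18, 0.33, 0.47, 0.51, 0.76, 0.93, 0.99, 1.0]
for threshold in (0.3, 0.4, 0.5, 0.8):
    kept, fraction = retained_at_threshold(toy_scores, threshold)
    print(f"threshold {threshold:.1f}: {kept} variants kept ({fraction:.0%})")
```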
Thank you for the advice and the draft! I will read it with great interest!
Sorry if this question is not specific to LDpred2, but I would like to have your opinion on the trade-off between the quality of imputed variants and the number of variants included when computing genetic risk scores.
In one cohort, for which I would like to compute genetic risk scores, the genotyping array (NeuroX) has low coverage, so a lot of variants are imputed with mediocre quality.
The table below indicates the number of variants in HapMap3 and above the $r^2$ threshold (percentage in parentheses) for each chromosome and in total. So far, I was using a very conservative threshold (0.8), but a lot of variants are excluded. The MaCH paper mentions a threshold of 0.3. A former colleague told me that he often used a threshold of 0.6. I'm a bit confused about these different values and I would like to know your opinion.
I should mention that I don't have the true phenotypes for which I want to compute genetic risk scores. Another possibility that came to my mind would be to compute genetic risk scores for well-known phenotypes with large GWAS and for which I have the true phenotype (like height or body mass index) and select the threshold that gives the best results (measured by the Pearson correlation coefficient between the true phenotypes and the genetic risk scores), but I don't know if it's accurate to extrapolate to genetic risk scores for other phenotypes.
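To make the second idea concrete, here is a rough sketch of the selection procedure: for each candidate threshold, build a score from the variants that pass it, then keep the threshold whose scores correlate best with the known phenotype. Everything in it is hypothetical and simplified: the names `pearson`, `polygenic_score` and `select_threshold`, the toy dosages, the effect sizes, and the plain weighted sum used as the score (this is not LDpred2). It also does not answer whether the selected threshold carries over to other phenotypes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def polygenic_score(variants, threshold, n_individuals):
    """Weighted sum of dosages over variants whose quality is >= threshold."""
    scores = [0.0] * n_individuals
    for variant in variants:
        if variant["quality"] >= threshold:
            for i, dosage in enumerate(variant["dosages"]):
                scores[i] += variant["beta"] * dosage
    return scores


def select_threshold(variants, phenotype, candidates):
    """Return (best threshold, {threshold: correlation}) over the candidates."""
    correlations = {
        t: pearson(polygenic_score(variants, t, len(phenotype)), phenotype)
        for t in candidates
    }
    return max(correlations, key=correlations.get), correlations


# Hypothetical toy data: 3 variants, 4 individuals, all effect sizes set to 1.
toy_variants = [
    {"quality": 0.9, "beta": 1.0, "dosages": [0, 1, 1, 2]},
    {"quality": 0.7, "beta": 1.0, "dosages": [1, 0, 2, 0]},
    {"quality": 0.4, "beta": 1.0, "dosages": [2, 0, 2, 0]},
]
toy_phenotype = [1.0, 2.0, 3.0, 4.0]

best, correlations = select_threshold(toy_variants, toy_phenotype, (0.3, 0.6, 0.8))
for threshold, r in correlations.items():
    print(f"threshold {threshold:.1f}: r = {r:.3f}")
print(f"best threshold on this toy data: {best}")
```

On this toy data the strictest threshold wins only because the two lower-quality variants were constructed as noise. With real data the ordering depends on how much signal the lower-quality variants still carry.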
Thank you once again for your help!