Trade-off between low quality imputed variants and the number of included variants

johannfaouzi commented 3 years ago

Sorry if this question is not specific to LDpred2, but I would like to have your opinion on the trade-off between including low-quality imputed variants and the number of variants available to compute genetic risk scores.

In one cohort for which I would like to compute genetic risk scores, the genotyping array (NeuroX) has low coverage, so a lot of variants are imputed with mediocre quality.

The table below gives, for each chromosome and in total, the number of HapMap3 variants whose imputation r^2 is above each threshold (percentage in parentheses). So far, I have been using a very conservative threshold (0.8), but it keeps only about 9% of the variants. The MaCH paper mentions a threshold of 0.3, and a former colleague told me that he often used a threshold of 0.6. I'm a bit confused by these different values and I would like to know your opinion.

I should mention that I don't have the true phenotypes for which I want to compute genetic risk scores. Another possibility that came to my mind would be to compute genetic risk scores for well-known phenotypes that have large GWAS and for which I do have the true phenotype (like height or body mass index), and then select the threshold that gives the best results, measured by the Pearson correlation coefficient between the true phenotypes and the genetic risk scores. However, I don't know if it's valid to extrapolate that threshold to genetic risk scores for other phenotypes.
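To make that idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated, the effect sizes are random rather than taken from a real GWAS, and the helper names (`pearson`, `risk_scores`, `evaluate_thresholds`) are hypothetical and not part of LDpred2 or bigsnpr. It only shows the selection loop: filter variants by r^2, compute a score, and correlate it with the known phenotype.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined correlation; treated as 0 in this sketch
    return sxy / math.sqrt(sxx * syy)


def risk_scores(dosages, weights, keep):
    """Weighted sum of dosages over the kept variant indices, per individual."""
    return [sum(row[j] * weights[j] for j in keep) for row in dosages]


def evaluate_thresholds(dosages, weights, r2, phenotype, thresholds):
    """Map each r^2 threshold to the score/phenotype correlation it yields."""
    results = {}
    for t in thresholds:
        keep = [j for j, q in enumerate(r2) if q > t]
        if not keep:  # no variant passes this threshold
            continue
        results[t] = pearson(risk_scores(dosages, weights, keep), phenotype)
    return results


# Simulated data (hypothetical, for illustration only).
rng = random.Random(42)
n_ind, n_var = 300, 100
weights = [rng.gauss(0, 1) for _ in range(n_var)]  # stand-in for GWAS effects
r2 = [rng.random() for _ in range(n_var)]          # stand-in for imputation r^2
true_geno = [[rng.choice([0, 1, 2]) for _ in range(n_var)] for _ in range(n_ind)]
# Imputed dosages: the lower the r^2, the noisier the dosage.
dosages = [[g + rng.gauss(0, 1 - r2[j]) for j, g in enumerate(row)]
           for row in true_geno]
phenotype = [sum(g * w for g, w in zip(row, weights)) + rng.gauss(0, 5)
             for row in true_geno]

thresholds = [round(0.1 * k, 1) for k in range(1, 10)]
results = evaluate_thresholds(dosages, weights, r2, phenotype, thresholds)
best = max(results, key=results.get)
for t in sorted(results):
    print(f"r2 > {t}: correlation = {results[t]:.3f}")
print("best threshold:", best)
```

The threshold picked this way depends on the phenotype's genetic architecture and on the GWAS used, which is exactly why extrapolating it to other phenotypes is uncertain.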

Thank you once again for your help!

Chromosome	R2 > 0.1	R2 > 0.2	R2 > 0.3	R2 > 0.4	R2 > 0.5	R2 > 0.6	R2 > 0.7	R2 > 0.8	R2 > 0.9
chr 1	69674 (79.95%)	52368 (60.09%)	40538 (46.52%)	31403 (36.04%)	24103 (27.66%)	18290 (20.99%)	13417 (15.40%)	9068 (10.41%)	4733 (5.43%)
chr 2	64160 (73.31%)	45472 (51.96%)	33856 (38.68%)	25412 (29.04%)	19039 (21.75%)	14001 (16.00%)	9989 (11.41%)	6689 (7.64%)	3689 (4.21%)
chr 3	54013 (73.57%)	37367 (50.90%)	27417 (37.35%)	20593 (28.05%)	16022 (21.82%)	12315 (16.77%)	9195 (12.52%)	6303 (8.59%)	3416 (4.65%)
chr 4	43668 (66.68%)	28094 (42.90%)	20118 (30.72%)	15160 (23.15%)	11574 (17.67%)	8790 (13.42%)	6555 (10.01%)	4490 (6.86%)	2530 (3.86%)
chr 5	48230 (72.71%)	33301 (50.21%)	23828 (35.92%)	17776 (26.80%)	13564 (20.45%)	10245 (15.45%)	7437 (11.21%)	5019 (7.57%)	2691 (4.06%)
chr 6	62362 (87.78%)	47351 (66.65%)	36413 (51.25%)	29308 (41.25%)	24035 (33.83%)	19828 (27.91%)	16146 (22.73%)	12797 (18.01%)	9664 (13.60%)
chr 7	41819 (72.06%)	27914 (48.10%)	19333 (33.31%)	14026 (24.17%)	10349 (17.83%)	7620 (13.13%)	5435 (9.36%)	3618 (6.23%)	1931 (3.33%)
chr 8	41587 (72.85%)	27461 (48.11%)	19260 (33.74%)	13494 (23.64%)	9663 (16.93%)	6756 (11.84%)	4647 (8.14%)	2936 (5.14%)	1497 (2.62%)
chr 9	33609 (69.33%)	22131 (45.65%)	15831 (32.66%)	11556 (23.84%)	8772 (18.10%)	6746 (13.92%)	5038 (10.39%)	3527 (7.28%)	2053 (4.24%)
chr 10	38566 (68.20%)	26880 (47.53%)	20200 (35.72%)	15506 (27.42%)	11566 (20.45%)	8527 (15.08%)	6081 (10.75%)	4103 (7.26%)	2206 (3.90%)
chr 11	43176 (80.01%)	32183 (59.64%)	25401 (47.07%)	20314 (37.64%)	16312 (30.23%)	12923 (23.95%)	10120 (18.75%)	7285 (13.50%)	4244 (7.86%)
chr 12	38124 (74.04%)	26637 (51.73%)	19930 (38.71%)	15502 (30.11%)	12165 (23.63%)	9448 (18.35%)	7085 (13.76%)	4740 (9.21%)	2634 (5.12%)
chr 13	26527 (66.10%)	16261 (40.52%)	10832 (26.99%)	7657 (19.08%)	5373 (13.39%)	3689 (9.19%)	2422 (6.04%)	1476 (3.68%)	788 (1.96%)
chr 14	26484 (75.47%)	18867 (53.76%)	13242 (37.73%)	9914 (28.25%)	7362 (20.98%)	5457 (15.55%)	3838 (10.94%)	2483 (7.08%)	1487 (4.24%)
chr 15	20996 (66.09%)	15369 (48.38%)	12080 (38.03%)	9944 (31.30%)	8026 (25.26%)	6252 (19.68%)	4671 (14.70%)	3352 (10.55%)	1836 (5.78%)
chr 16	18487 (57.00%)	12758 (39.34%)	9785 (30.17%)	7843 (24.18%)	6389 (19.70%)	5151 (15.88%)	4066 (12.54%)	2903 (8.95%)	1679 (5.18%)
chr 17	20608 (71.63%)	15559 (54.08%)	12918 (44.90%)	11143 (38.73%)	9524 (33.10%)	7990 (27.77%)	6465 (22.47%)	4860 (16.89%)	2933 (10.19%)
chr 18	18177 (57.71%)	11620 (36.89%)	8124 (25.79%)	5909 (18.76%)	4272 (13.56%)	3074 (9.76%)	2122 (6.74%)	1353 (4.30%)	756 (2.40%)
chr 19	16898 (85.33%)	14146 (71.43%)	11688 (59.02%)	9449 (47.71%)	7384 (37.29%)	5906 (29.82%)	4644 (23.45%)	3460 (17.47%)	2034 (10.27%)
chr 20	16905 (60.97%)	11566 (41.71%)	8852 (31.92%)	6712 (24.21%)	5209 (18.79%)	4164 (15.02%)	3131 (11.29%)	2180 (7.86%)	1166 (4.21%)
chr 21	8917 (59.00%)	6216 (41.13%)	4584 (30.33%)	3357 (22.21%)	2516 (16.65%)	1853 (12.26%)	1268 (8.39%)	844 (5.58%)	493 (3.26%)
chr 22	9639 (62.39%)	7431 (48.10%)	6121 (39.62%)	5171 (33.47%)	4370 (28.29%)	3578 (23.16%)	2817 (18.23%)	2021 (13.08%)	1079 (6.98%)
Total	762626 (72.33%)	536952 (50.93%)	400351 (37.97%)	307149 (29.13%)	237589 (22.53%)	182603 (17.32%)	136589 (12.96%)	95507 (9.06%)	55539 (5.27%)
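For reference, counts like those in the table can be produced along the following lines. This is only a sketch in plain Python: the r^2 values are simulated from an arbitrary beta distribution rather than read from real imputation output, and `count_above_thresholds` is a hypothetical helper name.

```python
import random


def count_above_thresholds(r2_values, thresholds):
    """Return {threshold: (count, percentage)} of variants with r2 > threshold."""
    total = len(r2_values)
    summary = {}
    for t in thresholds:
        n = sum(1 for r2 in r2_values if r2 > t)
        summary[t] = (n, 100.0 * n / total if total else 0.0)
    return summary


# Simulated imputation r^2 values for one chromosome (hypothetical data).
rng = random.Random(0)
r2_chr = [rng.betavariate(0.8, 1.5) for _ in range(10_000)]

thresholds = [round(0.1 * k, 1) for k in range(1, 10)]
summary = count_above_thresholds(r2_chr, thresholds)
for t, (n, pct) in summary.items():
    print(f"R2 > {t}: {n} ({pct:.2f}%)")
```

With real data, `r2_chr` would instead hold the per-variant quality values reported by the imputation software, restricted to the HapMap3 variants.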

privefl commented 3 years ago

Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?

I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).

johannfaouzi commented 3 years ago

Thank you very much for the quick reply!

> Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?

Here is the histogram for the 39,235,157 SNPs of the Haplotype Reference Consortium (the y-scale is logarithmic because the histogram was ugly otherwise):

Here is the histogram with only the HapMap3 SNPs:

> I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).

Sure, I would be very interested! My email address is johann.faouzi@gmail.com. Thank you very much!

privefl commented 3 years ago

Indeed, that's very bad. You can probably try a threshold of 0.3-0.5.

johannfaouzi commented 3 years ago

Thank you for the advice and the draft! I will read it with great interest!

privefl / bigsnpr

Trade-off between low quality imputed variants and the number of included variants #258