privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Trade-off between low quality imputed variants and the number of included variants #258

Closed johannfaouzi closed 3 years ago

johannfaouzi commented 3 years ago

Sorry if this question is not specific to LDpred2, but I would like your opinion on the trade-off between excluding low-quality imputed variants and keeping enough variants to compute genetic risk scores.

In one cohort for which I would like to compute genetic risk scores, the genotyping array (NeuroX) has low coverage, so many variants are imputed with mediocre quality.

The table below gives, for each chromosome and in total, the number of HapMap3 variants whose imputation r^2 is above each threshold (percentage in parentheses). So far, I have been using a very conservative threshold (0.8), which excludes most variants. The MaCH paper mentions a threshold of 0.3, and a former colleague told me that he often used 0.6. I'm a bit confused by these different values and would like to know your opinion.

I should mention that I don't have the true phenotypes for which I want to compute genetic risk scores. Another possibility that came to mind would be to compute genetic risk scores for well-known phenotypes that have large GWAS and for which I do have the true phenotype (like height or body mass index), and to select the threshold that gives the best results, measured by the Pearson correlation coefficient between the true phenotypes and the genetic risk scores. However, I don't know whether it is valid to extrapolate that choice to genetic risk scores for other phenotypes.
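
The selection procedure described above could be sketched roughly as follows. This is only an illustration with toy data: the function names (`pearson`, `risk_scores`, `select_threshold`) and all dosages, effect sizes and r^2 values are hypothetical, and the score is a plain weighted sum of dosages, not an LDpred2 model.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def risk_scores(dosages, betas, keep):
    """Weighted sum of dosages over the kept variant indices, per individual."""
    return [sum(row[j] * betas[j] for j in keep) for row in dosages]


def select_threshold(dosages, betas, r2, phenotype, thresholds):
    """Return (best threshold, {threshold: correlation}) on a known phenotype."""
    results = {}
    for t in thresholds:
        keep = [j for j, quality in enumerate(r2) if quality > t]
        if not keep:
            continue  # no variant passes this threshold
        scores = risk_scores(dosages, betas, keep)
        if len(set(scores)) < 2:
            continue  # constant score, correlation undefined
        results[t] = pearson(scores, phenotype)
    best = max(results, key=results.get)
    return best, results


# Hypothetical toy data: 6 individuals x 3 variants (imputed dosages).
dosages = [[0, 2, 1], [1, 0, 2], [2, 1, 0], [0, 1, 2], [1, 2, 0], [2, 0, 1]]
betas = [1.0, 0.5, 0.5]   # hypothetical GWAS effect sizes
r2 = [0.9, 0.5, 0.2]      # hypothetical imputation quality per variant
# Toy phenotype built from the first two variants only, so the third
# (poorly imputed) variant acts as noise in this example.
phenotype = [row[0] * 1.0 + row[1] * 0.5 for row in dosages]

best, results = select_threshold(dosages, betas, r2, phenotype, [0.1, 0.3, 0.6])
print(best, results)
```

In this toy setup the intermediate threshold wins by construction. On real data, the threshold selected on height or body mass index may not be the best one for another phenotype, which is exactly the extrapolation concern raised above.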

Thank you once again for your help!

| Chromosome | R2 > 0.1 | R2 > 0.2 | R2 > 0.3 | R2 > 0.4 | R2 > 0.5 | R2 > 0.6 | R2 > 0.7 | R2 > 0.8 | R2 > 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| chr 1 | 69674 (79.95%) | 52368 (60.09%) | 40538 (46.52%) | 31403 (36.04%) | 24103 (27.66%) | 18290 (20.99%) | 13417 (15.40%) | 9068 (10.41%) | 4733 (5.43%) |
| chr 2 | 64160 (73.31%) | 45472 (51.96%) | 33856 (38.68%) | 25412 (29.04%) | 19039 (21.75%) | 14001 (16.00%) | 9989 (11.41%) | 6689 (7.64%) | 3689 (4.21%) |
| chr 3 | 54013 (73.57%) | 37367 (50.90%) | 27417 (37.35%) | 20593 (28.05%) | 16022 (21.82%) | 12315 (16.77%) | 9195 (12.52%) | 6303 (8.59%) | 3416 (4.65%) |
| chr 4 | 43668 (66.68%) | 28094 (42.90%) | 20118 (30.72%) | 15160 (23.15%) | 11574 (17.67%) | 8790 (13.42%) | 6555 (10.01%) | 4490 (6.86%) | 2530 (3.86%) |
| chr 5 | 48230 (72.71%) | 33301 (50.21%) | 23828 (35.92%) | 17776 (26.80%) | 13564 (20.45%) | 10245 (15.45%) | 7437 (11.21%) | 5019 (7.57%) | 2691 (4.06%) |
| chr 6 | 62362 (87.78%) | 47351 (66.65%) | 36413 (51.25%) | 29308 (41.25%) | 24035 (33.83%) | 19828 (27.91%) | 16146 (22.73%) | 12797 (18.01%) | 9664 (13.60%) |
| chr 7 | 41819 (72.06%) | 27914 (48.10%) | 19333 (33.31%) | 14026 (24.17%) | 10349 (17.83%) | 7620 (13.13%) | 5435 (9.36%) | 3618 (6.23%) | 1931 (3.33%) |
| chr 8 | 41587 (72.85%) | 27461 (48.11%) | 19260 (33.74%) | 13494 (23.64%) | 9663 (16.93%) | 6756 (11.84%) | 4647 (8.14%) | 2936 (5.14%) | 1497 (2.62%) |
| chr 9 | 33609 (69.33%) | 22131 (45.65%) | 15831 (32.66%) | 11556 (23.84%) | 8772 (18.10%) | 6746 (13.92%) | 5038 (10.39%) | 3527 (7.28%) | 2053 (4.24%) |
| chr 10 | 38566 (68.20%) | 26880 (47.53%) | 20200 (35.72%) | 15506 (27.42%) | 11566 (20.45%) | 8527 (15.08%) | 6081 (10.75%) | 4103 (7.26%) | 2206 (3.90%) |
| chr 11 | 43176 (80.01%) | 32183 (59.64%) | 25401 (47.07%) | 20314 (37.64%) | 16312 (30.23%) | 12923 (23.95%) | 10120 (18.75%) | 7285 (13.50%) | 4244 (7.86%) |
| chr 12 | 38124 (74.04%) | 26637 (51.73%) | 19930 (38.71%) | 15502 (30.11%) | 12165 (23.63%) | 9448 (18.35%) | 7085 (13.76%) | 4740 (9.21%) | 2634 (5.12%) |
| chr 13 | 26527 (66.10%) | 16261 (40.52%) | 10832 (26.99%) | 7657 (19.08%) | 5373 (13.39%) | 3689 (9.19%) | 2422 (6.04%) | 1476 (3.68%) | 788 (1.96%) |
| chr 14 | 26484 (75.47%) | 18867 (53.76%) | 13242 (37.73%) | 9914 (28.25%) | 7362 (20.98%) | 5457 (15.55%) | 3838 (10.94%) | 2483 (7.08%) | 1487 (4.24%) |
| chr 15 | 20996 (66.09%) | 15369 (48.38%) | 12080 (38.03%) | 9944 (31.30%) | 8026 (25.26%) | 6252 (19.68%) | 4671 (14.70%) | 3352 (10.55%) | 1836 (5.78%) |
| chr 16 | 18487 (57.00%) | 12758 (39.34%) | 9785 (30.17%) | 7843 (24.18%) | 6389 (19.70%) | 5151 (15.88%) | 4066 (12.54%) | 2903 (8.95%) | 1679 (5.18%) |
| chr 17 | 20608 (71.63%) | 15559 (54.08%) | 12918 (44.90%) | 11143 (38.73%) | 9524 (33.10%) | 7990 (27.77%) | 6465 (22.47%) | 4860 (16.89%) | 2933 (10.19%) |
| chr 18 | 18177 (57.71%) | 11620 (36.89%) | 8124 (25.79%) | 5909 (18.76%) | 4272 (13.56%) | 3074 (9.76%) | 2122 (6.74%) | 1353 (4.30%) | 756 (2.40%) |
| chr 19 | 16898 (85.33%) | 14146 (71.43%) | 11688 (59.02%) | 9449 (47.71%) | 7384 (37.29%) | 5906 (29.82%) | 4644 (23.45%) | 3460 (17.47%) | 2034 (10.27%) |
| chr 20 | 16905 (60.97%) | 11566 (41.71%) | 8852 (31.92%) | 6712 (24.21%) | 5209 (18.79%) | 4164 (15.02%) | 3131 (11.29%) | 2180 (7.86%) | 1166 (4.21%) |
| chr 21 | 8917 (59.00%) | 6216 (41.13%) | 4584 (30.33%) | 3357 (22.21%) | 2516 (16.65%) | 1853 (12.26%) | 1268 (8.39%) | 844 (5.58%) | 493 (3.26%) |
| chr 22 | 9639 (62.39%) | 7431 (48.10%) | 6121 (39.62%) | 5171 (33.47%) | 4370 (28.29%) | 3578 (23.16%) | 2817 (18.23%) | 2021 (13.08%) | 1079 (6.98%) |
| Total | 762626 (72.33%) | 536952 (50.93%) | 400351 (37.97%) | 307149 (29.13%) | 237589 (22.53%) | 182603 (17.32%) | 136589 (12.96%) | 95507 (9.06%) | 55539 (5.27%) |

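
For reference, the counting behind one row of this table can be sketched as below. This is a minimal Python illustration, not the code actually used: `count_above_thresholds` and the toy r^2 values are hypothetical, and the percentage is assumed to be relative to the number of variants passed in.

```python
def count_above_thresholds(r2_values, thresholds):
    """Return {threshold: (count, percent)} for variants with r2 > threshold."""
    n_total = len(r2_values)
    summary = {}
    for t in thresholds:
        n_kept = sum(1 for r2 in r2_values if r2 > t)
        percent = 100.0 * n_kept / n_total if n_total else 0.0
        summary[t] = (n_kept, percent)
    return summary


# Hypothetical toy r^2 values for eight variants on one chromosome.
toy_r2 = [0.05, 0.15, 0.35, 0.45, 0.62, 0.81, 0.93, 0.97]
thresholds = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

summary = count_above_thresholds(toy_r2, thresholds)
for t, (n_kept, percent) in summary.items():
    print(f"R2 > {t:.1f}: {n_kept} ({percent:.2f}%)")
```
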
privefl commented 3 years ago

Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?

I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).

johannfaouzi commented 3 years ago

Thank you very much for the quick reply!

> Indeed, the imputation seems quite bad. Do you have a histogram of the INFO scores?

Here is the histogram for the 39,235,157 SNPs of the Haplotype Reference Consortium (the y-scale is logarithmic because the histogram was ugly otherwise):

[Image: screenshot (2021-09-04) of the histogram of INFO scores for all HRC SNPs, logarithmic y-axis]

Here is the histogram with only the HapMap3 SNPs:

[Image: screenshot (2021-09-04) of the histogram of INFO scores for the HapMap3 SNPs only]
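
The binning behind such a histogram, including the logarithmic y-scale, can be sketched as follows. This is a hedged Python illustration on made-up values: `histogram_counts` and `log10_counts` are hypothetical helpers, and the INFO scores are assumed to lie in [0, 1].

```python
import math


def histogram_counts(values, n_bins=10):
    """Count values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 goes into the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def log10_counts(counts):
    """log10 of each bin count; empty bins map to None (log undefined)."""
    return [math.log10(c) if c > 0 else None for c in counts]


# Hypothetical INFO scores, skewed towards low quality.
toy_info = [0.02, 0.03, 0.07, 0.08, 0.12, 0.35, 0.55, 0.95, 1.0]

counts = histogram_counts(toy_info)
log_counts = log10_counts(counts)
print(counts)
print(log_counts)
```

A log scale keeps the sparsely populated high-quality bins visible when the low-quality bins dominate, which is the reason given above for using it.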

> I am actually releasing a preprint in ~2 weeks where I investigate the issue of using imputed data. If you email me, I can send you a current draft with possible solutions you might want to try (including refined QC and corrections for imputation quality).

Sure, I would be very interested! My email address is johann.faouzi@gmail.com. Thank you very much!

privefl commented 3 years ago

Indeed, that's very bad. You can probably try a threshold of 0.3-0.5.
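
To put numbers on that range, here is a small Python calculation using only the "Total" row of the table above. The variable names are illustrative, and the implied total assumes each percentage is relative to one fixed set of HapMap3 variants.

```python
# (count, percent) from the "Total" row of the table, keyed by r^2 threshold.
totals = {
    0.3: (400351, 37.97),
    0.4: (307149, 29.13),
    0.5: (237589, 22.53),
    0.8: (95507, 9.06),
}

baseline = totals[0.8][0]  # variants kept with the conservative 0.8 threshold

# How many times more variants are kept than at 0.8.
gain = {t: n / baseline for t, (n, _) in totals.items()}

# Total number of HapMap3 variants implied by each (count, percent) pair.
implied_total = {t: n / (pct / 100.0) for t, (n, pct) in totals.items()}

for t in totals:
    print(f"R2 > {t}: {totals[t][0]} variants, "
          f"{gain[t]:.2f}x the 0.8 set, implied total ~{implied_total[t]:.0f}")
```

Lowering the threshold from 0.8 to 0.5 keeps roughly 2.5 times as many variants, and lowering it to 0.3 keeps roughly 4 times as many.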

johannfaouzi commented 3 years ago

Thank you for the advice and the draft! I will read it with great interest!