Assessing the quality of the GRS when using ldpred2-auto (unassessed phenotypes)

johannfaouzi commented 2 years ago

I'm using ldpred2-auto to compute genetic risk scores for phenotypes that are not assessed and I wanted to perform some sort of sanity check on the computed GRS. Therefore, I computed GRS for easy-to-assess phenotypes (height, body mass index) for which large GWAS were performed using the same methodology (same script).

I used:

(Yengo et al, 2018) for height (explained variance: h^2 = 0.483)
(Pulit et al, 2019) for body mass index (explained variance: h^2 = 0.279)

If I'm not mistaken, the explained variance is the square of the Pearson correlation coefficient, so I computed the correlation coefficients on my data set and I get:

For height: 0.435676
For body mass index: 0.289705

If I square these coefficients, the results are (much) lower than the explained variance reported in these studies. Several factors could explain these differences:

smaller set of SNPs in HapMap3 than in the GWAS,
the auto method of LDpred2 which is unsupervised (unsupervised learning is harder than supervised learning),
the quality of the individual-level genetic data.
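As a quick check of the comparison above, the two coefficients can be squared directly. This is a minimal R sketch using only the values quoted in this comment; the variable names are illustrative.

```r
# Pearson correlations between the GRS and the phenotype, as reported above
r <- c(height = 0.435676, bmi = 0.289705)

# The variance explained by the GRS is the squared correlation
r2 <- r^2
print(round(100 * r2, 2))  # in percent

# Values reported in the GWAS papers (Yengo et al., 2018; Pulit et al., 2019)
reported <- c(height = 0.483, bmi = 0.279)
print(round(r2 / reported, 2))  # fraction of the reported value recovered
```

This gives roughly 19% for height and 8% for BMI, before any adjustment for covariates.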

I was wondering what your opinion is of these values. Are the Pearson correlation coefficients low enough to suggest an error in my code, or are they high enough to suggest that my scripts are sound?

Thank you very much once again for your help!

privefl commented 2 years ago

h2 is usually the SNP heritability, which should always be greater than or equal to r2, the variance explained by the PGS. Off the top of my head, I would say that SNP-h2 is 45-55% for height and 28-35% for BMI. r2 in the UKBB would be something like 35-42% for height and 10-15% for BMI. Also, you need to compute the partial correlation, not just the correlation, e.g. with bigstatsr::pcor(), since e.g. sex explains a lot of the variance in height.
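To illustrate why the adjustment matters, here is a minimal base-R sketch of a partial correlation computed by residualising both variables on the covariates. It does not call bigstatsr::pcor(); the helper `partial_cor()` and the simulated data (effect sizes, sample size) are hypothetical and chosen only for illustration.

```r
# Partial correlation of x and y given covariates z:
# regress both on z, then correlate the residuals.
# z may be a vector (one covariate) or a matrix (several covariates).
partial_cor <- function(x, y, z) {
  z <- as.matrix(z)
  res_x <- resid(lm(x ~ z))
  res_y <- resid(lm(y ~ z))
  cor(res_x, res_y)
}

# Simulated example (all numbers are made up for illustration)
set.seed(1)
n      <- 2000
sex    <- rbinom(n, 1, 0.5)                     # 0/1 covariate
grs    <- rnorm(n)                              # standardised genetic score
height <- 165 + 13 * sex + 4 * grs + rnorm(n, sd = 5)

raw_r     <- cor(grs, height)                   # ignores sex
partial_r <- partial_cor(grs, height, sex)      # adjusts for sex
print(c(raw = raw_r, partial = partial_r))
```

Because sex accounts for a large share of the variance in the simulated height, removing it from both variables leaves less unexplained variance, and the correlation with the score increases.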

The auto method is not really an unsupervised learning method. Unsupervised means no phenotype, but there is a phenotype when you compute the GWAS sumstats. It is just that it automatically finds values for the hyper-parameters instead of having to try a grid and choose the best model in a validation set.

johannfaouzi commented 2 years ago

Thank you for your answer and your valuable feedback!

I computed the partial correlations (correcting for sex, should I add other covariates?) and obtained the following values for R2:

For height: 31.33%
For BMI: 9.66%

So the values are slightly lower than the ones that you mentioned but still quite close. Do you think that these values are still high enough to have good confidence in the quality of the GRS that I computed?

privefl commented 2 years ago

In my latest paper, I use sex, 16 PCs, age, birth date and deprivation index I think. But sex should be the main driver for height.

What is the sample size you used to derive the sumstats? I'm talking about R2 for at least N=360K in training.

johannfaouzi commented 2 years ago

Thank you for your feedback!

Both GWAS are meta-analyses that include UK Biobank data. The maximum sample sizes in the training sets are 806,833 for BMI and 709,703 for height.

My data is a set of Parkinson's disease cohorts, so the data is longitudinal and the number of subjects is small (a few hundred subjects in each cohort), so I added the following covariates:

sex
4 PCs
Age at baseline
Birth date (year)

and I get the following R2 values: 38.66% for height and 10.55% for BMI.

privefl commented 2 years ago

Seems okay. What are the CIs?

johannfaouzi commented 2 years ago

For height: r = 0.621817 and 95%CI = [0.56, 0.68]
For BMI: r = 0.324819 and 95%CI = [0.23, 0.41]

I only have 383 subjects in this cohort, so the CIs are relatively wide.
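For reference, intervals of this kind can be approximated with the Fisher z-transform. The sketch below is in base R and makes two assumptions that are not confirmed in this thread: that the standard error is 1 / sqrt(n - k - 3), and that k = 7 covariates were adjusted for (sex, 4 PCs, age at baseline, birth year). The helper `fisher_ci()` is hypothetical.

```r
# Approximate confidence interval for a (partial) correlation
# using the Fisher z-transform.
# n: number of subjects; k: number of covariates adjusted for.
# Assumption: standard error of atanh(r) is 1 / sqrt(n - k - 3).
fisher_ci <- function(r, n, k = 0, level = 0.95) {
  z  <- atanh(r)
  se <- 1 / sqrt(n - k - 3)
  q  <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(lower = -1, upper = 1) * q * se)
}

n <- 383
k <- 7  # assumed: sex, 4 PCs, age at baseline, birth year

ci_height <- fisher_ci(0.621817, n, k)
ci_bmi    <- fisher_ci(0.324819, n, k)
print(round(rbind(height = ci_height, bmi = ci_bmi), 2))

# Squaring the bounds gives the corresponding range for R2
print(round(100 * rbind(height = ci_height^2, bmi = ci_bmi^2), 1))
```

Under these assumptions the intervals round to the ones quoted above, and the implied R2 ranges are roughly 31% to 46% for height and 5% to 17% for BMI.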

privefl commented 2 years ago

Seems okay.

johannfaouzi commented 2 years ago

Ok, thank you very much!

privefl/bigsnpr, issue #264