privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
190 stars 44 forks source link

Comparing ldpred1 and ldpred2 results for non-European ancestries #question #230

Closed bbitarello closed 3 years ago

bbitarello commented 3 years ago

Hi Florian,

I have been transitioning from ldpred1 to ldpred2 and as a sanity check I have been running a set of analyses on the exact same datasets. The summary statistics are from the UK Biobank and so is the LD reference panel (Eur ancestry). The test datasets include: one of European ancestry, one of admixed African and European ancestry, and one of (mostly) East Asian ancestry. For ldpred1 I used --ldr 1000 (which is what I had been using for a while) and for ldpred2 I am using the recommended window of 3 cM (size=3/1000 in the snp_cor function). Below is a summary of the comparisons. basically, for EUR ancestry estimates are highly correlated between ldpred1 and ldpred2, but not so for non-EUR ancestries. Questions: is this discrepancy expected? If so, why? Thank you

Eur ancestry

correlation between ldpred1 and ldpred2 PRS: 0.959 prediction accuracy: ldpred1 ldpred2 23.2% 23.3%

Afr+Eur ancestry

correlation between ldpred1 and ldpred2 PRS: 0.81 prediction accuracy: ldpred1 ldpred2 4.2%. 1.6%

East Asian ancestry:

correlation between ldpred1 and ldpred2 PRS: 0.653 prediction accuracy: ldpred1 ldpred2 8.7% 2%

PS: I tried running snp_cor in ldpred2 with size=1000 for comparison but that seemed computationally prohibitive.

privefl commented 3 years ago
bvilhjal commented 3 years ago

Hi Barbara,

Thanks for doing this analysis. The results are not expected, and therefore also very interesting!

Assuming this isn't a bug, I hypothesize that difference in LD window sizes is what could be explaining this. The LD window used by ldpred1 is smaller than the one ldpred2 uses, and I suspect that could be a good thing when doing cross-ancestry prediction. The reason is that ldpred1/2 will in practice spread the weights among correlated variants, according to the LD. Under the ldpred1/2 model, this is the optimal thing to do. However, if we assume there are some untyped variants, or non-additive effects that are being tagged by some haplotypes in LD, then this may not always be optimal if LD changes between training and test data. This is obviously the case when training PRS on individuals of European ancestry and predicting into individuals of non-European ancestry. A large LD window might then spread the weights too widely, making the resulting PRS less robust/portable when doing cross-ancestry prediction. Hence, a smaller LD window, which ldpred1 uses compared to ldpred2, might end up being beneficial, and the faster the LD breaks down in the target sample, the smaller window you might want to use.

So, you can probably test this hypothesis by trying a smaller LD window in ldpred2.

Best, Bjarni

ps. Another and potentially more powerful approach would be to determine the window based on how much the LD in each region of the genome actually changes between training and target sample.

privefl commented 3 years ago

Yes, that would indeed be a very interesting result.

You can try with 0.5 cM instead maybe. You might also need to apply a bit more regularization (by trying hyper-parameter values for h2 that are very small, e.g. also 0.1 h2_ldsc along with the usual 0.7, 1, and 1.4 h2_ldsc).

bbitarello commented 3 years ago

Florian,

size: For one SNP, window size around this SNP to compute correlations. Default is ‘500’. If not providing ‘infos.pos’ (‘NULL’, the default), this is a window in number of SNPs, otherwise it is a window in kb (genetic distance).

Bjarni, my intuition was similar to what you said, that it had to do with the window size. For ldpred1, I chose 1,000 precisely because I tested other values and in terms of transferability this tended to give the best results although I didn't do an exhaustive search for optimal window size.

Follow up questions for both: -am I interpreting the snp_cor function incorrectly in that if infos.pos is NULL then size is the number of SNPs? If this interpretation is correct then snp_cor(size=1000, infos.pos=NULL) should be analogous to ldpred1 with --ldr 1000, correct? -another difference here is that ldpred1 by default constraints the intercept==1 in the ld score regression estimate of heritability, whereas ldpred2 does not. As a result, heritability estimates are lower in ldpred1 and that would affect posterior betas. Any ideas about how this could explain the discrepancy?

I will report back when I have new findings. Thank you both for your feedback.

privefl commented 3 years ago
bvilhjal commented 3 years ago

Hi,

Thanks for doing this Bárbara, it's really interesting work! I have a couple of points as well.

Best, Bjarni

ps. I really liked your G3 paper.

privefl commented 3 years ago

Any update on this?