bbitarello commented 3 years ago

Hi Florian,

I have been transitioning from ldpred1 to ldpred2 and as a sanity check I have been running a set of analyses on the exact same datasets. The summary statistics are from the UK Biobank and so is the LD reference panel (Eur ancestry). The test datasets include: one of European ancestry, one of admixed African and European ancestry, and one of (mostly) East Asian ancestry. For ldpred1 I used --ldr 1000 (which is what I had been using for a while) and for ldpred2 I am using the recommended window of 3 cM (size=3/1000 in the snp_cor function). Below is a summary of the comparisons. basically, for EUR ancestry estimates are highly correlated between ldpred1 and ldpred2, but not so for non-EUR ancestries. Questions: is this discrepancy expected? If so, why? Thank you

Eur ancestry

correlation between ldpred1 and ldpred2 PRS: 0.959 prediction accuracy: ldpred1 ldpred2 23.2% 23.3%

Afr+Eur ancestry

correlation between ldpred1 and ldpred2 PRS: 0.81 prediction accuracy: ldpred1 ldpred2 4.2%. 1.6%

East Asian ancestry:

correlation between ldpred1 and ldpred2 PRS: 0.653 prediction accuracy: ldpred1 ldpred2 8.7% 2%

PS: I tried running snp_cor in ldpred2 with size=1000 for comparison but that seemed computationally prohibitive.

privefl commented 3 years ago

No, I don't quite understand what is going on here. You're using the exact same two PGSs (same effects) but then the correlation is much smaller in other ancestry groups? Are there some large effects or something?
Are you using genetic positions when using 3 / 1000? How many/which variants are you using? What is the size of the .sbk file (the SFBM for LDpred2)?
You can't use 1000, it basically means you compute LD for all pairs of variants.

bvilhjal commented 3 years ago

Hi Barbara,

Thanks for doing this analysis. The results are not expected, and therefore also very interesting!

Assuming this isn't a bug, I hypothesize that difference in LD window sizes is what could be explaining this. The LD window used by ldpred1 is smaller than the one ldpred2 uses, and I suspect that could be a good thing when doing cross-ancestry prediction. The reason is that ldpred1/2 will in practice spread the weights among correlated variants, according to the LD. Under the ldpred1/2 model, this is the optimal thing to do. However, if we assume there are some untyped variants, or non-additive effects that are being tagged by some haplotypes in LD, then this may not always be optimal if LD changes between training and test data. This is obviously the case when training PRS on individuals of European ancestry and predicting into individuals of non-European ancestry. A large LD window might then spread the weights too widely, making the resulting PRS less robust/portable when doing cross-ancestry prediction. Hence, a smaller LD window, which ldpred1 uses compared to ldpred2, might end up being beneficial, and the faster the LD breaks down in the target sample, the smaller window you might want to use.

So, you can probably test this hypothesis by trying a smaller LD window in ldpred2.

Best, Bjarni

ps. Another and potentially more powerful approach would be to determine the window based on how much the LD in each region of the genome actually changes between training and target sample.

privefl commented 3 years ago

Yes, that would indeed be a very interesting result.

You can try with 0.5 cM instead maybe. You might also need to apply a bit more regularization (by trying hyper-parameter values for h2 that are very small, e.g. also 0.1 h2_ldsc along with the usual 0.7, 1, and 1.4 h2_ldsc).

bbitarello commented 3 years ago

Florian,

yes, I am using the same summary statistics.
For the analyses described above I was using snp_cor(size=3/1000) and infos.pos was using genetic positions after interpolation from genetic map based on hapmap using snp_asGeneticPos()
why would it be all SNPs? When I tried using size=1000 in snp_cor I was using infos.pos=NULL, which I understood meant that size is then interpreted as number of SNPs? That's what the manual says, right?

size: For one SNP, window size around this SNP to compute correlations. Default is ‘500’. If not providing ‘infos.pos’ (‘NULL’, the default), this is a window in number of SNPs, otherwise it is a window in kb (genetic distance).

I will try using 1 and 0.5 cM and report back.

Bjarni, my intuition was similar to what you said, that it had to do with the window size. For ldpred1, I chose 1,000 precisely because I tested other values and in terms of transferability this tended to give the best results although I didn't do an exhaustive search for optimal window size.

Follow up questions for both: -am I interpreting the snp_cor function incorrectly in that if infos.pos is NULL then size is the number of SNPs? If this interpretation is correct then snp_cor(size=1000, infos.pos=NULL) should be analogous to ldpred1 with --ldr 1000, correct? -another difference here is that ldpred1 by default constraints the intercept==1 in the ld score regression estimate of heritability, whereas ldpred2 does not. As a result, heritability estimates are lower in ldpred1 and that would affect posterior betas. Any ideas about how this could explain the discrepancy?

I will report back when I have new findings. Thank you both for your feedback.

privefl commented 3 years ago

my question was more, are you using the same two scores for predicting in all ancestries? Or do you rederive the two PGS again (maybe with new ref panels)
ok
ok if you're using infos.pos=NULL
there are still a few differences between LDpred1/2, even if you're using the same size (e.g. LDpred1 uses blocks, and does not account for LD between blocks)
LDpred2 does use constrained LD score regression by default as well (default in snp_ldsc2() used in the tutorial), but I don't think slight differences in the h2 hyper-parameter would change the results anyway.
Which sumstats are you using? Are there any large effects in there? Low-frequency variants?
Have you performed some QC on the sumstats? There are some internal QC done in LDpred1, e.g. MAF > 1%.

bvilhjal commented 3 years ago

Hi,

Thanks for doing this Bárbara, it's really interesting work! I have a couple of points as well.

Are you using LDpred2-auto?
It might also be interesting to see how lassosum2 would do as well, to see if that had the same issue.
Regarding how LDpred1 accounts for LD. It doesn't use LD blocks, only LDpred-inf used LD blocks. The Gibbs sampler uses a sliding window of a given SNP size.

Best, Bjarni

ps. I really liked your G3 paper.

privefl commented 3 years ago

Any update on this?

privefl / bigsnpr

Comparing ldpred1 and ldpred2 results for non-European ancestries #question #230

Eur ancestry

Afr+Eur ancestry

East Asian ancestry: