privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

Negative h2_init in LDpred2 auto model #299

Closed JulieanneKnupp closed 2 years ago

JulieanneKnupp commented 2 years ago

Good day, Florian

I'm not sure if you remember me, but I was the one with all the LDpred2 questions during the Q&A at the WCPG 2021...

Anyway, I'm finally almost done with computing PRS for my (small) dataset using the auto model, but have run into an error when trying to calculate 'multi_auto', getting this error: "Error: 'h2_init' should have only positive values."

When I trace 'h2_init' back (as h2_init = h2_est) to the LD score regression my output is: int h2 2.699634 -72.083050

I see that in a previous issue you noted that negative h2 can be calculated, but I was just wondering whether as it is not a percentage whether there is something wrong with my calculation, why it is a negative, and how you would suggest moving forward to ensure the most accurate PRS?

Please let me know if there is some information that I am missing and thank you so much for your help in advance!

I hope you are having a wonderful day!

privefl commented 2 years ago

These ldsc reg results are odd. I would check the information you pass to ldsc, e.g. the distribution of the LD scores and the chisq statistics.

JulieanneKnupp commented 2 years ago

Ah, okay. Thanks! I did feel like something was odd.

I've had a brief look at histograms of each (please let me know if there is a more thorough distribution check you would recommend), and attached them both below: Hist dist of chi2 Hist dist of LD

My guess is that the issue is the chi2 stats - so could there be something weird about my beta and beta_se values? Is there a check I can do for that as well?

My apologies for all the questions - I appreciate your help in this immensely!

privefl commented 2 years ago

chisq-stats looks okay-ish, but you have nothing gwide-significant (> 30). However, ld scores look quite small; how many variants do you use?

JulieanneKnupp commented 2 years ago

Hmm, I think that makes sense as, while I am using PGC data (predominantly EUR) as my sumstats, my target population is a highly diverse South African population - so the matched SNPs are quite limiting (and likely not those that would be genome wide significant). I have 19,107 variants that I'm using.

privefl commented 2 years ago

I guess that's the problem. That's too few variants to run LD score reg, and also to run LDpred2 I guess.

I'm surprised that you don't have more overlap, as the imputed PGC data should be fairly large (e.g. > 5M variants).

privefl commented 2 years ago

Also, it seems you are computing ld yourself. An alternative that might work better here is to use the one I report with the LD reference (and do not forget to use M as the total number of variants in the LD ref) for running LD score reg.

JulieanneKnupp commented 2 years ago

Ah, I see. My apologies again for the silly question, but how many variants (roughly) would you need to run LD score reg?

I think the issue might have been that I used the PGC pre-clustered SCZ dataset for PRS, which reduced the dataset quite a bit (after QC, only 31,667 variants). However, my target dataset only has 22,641 variants after QC... Would this already be too few, or would you suggest imputing it first as well? But I'll try it again with a non-PRS-specific dataset and hopefully that will leave me with more variants to use. Thank you so much for all your help!

privefl commented 2 years ago

A typical dataset has 200-800K genotyped variants or millions of imputed variants; I don't understand why you have so few?

privefl commented 2 years ago

Any update on this?