privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

Open PEWilliamZhou opened 1 year ago

PEWilliamZhou commented 1 year ago

ldpred2_auto_chain_convergence Dear Author,

I have been using LDpred2-auto for my analyses, and I encountered an issue where many of the h2est values I obtained exceeded 1, with some converging around 2 or even 3. I have checked the chains and confirmed that they have indeed converged. Out of the 42 outcomes I analyzed, only one provided a reasonable h2 estimate (h2=0.67 and in my test set R2=0.75, this is a good result). For my analysis, I used the LD matrix provided by you and applied it to GWAS data from the Icelandic population. The selected outcomes should all have high heritability (R2 > 0.7). I used the QC process according to your guidance: "sd_ldref <- with(gwas_match, sqrt(2 ImpMAF (1 - ImpMAF))) sd_ss <- with(gwas_match, 1 / sqrt(N SE^2+beta^2)) fail_qc <- sd_ss < (0.7 sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05 "

I am wondering if you could provide any guidance or suggestions on how to address this issue, or if there is anything I might have overlooked in the process.

Thank you for your assistance.

privefl commented 1 year ago

Can you tell me more? Are these continuous traits? Binary traits? What is the sample size? Can you show one of the QC plots comparing SDs? What is the estimate you get from LD Score reg?

PEWilliamZhou commented 1 year ago

Thank you for your quick response! These are all continuous traits, specifically proteomics traits. For instance, take the trait mentioned above (5810_25_TDGF1_Cripto), which has effective sample sizes (n_eff) of 34,527~35,449. The number of SNPs that passed QC is 636,782, while 750,760 did not pass (for most traits, more than 1,380,000 matched and over 1,000,000 passed QC, but they still have h2 >1). The LDSC results are as follows: int 0.940522128 int_se 0.010463805 h2 1.620702647 h2_se 0.943200144 Here's the QC figure: ldpred2_gwasqc_5810_25_TDGF1_Cripto

privefl commented 1 year ago

There is something going on with this QC plot; you should not filter all these variants.

Were the sumstats derived from a linear mixed model like BOLT-LMM? Do you have a mix of both simple linear reg and mixed model maybe?

Could you color this plot by chromosome, and then by INFO score (if you have this info).

PEWilliamZhou commented 1 year ago

I used deCODE summary statistics from icelandic population (linear mixed model implemented in BOLT-LMM). I don't have INFO. 1681474925007 1681475321017 I also want to let you know that I tried using both ld <- c(ld, Matrix::colSums(corr_chr^2)) (calculated by selected SNP matrix) and the ld in your hm3+, and tried using your af_UKBB in map file instead of af from deCODE, and they all showed bad results with h2>1. You mentioned that I should not filter all these variants. Can you tell me the correct way to perform QC?

I would like to share with you my code related to LDpred2-auto. part of my code.txt

privefl commented 1 year ago

Could you provide the link where I can download these sumstats?

PEWilliamZhou commented 1 year ago

Sure! Here is the link to the summary statistics file: list43.txt I recommend starting with the first 6 lines. I could only obtain reasonable results for SIGLEC9_Siglec_9, while the other traits provided incorrect h2 estimates. This is deCODE readme file. proteomics_readme.txt

PEWilliamZhou commented 1 year ago

Could you provide the link where I can download these sumstats?

Dear Privefl,

Have you tried with the summary statistics? Please let me know if you found something, or if you need any further information.

privefl commented 1 year ago

I've tried on the first one, and have the same problem.

Given that these are mixed-model sumstats (which breaks some assumptions of polygenic score models), and that some effects are so large, I think it is quite difficult to estimate the heritability here.

If you can get normal linear regression summary statistics, I think it would help.

PEWilliamZhou commented 1 year ago

Thank you very much for your help!

If I were to continue using the mixed-model summary statistics and do not have a validation set available, do you have any recommendations or suggestions on how to proceed?

privefl commented 1 year ago

Depends on what you want to do:

PEWilliamZhou commented 1 year ago

Dear author, Thank you for your help.

I tried LDpred2-auto and your HM3+ on two other sumstats with linear model: FENLAND study and AGES study (selecting several proteomics traits that have h2 ~0.7). I still got h2 >1.

I think my code is correct, since I can get normal result with your "public-data3-sumstats.txt".

Could this be because the external LD reference is being used instead of the LD from the summary statistics samples?

My aim is to compute those PGS without validation set. If I cannot estimate correct h2, is there a way to get a reliable PGS? 1682631827833

privefl commented 1 year ago

Getting a good estimate of h2 is not a guarantee for getting a good PGS. Similarly, a bad estimate does not necessarily mean that you get a bad PGS. Prediction and inference are kind of two different things. What are the range that you get?

Do you have links for the FENLAND / AGES sumstats?

PEWilliamZhou commented 1 year ago

a_trait.zip For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link Its ranges are around 1.6 or around 0.001

privefl commented 1 year ago

What is the sample size for the second one?

PEWilliamZhou commented 1 year ago

What is the sample size for the second one?

5368

privefl commented 1 year ago

Using sd(y)=1, the QC plot is very good. qcplot

I'll run LDpred2-auto now.

privefl commented 1 year ago

LD score reg results:

       int     int_se         h2      h2_se 
1.17693343 0.01450483 2.83327279 2.82301985

Large effects and small N -> complicated

privefl commented 1 year ago

Indeed, all range are 1.60-1.61 and the heritability estimate is 3.1 [2.6-3.6].

If you look at the estimated r2 for each variant separately r2 <- with(sumstats, beta^2 / (beta^2 + n_eff * beta_se^2)) there is one variant that explains 82.5% of the phenotypic variance.

I guess LDpred2 cannot cope with such large effects, especially with such a small GWAS sample size. A common strategy in this case is to use this effect as fixed with its GWAS sample size (or use a few variants with COJO), and make a predictor with LDpred2 based on the residualized sumstats (you can get this from COJO maybe, or just consider all other variants that are not correlated with the top one, e.g. from different LD blocks).

bvilhjal commented 1 year ago

Running LDpred2 on COJO residualized summary stats is maybe not a very common idea, but a good idea in this situation. Using COJO you might have to account for a couple of variants. Also, when residualizing the GWAS sum stats it's a good idea to update even the distant (not in LD) effect sizes, as their relative effect sizes will increase when a large part of the phenotypic variance is explained. See e.g. https://www.medrxiv.org/content/10.1101/2022.11.09.22281216v1 as an example.

privefl commented 1 year ago

I don't think effect sizes would change, just their SE would be smaller.

bvilhjal commented 1 year ago

Yes, true. It becomes more significant, and therefore the z-score increases.

privefl commented 1 year ago

I am trying to reproduce this behavior in simulations, but this is a bit challenging.

xuxwen commented 1 year ago

a_trait.zip For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link Its ranges are around 1.6 or around 0.001 Hello, I would like to know if this AGES data you provided is the QTL data of the protein, is this a protein data, I downloaded and looked at it and I don't know if it is a protein data, there are about 7.5 million SNP numbers in total.

PEWilliamZhou commented 1 year ago

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

xuxwen commented 1 year ago

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

We used the same data as you used for Iceland, and h2 had the same anomaly and periodic failure to converge. I did not find this AGES data you used, can you provide a link to the article.

privefl commented 1 year ago

An update on the simulations. I can definitely reproduce the very large LDSc h2 estimates. But then I get normal h2 with LDpred2-auto in simulations. I also get normal LDSc intercept. Anything going on with the 1.177 estimate we get for the intercept? Pop structure or relatedness maybe?

bvilhjal commented 1 year ago

I think training LDpred2 on the Ferkingstad et al. (Nat Genet 2021) data (currently) has some challenges. First, these are BOLT-LMM summary statistics, which are different from standard least square or logistic regression marginal effect sizes. The BOLT-LMM sum stats might also be slightly inflated due to family structure (see e.g. Jiang et al., Nat Genet 2019). Second, the sample has rather strong sample ascertainment, which has been known to bias heritability estimates. According to the paper, the participants are heavily enriched to be cancer patients and likely with for other diagnoses as well. Third, assuming a point-Normal prior for some protein levels might simply be suboptimal?