reversed sign of Beta - Githubissues

privefl / bigsnpr

R package for the analysis of massive SNP arrays.

https://privefl.github.io/bigsnpr/

192 stars 44 forks source link

reversed sign of Beta #354

Closed alek0991 closed 2 years ago

alek0991 commented 2 years ago

Hi, I have been able to estimate the beta-auto using summary statistics from pan-ukbb with the LD-reference provided in the ldpred2 paper. For the Celiac disease which is a binary phenotype, sign of the estimated beta-auto for some variant in MHC region is the opposite of GWAS estimate (bottom-right corner of Figure 1). This is important for my work as I am comparing MHC and non-MHC region specifically. Does this make sense? If not, is there anyway I can fix this?

You can download relevant files from below in case needed. Download sumstats.csv Download multi_auto.rds

Also, I have attached similar plots for height and skin color for the sake of comparison.

Thank you! Ali

Figure 3: Celiac disease

Log:
1,031,956 variants to be matched.
0 ambiguous SNPs have been removed.
1,031,956 variants have been matched; 0 were flipped and 0 were reversed.

Beta_Z_6249

Figure 2: Standing Height

Beta_Z_4377

Figure 3: Skin color

Beta_Z_4004

privefl commented 2 years ago

This can happen due to LD; see e.g. fig S5 of https://doi.org/10.1016/j.ajhg.2019.11.001.

Quick simu showing this:

N <- 1000
set.seed(1)
X <- MASS::mvrnorm(N, rep(0, 2), matrix(c(1, 0.5, 0.5, 1), 2))
y <- X %*% c(-1, 3) + rnorm(N)

lm(y ~ X[, 1])  # marginal effect is positive due to correlation with 2nd column
lm(y ~ X)  # joint effect is negative

privefl commented 2 years ago

But there is a quite strong pattern for large effects, so there may be something odd there.

Maybe there is a problem of fitting with the multi-ancestry sumstats, try smaller values for shrink_corr (e.g. 0.5) to add more regularization.

alek0991 commented 2 years ago

I only use the summary stats for the EUR population of the pan-ukbb.

I have been playing with shrink_corr. The pattern doesn't change for shrink_corr=0.5. However, it start changing when shrink_corr<=0.1. This (shrink_corr<=0.1) is too much regularization, right?

Please note for the following plot (not the previous ones) I ran ldpred2-auto only for chr6. RED: MHC variants. BLUE: non-MHC variants.

shrink

privefl commented 2 years ago

shrink_corr = 0 means that you assume that there is absolutely no LD between variants, which is bad.

You should check frequencies with another reference I guess.

alek0991 commented 2 years ago

As I have mentioned above, ldpred2 reverses the sign of beta of some key variants in the MHC region that leads to a downstream result which is the opposite of P+T. In the following figure you can see that variants with reversed beta have higher sd_ldref/sd_ss.

In cases like this where ldpred2 and P+T are not consistent which one should we trust?
Could this inconsistency between P+T and ldpred2 be due to using the mixed model summary stats?

P.S.

I tried different LD references, matched frequencies, played with allow_jump_sign, shrink_corr, per chromosome, per chromosome chunk, and etc. However, ldpred2 stays robust for this specific phenotype (and some other correlated phenotype).
I also tried lassosum2 and it is showing a very similar pattern.

privefl commented 2 years ago

P+T just uses the effects from the GWAS, so these are just the marginal effects.

To be sure, I would just redo the GWAS in UKBB.

alek0991 commented 2 years ago

I have noticed the z score of the heritability estimate (z_h2 = h2/h2_se) is not significant (large variance). What is the minimum acceptable h2 and h2_z to be able to use the ldpred2?

privefl commented 2 years ago

LDSc regression tends to be bad for traits with small polygenicity, so do not focus too much on that.

You can always use LDpred2, the worst that can happen is that you PGS has no predictive power.