privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
196 stars 44 forks source link

Genetic correlation using LDpred2 summary statistics #360

Closed alek0991 closed 2 years ago

alek0991 commented 2 years ago

Dear Florian, I have made a lot of progress thanks to your helpful responses to my questions in the past few weeks and I highly appreciate that. I am trying to calculate the genetic correlation between phenotypes and I was wondering if I can use the joint effect predicted by ldpred2 to calculate the genetic correlation between phenotype $x$ and $y$ like below: $$C_nx=CnG\lambda{joint}$$ $$C_ny=CnG\gamma{joint}$$ $$~~~x^TCny=\lambda{joint}^TG^TCnG\gamma{joint}~~~(1)$$ From the ldpred2 paper we have: $$G^TCnG=(n-1)SRS$$ $$RS\gamma{joint}=S\gamma_{marg}$$ If we plug them into Eq. 1 we have: $$\frac{1}{n-1}x^TCny=\lambda{joint}^TS^2\gamma{marg} $$ Therefore, $$cov(x,y)=\lambda{joint}^TS^2\gamma{marg}$$ Probably due to approximations, estimated cov(x,y) and cov(y,x) are not necessarily identical. Therefore, the genetic correlation is estimated like below: $$r{xy} = \frac{cov(x,y)+cov(y,x)}{2\sqrt{cov(x,x)cov(y,y)}}$$

privefl commented 2 years ago

$\gamma$ are the effects on the allele scale (e.g. from the marginals from GWAS or the joints you want to use for polygenic scores), while $\beta$ are the effects of the scaled genotypes, which is the scale I use internally in LDpred2.

I'll test your equation and get back to you.

privefl commented 2 years ago

If you forget about this approximation $$RS\gamma{joint} \approx S\gamma{marg}$$ you get $$cov(x,y)=\lambda{joint}^TSRS\gamma{joint}=\delta{joint}^TR\beta{joint}$$ where $\delta$ and $\beta$ are the scaled versions.

privefl commented 2 years ago

This is the formula I use to get $h^2_x=cov(x,x)$. For two different iterations, I get $r^2_x = \delta_1^T R \delta2$ (still need to justify why, but it works very well in practice). So, I think $\delta^T R \beta$ should give the co-predictability, and not the co-heritability, which will underestimate $r{xy}$.

privefl commented 2 years ago

Sometime ago, I tried instead using $$r_{xy} = \frac{\delta^T R \beta}{\sqrt{r_x^2 r_y^2}}$$ IIRC, it worked well in practice in case of complete sample overlap between GWAS of x and y, but not when there was no overlap.

bvilhjal commented 2 years ago

Hi @alek0991

I do believe one should be able to account for sample overlap, e.g. by inferring it in the Gibbs sampler.

Best, Bjarni

alek0991 commented 2 years ago

Dear Florian and Bjarni, Thanks for your useful comments. I have been testing all these formulas and they are usually very similar when the sd_ldref and sd_ss are matching well.

However, I am working with pan-ukbb summary data which is mixed model using SAIGE and probably not the best summary data especially for binary traits with large-effect loci like red hair, bilirubin, and celiac disease.

As you have already mentioned in Fig. S15 of your recent paper, regions with high LD can correlate with PCs and create outliers in sd_ldref vs. sd_ss plot. It seems when a large effect loci overlaps with a high LD region the estimated joint effects become unreliable. I have already mentioned the celiac disease case here https://github.com/privefl/bigsnpr/issues/354#issue-1325122440 and I have seen similar pattern for red hair, bilirubin and a few other binary traits with large-effect loci. I should mention that for quantitative traits everything looks normal as far as I have noticed.

privefl commented 2 years ago

Yes, sumstats from LMM or GLMM are not recommended to use with LDpred2. My guess is that this is a problem for most sumstats-based methods.