privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

prepare sumstats #281

Closed pjordab closed 2 years ago

pjordab commented 2 years ago

Hi Florian,

I am using LDpred2 with the results of a meta-analysis and when applying QC (https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-sumstats.R) all my SNPs are discarded.

I attach the graph. Any advice?

Thank you very much.

[Attached image: Graphic-Post]

privefl commented 2 years ago

Either

pjordab commented 2 years ago

Thank you very much for your prompt response. I have calculated the effective n using the formula Neff = 4 / (1 / cases + 1 / controls) as these are sumstats of a binary trait.
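As a minimal sketch of the formula quoted above, the effective sample size for a binary trait can be computed as follows. The function name and the case/control counts are hypothetical and only serve as an illustration.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size for a binary trait: 4 / (1/cases + 1/controls)."""
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("case and control counts must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Hypothetical counts, for illustration only.
balanced = effective_sample_size(5000, 5000)    # balanced design
unbalanced = effective_sample_size(1000, 9000)  # same total, unbalanced design
```

With a balanced design the result equals the total sample size; with the same total split unevenly it is considerably smaller.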

privefl commented 2 years ago

That's weird; are you sure these come from logistic regression? If you have a link to the summary data, I can have a quick look at these summary statistics.

pjordab commented 2 years ago

Thank you very much Florian for your help, it is greatly appreciated! The data I am using is internal data from my group that is not yet ready to be publicly released.

privefl commented 2 years ago

Which methods did you use to perform the GWAS and meta-analysis?

pjordab commented 2 years ago

Hi Florian, sorry for the late reply. The sumstats come from MTAG. My SNPs are removed by this filtering criterion: sd_ss > (sd_val + 0.1). Is it appropriate to use these sumstats in LDpred2, or should I adjust my beta/beta_se/neff values somehow before using them? Many thanks!
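To illustrate the criterion mentioned above, here is a rough sketch of how sd_ss and sd_val could be compared for a binary trait. It assumes sd_val = sqrt(2 * MAF * (1 - MAF)) and sd_ss = 2 / (beta_se * sqrt(n_eff)), the earlier form of the formula quoted later in this thread. The function names and numbers are hypothetical, and the linked script may apply further conditions that are not reproduced here.

```python
import math


def sd_from_maf(maf: float) -> float:
    """Genotype standard deviation implied by the allele frequency (sd_val)."""
    return math.sqrt(2.0 * maf * (1.0 - maf))


def sd_from_sumstats_binary(beta_se: float, n_eff: float) -> float:
    """Genotype standard deviation implied by the summary statistics (sd_ss)."""
    return 2.0 / (beta_se * math.sqrt(n_eff))


def exceeds_upper_bound(sd_ss: float, sd_val: float, margin: float = 0.1) -> bool:
    """Only the criterion quoted in the thread: sd_ss > sd_val + 0.1."""
    return sd_ss > sd_val + margin


# Hypothetical SNP whose standard error is consistent with n_eff = 10000.
maf = 0.3
sd_val = sd_from_maf(maf)
beta_se = 2.0 / (sd_val * math.sqrt(10_000))

kept = not exceeds_upper_bound(sd_from_sumstats_binary(beta_se, 10_000), sd_val)
# Plugging in an n_eff that is too small inflates sd_ss and trips the filter.
flagged = exceeds_upper_bound(sd_from_sumstats_binary(beta_se, 5_000), sd_val)
```

Because sd_ss scales with 1 / sqrt(n_eff), an underestimated n_eff shifts every SNP upward at once, which is one way all SNPs can end up discarded.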

privefl commented 2 years ago

Which n_eff are you using from MTAG then? Do you really get beta and beta_se from MTAG, or just z-scores?

pjordab commented 2 years ago

I get beta and beta_se from MTAG (https://github.com/JonJala/mtag/blob/master/mtag.py).

I use the GWAS-equivalent sample size, which is calculated as:

Neff_GWAS * (mean chi^2 MTAG - 1) / (mean chi^2 GWAS - 1)
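For reference, the GWAS-equivalent sample size described above can be sketched as below. This is only an illustration of the formula as written in the comment, not MTAG's own code; the function name and the chi-squared values are made up.

```python
from statistics import fmean


def gwas_equivalent_n(n_eff_gwas: float, chi2_mtag: list[float], chi2_gwas: list[float]) -> float:
    """Neff_GWAS * (mean chi^2 MTAG - 1) / (mean chi^2 GWAS - 1)."""
    mean_mtag = fmean(chi2_mtag)
    mean_gwas = fmean(chi2_gwas)
    if mean_gwas <= 1.0:
        raise ValueError("mean GWAS chi^2 must exceed 1 for the ratio to be meaningful")
    return n_eff_gwas * (mean_mtag - 1.0) / (mean_gwas - 1.0)


# Hypothetical chi-squared statistics for three SNPs.
chi2_gwas = [1.0, 1.2, 1.4]  # mean 1.2
chi2_mtag = [1.1, 1.3, 1.5]  # mean 1.3
n_equiv = gwas_equivalent_n(10_000, chi2_mtag, chi2_gwas)
```

Here the ratio of excess mean chi-squared is 0.3 / 0.2 = 1.5, so the GWAS-equivalent sample size is about 15000.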

privefl commented 2 years ago

Maybe try using the median ratio of chi^2 statistics for SNPs with chi^2 > 30 instead (as done in BOLT-LMM).
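One possible reading of this suggestion is sketched below. It assumes that the chi^2 statistics are paired per SNP, that the threshold of 30 is applied to the original GWAS statistic, and that the median ratio is then used to scale the GWAS effective sample size. These are assumptions made for illustration, and neither the function name nor the data come from BOLT-LMM or MTAG.

```python
from statistics import median


def median_ratio_n(n_eff_gwas: float, chi2_mtag: list[float], chi2_gwas: list[float],
                   threshold: float = 30.0) -> float:
    """Scale n_eff by the median MTAG/GWAS chi^2 ratio over strongly associated SNPs."""
    ratios = [m / g for m, g in zip(chi2_mtag, chi2_gwas) if g > threshold]
    if not ratios:
        raise ValueError("no SNP exceeds the chi^2 threshold")
    return n_eff_gwas * median(ratios)


# Hypothetical per-SNP statistics; the first SNP falls below the threshold and is ignored.
chi2_gwas = [5.0, 40.0, 50.0, 80.0]
chi2_mtag = [6.0, 48.0, 65.0, 96.0]
n_scaled = median_ratio_n(10_000, chi2_mtag, chi2_gwas)
```

The median over well-powered SNPs is less sensitive to a few extreme values than a ratio of means.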

From the plot you have, it seems that the effective sample size you're using is too small.

Otherwise, try to estimate Neff directly from the median of the values from equation (4) of https://doi.org/10.1101/2021.03.29.437510.

pjordab commented 2 years ago

When you say the median of the values, do you mean I should compute one effective N for the whole sample, using the median of beta and the median of beta_se?

(4 / var - median_beta^2) / median_beta_se^2

What value do I take as the sample variance?

Or do I calculate the effective N per SNP, using the per-SNP variance 2 * MAF * (1 - MAF)?

Thank you!

privefl commented 2 years ago

Calculate per SNP, and then take the median.
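A minimal sketch of that procedure, assuming the per-SNP expression discussed above, n_eff = (4 / var - beta^2) / beta_se^2 with var = 2 * MAF * (1 - MAF). The function names and input values are hypothetical, and the expression is taken from this thread rather than checked against the preprint.

```python
import math
from statistics import median


def n_eff_per_snp(maf: float, beta: float, beta_se: float) -> float:
    """Per-SNP effective sample size implied by MAF, beta and its standard error."""
    var_g = 2.0 * maf * (1.0 - maf)
    return (4.0 / var_g - beta ** 2) / beta_se ** 2


def median_n_eff(mafs: list[float], betas: list[float], beta_ses: list[float]) -> float:
    """Compute n_eff for every SNP first, then take the median across SNPs."""
    return median([n_eff_per_snp(m, b, s) for m, b, s in zip(mafs, betas, beta_ses)])


# Hypothetical SNPs whose standard errors are built to be consistent with n_eff = 20000.
mafs = [0.1, 0.25, 0.4]
betas = [0.05, -0.02, 0.03]
true_n = 20_000
beta_ses = [math.sqrt((4.0 / (2.0 * m * (1.0 - m)) - b ** 2) / true_n)
            for m, b in zip(mafs, betas)]

estimate = median_n_eff(mafs, betas, beta_ses)  # recovers true_n up to rounding
```

Taking the median of per-SNP values, rather than plugging median beta and median beta_se into the formula once, keeps each SNP's own allele frequency in the calculation.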

pjordab commented 2 years ago

Hi Florian, this worked. Sincere (and many!) thanks for all your previous answers and help.

I'd like to ask you some additional questions about the preparation of sumstats.

1) If I am using my own genotypes to calculate the correlation matrix, should I apply only the QC recommended here?

https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-sumstats.R

2) And when using the provided LD reference, should I apply only the QC recommended here (and not the previous one)?

https://github.com/privefl/paper-ldpred2/blob/master/code/example-with-provided-ldref.R (lines 27-31)

Finally, one last question. The last paragraph of the paper's discussion says "However, LDpred2-auto requires some QC to be performed on the summary statistics".

Does this mean that the QC is only relevant for the auto model, or also for the grid and infinitesimal models?

I understand that if I use all three models, the most practical approach is to calculate my correlation matrix with the SNPs that remain after QC and then run each method separately from there. But if I only use the grid model, for example, can I skip the QC?

privefl commented 2 years ago

The QC should be about the same.

The QC is more important for LDpred2-auto, but I would suggest doing it for LDpred2-grid as well.

pjordab commented 2 years ago

Hi Florian,

Huge congratulations on your recent article in AJHG; it's great work and very interesting!

Regarding the update to equation 1 that you describe there, is it implemented in this version of bigsnpr?

Version: 1.8.1 Date: 2021-05-27

And regarding the QC, since equation 1 has been updated from:

sd = sd(y) / (se * sqrt(n))

to:

sd = sd (y) / sqrt (n se beta^2 + beta^2)

Would equation 2 be as follows?

sd = 2/(se * sqrt(neff))

to

sd = 2 / sqrt (se neff beta^2 + beta^2)

And should we still use n effective in this formula?

Many thanks for your help and your work!

privefl commented 2 years ago

Thanks!

Yes, these are updates to the previous formulas (they rely on one fewer approximation; note the added beta^2 term).

Note that you should read beta_se^2 instead of se * beta^2.

pjordab commented 2 years ago

Oh, thanks!

I messed up the parentheses in the original formula.

So:

sd = sd(y) / sqrt(n * se^2 + beta^2)

sd = 2 / sqrt(neff * se^2 + beta^2)
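As a quick numerical check of the two corrected formulas, the sketch below implements them next to the earlier binary-trait approximation, sd = 2 / (se * sqrt(neff)). The function names and input values are hypothetical, and this is not the bigsnpr implementation.

```python
import math


def sd_continuous(sd_y: float, n: float, beta: float, beta_se: float) -> float:
    """Updated formula for a continuous trait: sd(y) / sqrt(n * se^2 + beta^2)."""
    return sd_y / math.sqrt(n * beta_se ** 2 + beta ** 2)


def sd_binary(n_eff: float, beta: float, beta_se: float) -> float:
    """Updated formula for a binary trait: 2 / sqrt(neff * se^2 + beta^2)."""
    return 2.0 / math.sqrt(n_eff * beta_se ** 2 + beta ** 2)


def sd_binary_old(n_eff: float, beta_se: float) -> float:
    """Earlier approximation without the beta^2 term: 2 / (se * sqrt(neff))."""
    return 2.0 / (beta_se * math.sqrt(n_eff))


# Hypothetical SNP: the beta^2 term slightly lowers the implied standard deviation.
new_sd = sd_binary(n_eff=10_000, beta=0.1, beta_se=0.03)
old_sd = sd_binary_old(n_eff=10_000, beta_se=0.03)
```

The two versions coincide when beta is 0 and diverge more for SNPs with large effects, which is where the extra term matters.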

And one last question: in which version of the code was this updated? I currently have this one installed:

Version: 1.8.1 Date: 2021-05-27

privefl commented 2 years ago

I think it was updated in v1.5.6.

pjordab commented 2 years ago

Great!! Thank you!!!