privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars, 44 forks

Quality control of summary statistics exclude (almost) all SNPs #349

jianvhuang closed this issue 2 years ago

jianvhuang commented 2 years ago

Hi Florian,

I tried LDpred2 on GWAS with small sample sizes (8,000 to 35,000), and the QC procedure based on the script below categorizes almost all of the SNPs as "bad". My target population also has a small sample size (<1000).

is_bad <- sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05

I wonder if you have any advice on this.

Thank you.
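To make the rule quoted above concrete, here is a minimal sketch that wraps it in a helper and applies it to four made-up SNPs. The helper name `qc_is_bad`, the allele frequencies `af`, and the `sd_ss` values are all hypothetical; the sketch also assumes that `sd_val` is derived from allele frequencies as sqrt(2 * af * (1 - af)), which the thread itself does not state.

```r
# Hypothetical helper wrapping the QC rule quoted in the question.
qc_is_bad <- function(sd_ss, sd_val) {
  sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05
}

# Made-up allele frequencies; sd_val from them is an assumption of this sketch.
af     <- c(0.30, 0.30, 0.30, 0.001)
sd_val <- sqrt(2 * af * (1 - af))      # about 0.648, 0.648, 0.648, 0.045

# Made-up standard deviations derived from summary statistics.
sd_ss  <- c(0.64, 0.20, 0.90, 0.04)

res <- qc_is_bad(sd_ss, sd_val)
print(res)
# FALSE  TRUE  TRUE  TRUE
```

The first SNP passes because `sd_ss` is close to `sd_val`; the second is too small relative to `sd_val`, the third too large, and the fourth fails both absolute thresholds.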

privefl commented 2 years ago

Can you show me the plot?

jianvhuang commented 2 years ago

See the three examples below:

[Plots: QQPLOT-SummaryStatQC-example1, QQPLOT-SummaryStatQC-example2, QQPLOT-SummaryStatQC-example3]

privefl commented 2 years ago

The first one looks good; you probably just forgot the sd(y) in the estimate of sd_ss.

You can estimate it with e.g. with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2 + beta^2), 0.01))).
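One way to see why this estimate can work is to try it on simulated summary statistics where sd(y) is known. The sketch below is only an illustration under stated assumptions: it assumes the relation var(y) = var(G) * (n_eff * beta_se^2 + beta^2), takes var(G) = 2 * af * (1 - af), and uses made-up values for `m`, `n_eff`, `true_sd_y`, and the effect sizes. Under those assumptions, 0.5 * (n_eff * beta_se^2 + beta^2) is smallest for SNPs with allele frequency near 0.5 (where var(G) is close to 0.5), and the 1% quantile acts as a robust minimum.

```r
set.seed(1)
m         <- 5000    # number of SNPs (made up)
n_eff     <- 20000   # effective sample size (made up)
true_sd_y <- 3       # true phenotype SD in this simulation (made up)

af     <- runif(m, 0.05, 0.5)        # simulated allele frequencies
sd_val <- sqrt(2 * af * (1 - af))    # assumed genotype SD
beta   <- rnorm(m, 0, 0.05)          # small simulated effects

# Standard errors implied by the assumed variance relation.
df_beta <- data.frame(
  beta    = beta,
  n_eff   = n_eff,
  beta_se = sqrt((true_sd_y^2 / sd_val^2 - beta^2) / n_eff)
)

# Estimate of sd(y) suggested in the thread.
sd_y_est <- with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2 + beta^2), 0.01)))
print(sd_y_est)
```

In this idealized setting the estimate lands very close to `true_sd_y`; with real data it will be noisier, and it relies on some SNPs having allele frequencies near 0.5.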

jianvhuang commented 2 years ago

I used sd_ss = with(df_beta, 2 / sqrt(n_eff * beta_se^2)) to estimate sd_ss. Is this only for binary traits? Since my GWAS traits are quantitative, I should use the code below, right?

sd_y = with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2 + beta^2), 0.01)))
sd_ss = with(df_beta, sd_y / sqrt(n_eff * beta_se^2))

privefl commented 2 years ago

Yes, you should try that. (Don't forget the additional + beta^2, in case you have some large effects)
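The effect of the numerator can be sketched on simulated data for a quantitative trait. Everything below is an assumption made for illustration: the simulated phenotype SD of 5, the variance relation var(y) = var(G) * (n_eff * beta_se^2 + beta^2), `sd_val` computed from allele frequencies, and the helper name `qc_is_bad`. It compares a fixed numerator of 2 with the estimated sd_y in the QC rule from the original question.

```r
set.seed(1)
m <- 2000; n_eff <- 20000; true_sd_y <- 5   # made-up simulation settings

af     <- runif(m, 0.05, 0.5)
sd_val <- sqrt(2 * af * (1 - af))           # assumed genotype SD
beta   <- rnorm(m, 0, 0.05)
df_beta <- data.frame(
  beta    = beta,
  n_eff   = n_eff,
  beta_se = sqrt((true_sd_y^2 / sd_val^2 - beta^2) / n_eff)
)

# Hypothetical helper wrapping the QC rule from the question.
qc_is_bad <- function(sd_ss, sd_val) {
  sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05
}

# Fixed numerator of 2 versus the estimated sd(y).
sd_ss_fixed <- with(df_beta, 2 / sqrt(n_eff * beta_se^2 + beta^2))
sd_y        <- with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2 + beta^2), 0.01)))
sd_ss_est   <- with(df_beta, sd_y / sqrt(n_eff * beta_se^2 + beta^2))

prop_bad_fixed <- mean(qc_is_bad(sd_ss_fixed, sd_val))
prop_bad_est   <- mean(qc_is_bad(sd_ss_est, sd_val))
print(c(fixed = prop_bad_fixed, estimated = prop_bad_est))
```

In this simulation the fixed numerator gives sd_ss of about 0.4 * sd_val, so every SNP falls below the 0.5 * sd_val threshold, while the estimated sd_y brings sd_ss back in line with sd_val. Real data may of course behave differently.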

jianvhuang commented 2 years ago

Thank you very much! I will try that.

jianvhuang commented 2 years ago

So to be clear,

In case of large effects, I should add + beta^2 in the calculation for both sd_y and sd_ss, right?

sd_y = with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2 + beta^2), 0.01)))
sd_ss = with(df_beta, sd_y / sqrt(n_eff * beta_se^2 + beta^2))

And if I consider the effects not to be large, I should remove beta^2 from both equations, right?

sd_y = with(df_beta, sqrt(quantile(0.5 * (n_eff * beta_se^2), 0.01)))
sd_ss = with(df_beta, sd_y / sqrt(n_eff * beta_se^2))

privefl commented 2 years ago

You should always include the beta^2 term; it makes one fewer approximation, so it should give a better fit. It only makes a difference when there are large effects; otherwise the difference is negligible.
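A small numeric sketch of this point, using three made-up SNPs with the same genotype SD and increasingly large effects. The values of `sd_y`, `sd_val`, `n_eff`, and `beta` are hypothetical, and the standard errors are generated from the assumed relation var(y) = var(G) * (n_eff * beta_se^2 + beta^2).

```r
sd_y   <- 1      # phenotype SD, taken as known here (made up)
sd_val <- 0.6    # genotype SD shared by all three SNPs (made up)

df_beta <- data.frame(beta = c(0.001, 0.01, 0.5), n_eff = 20000)
# Standard errors implied by the assumed variance relation.
df_beta$beta_se <- with(df_beta, sqrt((sd_y^2 / sd_val^2 - beta^2) / n_eff))

sd_ss_with    <- with(df_beta, sd_y / sqrt(n_eff * beta_se^2 + beta^2))
sd_ss_without <- with(df_beta, sd_y / sqrt(n_eff * beta_se^2))

print(round(cbind(beta = df_beta$beta, sd_ss_with, sd_ss_without), 4))
```

With the beta^2 term, sd_ss equals sd_val (0.6) for all three SNPs. Without it, the first two SNPs are unchanged to four decimals, while the SNP with beta = 0.5 drifts to about 0.629.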

jianvhuang commented 2 years ago

Thank you for clarifying.