privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Summary statistics comparison #335

Closed rameez500 closed 2 years ago

rameez500 commented 2 years ago

Hi Florian,

I have two sets of summary statistics (Aud and Audit_C) and one target dataset. I'm trying to follow the quality control steps for summary statistics in https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-sumstats.R. I calculated the standard deviations for the summary statistics and the target dataset as shown below.

    sd <- sqrt(big_colstats(G, ind.val, ncores = NCORES)$var)
    sd_val <- sd[info_snp$`_NUM_ID_`]
    sd_ss <- with(info_snp, 1 / sqrt(n_eff / 4 * beta_se^2))

I noticed that the standard deviations derived from the Aud and Audit_C summary statistics are almost perfectly correlated with each other (correlation close to 1), as shown in Figure 1. Figures 2 and 3 show the standard deviations from the summary statistics against those from the target dataset. There are a lot of purple dots for Audit_C, even though the correlation between the standard deviations of the two sets of summary statistics is about 1. Do you think we should still remove the SNPs shown as purple dots for Audit_C?
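For context, here is a minimal sketch of how dots of this kind can end up flagged. It uses simulated stand-ins for `sd_val` and `sd_ss` rather than real data, and the thresholds in `is_bad` are assumptions modeled on the kind of rule used in LDpred2 quality control; they may differ from what the linked script applies. It shows that a constant scaling between the two axes keeps the correlation near 1 while still tripping the thresholds for most SNPs.

```r
set.seed(1)

# Hypothetical stand-ins for the vectors computed above (not real data)
n_snp  <- 1000
sd_val <- runif(n_snp, min = 0.1, max = 0.7)         # sd from the target genotypes
sd_ss  <- 0.45 * sd_val + rnorm(n_snp, sd = 0.005)   # sd implied by sumstats, off by a constant factor

# Flagging rule; the thresholds are assumptions, adjust them to the script you follow
is_bad <- sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) |
  sd_ss < 0.1 | sd_val < 0.05

# Diagnostics: share of flagged SNPs, correlation, and slope through the origin
frac_bad <- mean(is_bad)
r        <- cor(sd_val, sd_ss)
slope    <- unname(coef(lm(sd_ss ~ 0 + sd_val)))

cat(sprintf("flagged: %.0f%%, correlation: %.3f, slope: %.2f\n",
            100 * frac_bad, r, slope))
```

A slope far from 1 combined with a tight, near-linear cloud points to a scaling issue in the formula for `sd_ss` rather than to individual bad SNPs.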

Figure 1:

image

Figure 2: Alcohol use disorder (AUD)

image

Figure 3: Alcohol consumption (AUDIT-C)

image

Thank you so much

privefl commented 2 years ago

No, it is just a problem of calibration. It happens:

privefl commented 2 years ago

Any update on this?

rameez500 commented 2 years ago

Hi Florian,

Thanks for getting back to me.

I am a little confused about the phenotype. I didn't use any phenotype data to generate the plot of the standard deviation in the validation set vs. the standard deviation in the summary statistics. I only used the genotype data (line 59 of https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-sumstats.R) and the summary statistics (line 60). My question is whether the phenotype data plays any role in the quality control of summary statistics.

Thank you so much for your help.

privefl commented 2 years ago

If the GWAS summary statistics were derived from a linear regression, you need to account in the formula for the sd of the phenotype that was used in the GWAS. Please look at the corresponding section in the LDpred2 paper.
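As an illustration of why the phenotype sd matters, here is a minimal sketch on simulated data, using base R only. It assumes the standard identity for simple linear regression, sd(G) ≈ sd(y) / sqrt(n * beta_se^2 + beta^2); all object names (`G`, `y`, `gwas`, `sd_ss_naive`, `sd_ss_scaled`) are hypothetical, and in a real analysis sd(y) of the GWAS sample is usually not available directly and has to be taken from the study or estimated.

```r
set.seed(42)

# Simulated genotypes (n individuals x m SNPs) and a continuous phenotype with sd ~ 3
n   <- 2000
m   <- 300
maf <- runif(m, min = 0.05, max = 0.5)
G   <- sapply(maf, function(p) rbinom(n, size = 2, prob = p))
y   <- rnorm(n, mean = 10, sd = 3)

# Per-SNP simple linear regression, keeping the effect size and its standard error
gwas <- t(apply(G, 2, function(g) {
  summary(lm(y ~ g))$coefficients["g", c("Estimate", "Std. Error")]
}))
beta    <- gwas[, "Estimate"]
beta_se <- gwas[, "Std. Error"]

# Genotype sd computed directly from the genotypes
sd_val <- apply(G, 2, sd)

# Genotype sd implied by the summary statistics
sd_ss_naive  <- 1     / sqrt(n * beta_se^2 + beta^2)  # ignores sd(y), i.e. assumes sd(y) = 1
sd_ss_scaled <- sd(y) / sqrt(n * beta_se^2 + beta^2)  # accounts for sd(y)

# Slopes through the origin: close to 1 only when sd(y) is accounted for
slope_naive  <- unname(coef(lm(sd_ss_naive  ~ 0 + sd_val)))
slope_scaled <- unname(coef(lm(sd_ss_scaled ~ 0 + sd_val)))

cat(sprintf("slope without sd(y): %.2f, slope with sd(y): %.2f\n",
            slope_naive, slope_scaled))
```

In this simulation both versions of `sd_ss` are almost perfectly correlated with `sd_val`; only the scale differs, which is what a calibration problem looks like in this kind of plot.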

privefl commented 2 years ago

Any update on this?