privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

GWAS SNPs failing QC due to large SD_SS > 1 in case-control GWAS #252

sritchie73 closed this issue 3 years ago

sritchie73 commented 3 years ago

I've solved this problem, but wanted to document it here in case others encounter the same issue.

Problem: Nearly all of the SNPs in one of the case/control GWASs I was looking at were failing the recommended QC, and the computed SD_SS values were extremely large (the majority between 1 and 5, whereas values between 0 and 1 are expected).

The GWAS in question was for Smoking Initiation (case/control) status in Biobank Japan (https://www.nature.com/articles/s41562-019-0557-y).

Root cause and solution: The case/control GWAS was run using BOLT-LMM, which fits a linear model even for binary traits. The BETA and SE estimates in the published summary statistics therefore needed to be rescaled by the observed case fraction (logOR = BETA / (u * (1 - u)), logOR_SE = SE / (u * (1 - u)), where u = cases / total samples), as recommended by the BOLT-LMM manual: https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html#x1-5300010

After adjusting the SE and BETA as described above, the SD_SS estimates were sensible and the majority of SNPs passed QC.
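A minimal R sketch of the rescaling described above. The helper name `bolt_to_log_or`, the column names, and all numbers are made up for illustration; they are not part of BOLT-LMM or bigsnpr.

```r
# Hypothetical helper (not from BOLT-LMM or bigsnpr): rescale linear-model
# effect sizes for a binary trait to the log odds ratio scale.
bolt_to_log_or <- function(beta, se, n_cases, n_total) {
  u <- n_cases / n_total      # observed case fraction
  scale <- u * (1 - u)        # same divisor for BETA and SE
  data.frame(log_or = beta / scale, log_or_se = se / scale)
}

# Made-up summary statistics, for illustration only
sumstats <- data.frame(BETA = c(0.010, -0.004), SE = c(0.002, 0.003))
converted <- bolt_to_log_or(sumstats$BETA, sumstats$SE,
                            n_cases = 25000, n_total = 100000)
print(converted)
```

Because BETA and SE are divided by the same factor, the z-scores (and hence the p-values) are unchanged; only the scale of the effect sizes moves.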

privefl commented 3 years ago

For linear regression, you need to use the total sample size (the number of non-missing values reported by BOLT), and also replace the 2 in the numerator by sd(y) = sqrt(u * (1 - u)).

Does this solve the problem too?
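As a rough numerical check of why the two routes should agree, here is a small R sketch with made-up numbers. It assumes sd(y) is the standard deviation of a 0/1 phenotype, sqrt(u * (1 - u)), and that the logistic-regression version is sd_ss = 2 / (se * sqrt(n_eff)) with n_eff = 4 / (1 / n_cases + 1 / n_controls), which is my understanding of the formula referred to above.

```r
# Made-up numbers, for illustration only
n_total <- 100000
n_cases <- 25000
u <- n_cases / n_total
se_lin <- c(0.002, 0.003)   # SE on the linear (BOLT-LMM) scale

# Route 1: linear-regression formula with total N and sd(y) = sqrt(u * (1 - u))
sd_ss_linear <- sqrt(u * (1 - u)) / (se_lin * sqrt(n_total))

# Route 2: convert SE to the log-odds scale, then use the logistic formula
# with the effective sample size (assumed form, see lead-in)
se_log_or <- se_lin / (u * (1 - u))
n_eff <- 4 / (1 / n_cases + 1 / (n_total - n_cases))
sd_ss_logistic <- 2 / (se_log_or * sqrt(n_eff))

print(all.equal(sd_ss_linear, sd_ss_logistic))
```

Algebraically, n_eff = 4 * N * u * (1 - u), so the factor of 2 and the rescaled SE cancel down to the same expression as route 1.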

sritchie73 commented 3 years ago

I didn't see the same problem with the GWAS summary stats for continuous traits from the same paper (e.g. Cigarettes per day).

For continuous traits I have been using:

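    # dosages assumed to be coded 0-2 (copies of allele A1), hence the factor of 2
    a1freq = sum(dosages, na.rm = TRUE) / (2 * sum(!is.na(dosages)))
    sd_val = sqrt(2 * a1freq * (1 - a1freq))
    # sd_y is the median across all SNPs; sample_size is the GWAS sample size
    sd_y = median(sd_val * beta_se * sqrt(sample_size))
    sd_ss = sd_y / (beta_se * sqrt(sample_size))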

sritchie73 commented 3 years ago

My take on the issue is that:

(1) When downloading summary statistics, the "beta" column for a case/control GWAS normally corresponds to the log odds ratio.
(2) This is not the case for summary statistics output by BOLT-LMM, and this isn't made obvious (e.g. in the GWAS Catalog).
(3) If you see the issue above (very large sd_ss, well above 1), this is likely what has happened.
(4) The solution is to convert the beta and its standard error to the log odds scale using the formula in the BOLT-LMM documentation.