rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

GWAS results look strange #153

Closed yystat closed 3 years ago

yystat commented 3 years ago

Hello,

I'm using regenie (v2.2) on a binary trait (about 3K cases vs. 12K controls) and I got a very strange result; see the attached Manhattan plot below. All the SNPs on chromosomes 3, 12, and 20 appear to have p-values larger than a fixed value (which seems to be 0.05). I also checked the number of SNPs, and these chromosomes do not seem to have an unusually small number of SNPs.

Note that I also ran the case-control GWAS using a subset of the cases (several hundred cases vs. about 12K controls), and the results look normal.

Thank you very much for your help!

Here is the number of SNPs on each chromosome:

1       2       3       4       5       6       7       8       9      10 
1060118 1168594  725325  984940  904579  868231  796670  761209  579537  673011 
     11      12      13      14      15      16      17      18      19      20 
 666381  461090  488992  434951  371772  409460  343748  372258  282693  203172 
     21      22 
 160312  163990

Here is the distribution of the P values from chromosomes 3, 12, and 20. Indeed the SNPs have a minimum P value of 0.05:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.0500  0.3147  0.5394  0.5351  0.7577  1.0000
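The same check can be made per chromosome directly on the step 2 output. The following is a minimal sketch, assuming a space-delimited results file whose header includes `CHROM` and `LOG10P` columns (the names regenie is expected to write in step 2); the function name `max_log10p_by_chrom` is hypothetical.

```shell
#!/usr/bin/env bash
# Print, for each chromosome: chromosome, number of SNPs, largest -log10(P),
# and the corresponding smallest P value.
# Assumption: input is whitespace-delimited with a header row that contains
# columns named CHROM and LOG10P.
max_log10p_by_chrom() {
  awk '
    NR == 1 {
      # Locate columns by header name instead of relying on their position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("CHROM" in col) || !("LOG10P" in col)) {
        print "missing CHROM or LOG10P column" > "/dev/stderr"
        exit 1
      }
      next
    }
    {
      c = $col["CHROM"]
      v = $col["LOG10P"] + 0
      n[c]++
      if (!(c in best) || v > best[c]) best[c] = v
    }
    END {
      # 10^(-x) written with exp/log for portability across awk versions.
      for (c in best)
        printf "%s %d %.4f %.3g\n", c, n[c], best[c], exp(-best[c] * log(10))
    }
  ' "$@" | sort -n
}
```

With gzipped output (the `--gz` flag is used above), the file can be streamed in, for example `gzip -dc results.regenie.gz | max_log10p_by_chrom`, where `results.regenie.gz` stands in for the actual output file. A chromosome whose smallest P value sits exactly at 0.05 (largest -log10(P) near 1.30) would reproduce the pattern described here.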

Here is my code:

# step 1 on genotyped data with "good" SNPs:
/Software/regenie/v2.2/regenie  \
  --step 1 \
  --bt \
  --gz \
  --cc12 \
  --loocv \
  --bed QCed_combined \
  --extract qc_pass.snplist.txt \
  --keep FID_IID_Pheno_Sex_20PC.txt \
  --phenoFile FID_IID_Pheno_Sex_20PC.txt \
  --phenoColList Pheno \
  --covarFile FID_IID_Pheno_Sex_20PC.txt \
  --covarColList Sex,PC{1:20} \
  --catCovarList Sex \
  --bsize 1000 \
  --lowmem \
  --lowmem-prefix tmp_regenie_step1 \
  --out results/step1

# step2 on imputed data:
/Software/regenie/v2.2/regenie \
  --step 2 \
  --bt \
  --cc12 \
  --gz \
  --threads 22 \
  --bsize 400 \
  --bed imputed_qced \
  --keep FID_IID_Pheno_Sex_20PC.txt \
  --phenoFile FID_IID_Pheno_Sex_20PC.txt \
  --phenoColList Pheno \
  --covarFile FID_IID_Pheno_Sex_20PC.txt \
  --covarColList Sex,PC{1:20} \
  --catCovarList Sex \
  --pThresh 0.05 \
  --firth --approx \
  --pred results/step1_pred.list \
  --out results/step2_logistic


joellembatchou commented 3 years ago

Hi,

Could you test the odd-looking chromosomes (3, 12, and 20) with v2.0.2 and check whether you observe the same behavior? You can use --chrList 3,12,20 to restrict the analysis to these chromosomes.
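A rerun along those lines might look like the sketch below. It reuses the step 2 options from the original post and adds `--chrList 3,12,20`. The `REGENIE` variable (the path to whichever regenie build is being tested) and the output prefix `results/step2_logistic_chr3_12_20` are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
# Step 2 restricted to chromosomes 3, 12, and 20.
# Assumption: REGENIE points at the regenie binary to test; it defaults to
# whatever "regenie" is found on PATH.
REGENIE="${REGENIE:-regenie}"

args=(
  --step 2
  --bt
  --cc12
  --gz
  --threads 22
  --bsize 400
  --bed imputed_qced
  --keep FID_IID_Pheno_Sex_20PC.txt
  --phenoFile FID_IID_Pheno_Sex_20PC.txt
  --phenoColList Pheno
  --covarFile FID_IID_Pheno_Sex_20PC.txt
  --covarColList 'Sex,PC{1:20}'
  --catCovarList Sex
  --pThresh 0.05
  --firth --approx
  --pred results/step1_pred.list
  --chrList 3,12,20
  # Hypothetical output prefix, chosen so the full-genome results are not overwritten.
  --out results/step2_logistic_chr3_12_20
)

if command -v "$REGENIE" >/dev/null 2>&1; then
  "$REGENIE" "${args[@]}"
else
  # Binary not found: print the command that would have been run.
  printf '%s ' "$REGENIE" "${args[@]}"
  echo
fi
```

Keeping the options in an array makes it easy to run the identical command against two versions by changing only `REGENIE`.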

yystat commented 3 years ago

Thanks, I tried the newer version (2.2.2) and now the results look normal.

I really like Regenie since it's so fast. Empirically, though, it seems to be less powerful than BOLT-LMM for continuous traits; is that right? Is there a way to boost the power? For binary traits, I compared Regenie with Plink, and Regenie also seems to be less powerful at the known causal loci.

Thank you for your comments in advance.

joellembatchou commented 3 years ago

The model used in Step 1 of Regenie is closely related to the infinitesimal model that many GWAS methods use to capture genome-wide polygenic effects. It can therefore have lower power for traits with a sparser genetic architecture, for which a model that accounts for sparsity (such as the one in BOLT-LMM) would do better. This is something we are looking into. When comparing to PLINK, did you subset to unrelated samples and use PCs in the model? PLINK does not have a model to capture relatedness, which can lead to inflation if unaccounted for.
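For readers unfamiliar with the distinction, the contrast between an infinitesimal and a sparser model can be written out as below. This is a textbook-style sketch, not the exact likelihood implemented in either program; the symbols ($M$ markers, covariates $X$, genotypes $g_j$, variances $\sigma_g^2$, $\sigma_e^2$, $\sigma_1^2$, $\sigma_2^2$, and mixing proportion $p$) are introduced here for illustration.

```latex
% Infinitesimal model: every one of the M markers carries a small Gaussian effect.
% Ridge regression corresponds to the posterior mean of beta under this prior.
y = X\alpha + \sum_{j=1}^{M} g_j \beta_j + \varepsilon,
\qquad \beta_j \sim \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{M}\right),
\qquad \varepsilon \sim \mathcal{N}\!\left(0, \sigma_e^2 I\right)

% Sparser alternative: a mixture-of-Gaussians prior lets a fraction p of markers
% have large effects while the rest stay close to zero.
\beta_j \sim p\,\mathcal{N}\!\left(0, \sigma_1^2\right) + (1 - p)\,\mathcal{N}\!\left(0, \sigma_2^2\right),
\qquad \sigma_1^2 > \sigma_2^2
```

When a trait is driven by a few large-effect loci, the single-variance prior shrinks those effects too strongly, so the polygenic predictor explains less of the phenotype and the association test gains less power.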

yystat commented 3 years ago

Thanks, that makes sense.

For binary traits, all my samples are unrelated, and I included PCs as covariates for both Plink and Regenie. Since Regenie can account for relatedness without PCs, I think adding PCs may over-adjust for population stratification and reduce the power? Plink uses PCs only, whereas Regenie uses ridge regression plus PCs, which may make the comparison unfair.

joellembatchou commented 3 years ago

There were bugs in v2.2 through v2.2.3 affecting binary traits, which were fixed in v2.2.4. That should explain the strange results observed with v2.2. It is standard practice in GWAS to include PCs as covariates, and we recommend that for Regenie as well. How different are the p-values between the two approaches (for example, do they differ by many orders of magnitude)?
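One way to answer that question is to line up the two result files by SNP ID and look at the difference in -log10(P). The sketch below assumes the regenie file has `ID` and `LOG10P` header columns and that the other tool's file has a header naming its ID and P-value columns; those names are passed as arguments because they vary between tools and versions. The function name `compare_log10p` is hypothetical.

```shell
#!/usr/bin/env bash
# Usage: compare_log10p REGENIE_FILE OTHER_FILE [OTHER_ID_COLUMN] [OTHER_P_COLUMN]
# Prints: ID, regenie -log10(P), other tool's -log10(P), and their difference.
# Assumptions: both files are whitespace-delimited with a header row, and the
# regenie file has columns named ID and LOG10P.
compare_log10p() {
  awk -v idcol="${3:-ID}" -v pcol="${4:-P}" '
    FNR == 1 {
      # Rebuild the header lookup at the start of each file.
      split("", col)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    NR == FNR {
      # First file: remember the regenie -log10(P) for every SNP.
      reg[$col["ID"]] = $col["LOG10P"] + 0
      next
    }
    {
      # Second file: convert P to -log10(P); skip SNPs with missing or zero P.
      id = $col[idcol]
      p = $col[pcol] + 0
      if ((id in reg) && p > 0) {
        other = -log(p) / log(10)
        printf "%s %.3f %.3f %.3f\n", id, reg[id], other, reg[id] - other
      }
    }
  ' "$1" "$2"
}
```

A difference of 1 in the last column means the two p-values differ by one order of magnitude, so sorting on that column at the known causal loci shows how far apart the two analyses are.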