rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

majority of tests ignored due to low MAC #550

Open Xuemin-Wang opened 2 months ago

Xuemin-Wang commented 2 months ago

Dear REGENIE developers,

I'm using REGENIE v3.5.gz on 237 cases and 387 controls. The genotypes were jointly called from WGS data and comprise 50,811,891 variants. The log reports that 44,639,814 variants were dropped due to low MAC (default 5; "Number of ignored tests due to low MAC : 44639814"), leaving 6,172,077 variants in the output file. To check whether that many variants really had a MAC < 5, I first prefiltered the variants with plink2 as below and found that 18,879,304 variants had a MAC >= 5.

plink2 \
  --pfile ../bgen/final \
  --keep ../qcfiles/eur_samples_to_keep.txt \
  --set-missing-var-ids @:# \
  --mac 5 \
  --make-pgen \
  --out ../bgen/final624.mac5

I re-ran regenie using the prefiltered genotypes (../bgen/final624.mac5) and found that about two thirds of the variants were still dropped, leaving 6,085,552 variants in the output file. The ignored variants had a MAC of 5 or above and shouldn't have been dropped from the test.
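As a quick sanity check on the counts quoted above, the arithmetic can be reproduced in a few lines of shell. This is only a sketch: the variable names and the `pct` helper are made up for illustration, and the numbers are copied from this post.

```shell
#!/usr/bin/env bash
# Counts copied from the post; variable names are illustrative only.
total_variants=50811891   # variants in the jointly called WGS data
ignored_run1=44639814     # "Number of ignored tests due to low MAC" (first run)
prefiltered=18879304      # variants kept by plink2 --mac 5
tested_run2=6085552       # variants in the output of the second run

# Variants actually tested in the first run.
tested_run1=$(( total_variants - ignored_run1 ))
# Variants dropped in the second run despite the plink2 prefilter.
dropped_run2=$(( prefiltered - tested_run2 ))

# Hypothetical helper: percentage of a over b, one decimal place.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", 100 * a / b }'; }

echo "run 1: tested ${tested_run1} of ${total_variants}"
echo "run 2: dropped ${dropped_run2} of ${prefiltered} ($(pct "$dropped_run2" "$prefiltered")%)"
```

The first difference matches the 6,172,077 variants reported in the output file, and the second run drops roughly two thirds of the prefiltered variants, as described.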

The autosomal variants used in step 1 were LD pruned and filtered with --maf 0.02 --hwe 1e-6 in plink2. Variants in the MHC region were excluded from the step 1 prediction, as shown below.

pruning to remove highly correlated SNPs

plink2 \
  --pfile ../bgen/final \
  --exclude range ../qcfiles/mhc_range.txt \
  --keep ../qcfiles/eur_samples_to_keep.txt \
  --set-missing-var-ids @:# \
  --rm-dup exclude-all \
  --maf 0.02 \
  --hwe 1e-6 \
  --indep-pairwise 500 50 0.1 \
  --out ../bgen/final.step1

generate regenie step 1 data

plink2 \
  --pfile ../bgen/final \
  --chr 1-22 \
  --keep ../qcfiles/eur_samples_to_keep.txt \
  --extract ../bgen/final.step1.prune.in \
  --set-missing-var-ids @:# \
  --make-pgen \
  --out ../bgen/final.step1.variants

Here's the script to run step 1.

regenie \
  --step 1 \
  --pgen ../bgen/final.step1.variants \
  --phenoFile ../pheno.txt \
  --covarFile ../covariates.txt \
  --covarColList SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
  --bsize 1000 \
  --lowmem \
  --lowmem-prefix $TMPDIR/regenie_tmp_preds_all \
  --bt \
  --write-null-firth \
  --out ../regenie/step_1_out/step_1

And here are the first few lines of the step 2 log file.

          |===========================|
          |      REGENIE v3.5.gz      |
          |===========================|

Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Using Intel MKL with Eigen.

Log of output saved in file : ../regenie/res/step2_final_ADD.log

Options in effect:
  --step 2 \
  --pgen ../bgen/final \
  --minMAC 1 \
  --test additive \
  --bsize 1000 \
  --phenoFile ../pheno.txt \
  --covarFile ../covariates.txt \
  --covarColList SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
  --pred ../regenie/step_1_out/step_1_pred.list \
  --bt \
  --firth \
  --approx \
  --firth-se \
  --use-null-firth ../regenie/step_1_out/step_1_firth.list \
  --af-cc \
  --gz \
  --out ../regenie/res/step2_final_ADD

Association testing mode with fast multithreading using OpenMP

Would you be able to help me out? Please let me know if other info is required for debugging.

Many thanks, Patrick

Xuemin-Wang commented 2 months ago

I checked the missingness of samples and variants. Per-sample missing genotype rates ranged from 1.397% to 5.215%, and 16,543,401 of the 18,879,304 variants with a MAC >= 5 had a missing genotype rate < 1%.

joellembatchou commented 1 month ago

Hi,

Could you please run step 2 with --write-samples to get the list of 624 sample IDs used in the analysis, and then pass that file to PLINK when applying the MAC 5 filter on the "../bgen/final" PGEN fileset?

Cheers, Joelle
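
For anyone following along, the suggested check might look roughly like the sketch below. It reuses the step 2 options from the log above and only adds --write-samples. Several things here are assumptions rather than anything confirmed in this thread: the phenotype name `PHENO` is a placeholder for the column in ../pheno.txt, the sample list is assumed to be written as `<out>_<phenotype>.regenie.ids` (check the step 2 log for the actual file name), the output prefix `final.regenie_samples.mac5` is made up, and the `run` wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only. Commands are printed by default; set DRY_RUN=0 to execute them
# (this requires regenie and plink2 on PATH).
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

PHENO="PHENO"                          # placeholder: phenotype column name in ../pheno.txt
OUT="../regenie/res/step2_final_ADD"   # step 2 output prefix from the log
IDS="${OUT}_${PHENO}.regenie.ids"      # assumed name of the file written by --write-samples

# 1) Re-run step 2 with the original options plus --write-samples.
run regenie \
  --step 2 \
  --pgen ../bgen/final \
  --minMAC 1 \
  --test additive \
  --bsize 1000 \
  --phenoFile ../pheno.txt \
  --covarFile ../covariates.txt \
  --covarColList SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
  --pred ../regenie/step_1_out/step_1_pred.list \
  --bt --firth --approx --firth-se \
  --use-null-firth ../regenie/step_1_out/step_1_firth.list \
  --af-cc --gz \
  --write-samples \
  --out "$OUT"

# 2) Apply the MAC >= 5 filter to exactly the samples regenie analysed.
run plink2 \
  --pfile ../bgen/final \
  --keep "$IDS" \
  --set-missing-var-ids '@:#' \
  --mac 5 \
  --make-pgen \
  --out ../bgen/final.regenie_samples.mac5
```

The variant count in the plink2 log from the second command can then be compared with the number of variants regenie actually tested.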