rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Does REGENIE step 1 require independent SNPs? #497

Closed cs16436 closed 7 months ago

cs16436 commented 9 months ago

Hi,

Could I check whether REGENIE step 1 requires independent SNPs? That is, does the genotype data need to be LD-pruned during cleaning before running step 1?

Thank you in advance!

Ojami commented 7 months ago

In theory, since REGENIE step 1 relies on ridge regression, it can handle variants in LD. However, keeping highly correlated variants adds unnecessary computation time, so it's recommended to LD-prune your variants before feeding them to REGENIE step 1.
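To illustrate why ridge regression tolerates variants in LD, here is a toy sketch in plain Python. It is not REGENIE code: the two-predictor solver, the genotype vectors, and the penalty value are all made up for illustration. With two identical (perfectly correlated) predictors, the unpenalised normal equations are singular, while adding a ridge penalty makes them solvable and splits the effect evenly between the two SNPs.

```python
# Toy illustration (not REGENIE code): ridge regression with two
# perfectly correlated predictors, solved via the 2x2 normal equations.

def ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two predictors, no intercept."""
    a = sum(u * u for u in x1) + lam          # X'X[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # X'X[0][1]
    d = sum(v * v for v in x2) + lam          # X'X[1][1] + lam
    r1 = sum(u * t for u, t in zip(x1, y))    # X'y[0]
    r2 = sum(v * t for v, t in zip(x2, y))    # X'y[1]
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ZeroDivisionError("normal equations are singular")
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)

# Hypothetical data: two SNPs in perfect LD (identical centred genotypes).
snp_a = [-1.0, 0.0, 1.0, 0.0]
snp_b = list(snp_a)
pheno = [-2.0, 0.0, 2.0, 0.0]

try:
    ridge_2d(snp_a, snp_b, pheno, lam=0.0)    # no penalty: singular system
    ols_ok = True
except ZeroDivisionError:
    ols_ok = False

beta = ridge_2d(snp_a, snp_b, pheno, lam=1.0)  # penalty makes it solvable
print("unpenalised fit solvable:", ols_ok)
print("ridge coefficients:", beta)
```

The penalty keeps the fit well defined, but every redundant SNP is still one more predictor to process, which is the computational cost that pruning avoids.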

From the REGENIE paper:

a minor allele frequency of ≥1%, a Hardy–Weinberg equilibrium test not exceeding P = 1 × 10⁻¹⁵, a genotyping rate above 99%, not present in low-complexity regions, not involved in inter-chromosomal LD and LD pruning using a R² threshold of 0.9 with a window size of 1,000 markers and a step size of 100 markers. This resulted in up to 471,762 genotyped SNPs that were kept in the analyses
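As a rough sketch, several of the filters quoted above could be expressed as a single PLINK 2 call. The Python helper below only builds and prints the command; it does not run anything. The file prefixes `my_genotypes` and `step1_qc` are hypothetical, and the mapping of "genotyping rate above 99%" to `--geno 0.01` is my reading of the quote. The low-complexity-region and inter-chromosomal LD exclusions are not covered here and would need separate variant lists.

```python
import shlex

def plink2_step1_qc_command(bfile, out_prefix):
    """Build a plink2 argument list for the QC + LD-pruning filters.

    bfile and out_prefix are hypothetical file prefixes chosen by the caller.
    """
    return [
        "plink2",
        "--bfile", bfile,
        "--maf", "0.01",      # minor allele frequency >= 1%
        "--hwe", "1e-15",     # drop variants with HWE P below 1e-15
        "--geno", "0.01",     # at most 1% missing calls per variant
        "--indep-pairwise", "1000", "100", "0.9",  # window, step, r^2 threshold
        "--out", out_prefix,
    ]

cmd = plink2_step1_qc_command("my_genotypes", "step1_qc")
cmd_line = shlex.join(cmd)
print(cmd_line)
```

With `--indep-pairwise`, PLINK writes the retained variant IDs to `<out_prefix>.prune.in`; that list could then be passed to REGENIE step 1 through its `--extract` option.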

Also, from the FAQ in the REGENIE documentation:

How many variants to use in step 1? We recommend to use a smaller set of about 500K directly genotyped SNPs in step 1, which should be sufficient to capture genome-wide polygenic effects. Note that using too many SNPs in Step 1 (e.g. >1M) can lead to a high computational burden due to the resulting higher number of predictors in the level 1 models.

Hope this helps,
Oveis

cs16436 commented 7 months ago

This is very helpful - thank you so much, Oveis.