rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Why does regenie's step 1 require a single genotype file as input? #39

Closed yingxi-kaylee closed 3 years ago

yingxi-kaylee commented 3 years ago

Hi @joellembatchou @jonathanmarchini

I really like REGENIE's speed; it's so fast! I have one question about step 1, though. I'm looking at the UKBB analysis section at https://rgcgithub.github.io/regenie/recommendations/

Why do you merge the genotypes across chromosomes before running step 1? Merging all the genotypes together requires a lot of time and memory... Say I want to run a GWAS on the UKBB data: can I run step 1 separately 22 times, once per chromosome? Would that change the results compared to running step 1 just once, as you recommend?

Thanks a lot!

joellembatchou commented 3 years ago

Hi,

REGENIE only takes a single genetic file as input, which is why you would need to merge files that are split by chromosome. You could first apply QC filters (e.g. MAC/MAF) to get a smaller list of variants, generate a smaller genetic file for each chromosome, and then combine the chromosomes to get the file for step 1 (you would only need a few hundred thousand variants for that step).
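The filter-then-merge workflow described above could be sketched roughly as follows. This is only an illustration under several assumptions: it assumes PLINK 2 (`plink2`) is used for the per-chromosome filtering and PLINK 1.9 (`plink`) for the merge, the file name patterns are hypothetical, and the MAF/MAC thresholds are placeholders rather than recommended values. The script only builds the command lines and does not run them.

```python
from typing import List

AUTOSOMES = list(range(1, 23))  # chromosomes 1..22


def qc_command(chrom: int, in_pattern: str, out_pattern: str,
               maf: float = 0.01, mac: int = 100) -> List[str]:
    """PLINK 2 command writing a MAF/MAC-filtered fileset for one chromosome.

    in_pattern / out_pattern are hypothetical prefixes containing '{chrom}';
    the default thresholds are placeholders, not recommendations.
    """
    return [
        "plink2",
        "--bfile", in_pattern.format(chrom=chrom),
        "--maf", str(maf),
        "--mac", str(mac),
        "--make-bed",
        "--out", out_pattern.format(chrom=chrom),
    ]


def merge_list_lines(chroms: List[int], out_pattern: str) -> List[str]:
    """Fileset prefixes for PLINK 1.9 --merge-list, one per line.

    The first chromosome is omitted because it is passed via --bfile.
    """
    return [out_pattern.format(chrom=c) for c in chroms[1:]]


def merge_command(chroms: List[int], out_pattern: str,
                  merge_list_path: str, merged_prefix: str) -> List[str]:
    """PLINK 1.9 command combining the filtered filesets into one file."""
    return [
        "plink",
        "--bfile", out_pattern.format(chrom=chroms[0]),
        "--merge-list", merge_list_path,
        "--make-bed",
        "--out", merged_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical input and output prefixes, for illustration only.
    in_pat, out_pat = "ukb_chr{chrom}", "qc_chr{chrom}"
    for c in AUTOSOMES:
        print(" ".join(qc_command(c, in_pat, out_pat)))
    # These lines would be saved to merge_list.txt before running the merge.
    print("\n".join(merge_list_lines(AUTOSOMES, out_pat)))
    print(" ".join(merge_command(AUTOSOMES, out_pat,
                                 "merge_list.txt", "step1_all_chrs")))
```

Filtering before merging keeps the merge itself small, since only the reduced per-chromosome files are combined; the merged prefix (here the hypothetical `step1_all_chrs`) would then be the single genetic file given to step 1.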

We have not yet tried running step 1 on one chromosome at a time, but it is something we plan to explore (note that you would need to do some manipulation to get the LOCO predictors needed for step 2 into the right format).

Cheers, Joelle