rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Any WDL/docker implementation for a parallelised Phase I? #294

Open ggstatgen opened 2 years ago

ggstatgen commented 2 years ago

Hi everyone

This is more of a request for help/pointers than an issue.

Has anyone tried dockerizing or parallelising Regenie Phase I? I have found some tips in the wiki, but I am not sure how to turn them into a working pipeline (e.g. in WDL). Are there any Git repositories implementing this that I could learn from?

freeseek commented 2 years ago

Hi @ggstatgen, I developed exactly that pipeline last summer; you can find some documentation here and the actual WDL here. It does not have all of regenie's features, but it can run binary and quantitative trait associations. It requires the input data to be provided in VCF format, such as the output of IMPUTE5. It runs step 1 as two jobs, one parallelized across chromosomes and one parallelized across phenotypes, and it runs step 2 parallelized across genome windows of a user-selected size. The user can run both steps together or each step as a separate run. We have run it on large biobanks, including the UK Biobank, and a run can complete overnight.
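
To illustrate the step 2 windowing idea, here is a minimal Python sketch of how a genome could be cut into fixed-size windows, one per parallel job. It is not taken from the WDL itself: the function name `genome_windows` and the toy chromosome lengths are hypothetical, and the real pipeline may define its windows differently.

```python
from typing import Dict, List, Tuple


def genome_windows(
    chrom_lengths: Dict[str, int], window_size: int
) -> List[Tuple[str, int, int]]:
    """Split each chromosome into consecutive 1-based, inclusive windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            # The last window of a chromosome is truncated at its end.
            end = min(start + window_size - 1, length)
            windows.append((chrom, start, end))
            start = end + 1
    return windows


# Hypothetical toy lengths for illustration, not real chromosome sizes.
toy_lengths = {"1": 2500, "2": 1000}
for chrom, start, end in genome_windows(toy_lengths, 1000):
    # Each CHR:START-END string could be handed to one step 2 job.
    print(f"{chrom}:{start}-{end}")
```

Each printed region could then be passed to a separate step 2 task, for instance through regenie's `--range` option if that is how your workflow selects variants; smaller windows give more, shorter jobs at the cost of more scheduling overhead.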

ggstatgen commented 2 years ago

Many thanks! I will look into this for sure. This is very helpful.