rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Running regenie on many phenotypes fails due to standardization? #331

Closed guhanrv closed 2 years ago

guhanrv commented 2 years ago

Hello,

I am interested in running REGENIE on many (on the order of ~20K) quantitative phenotypes. For now, I am just interested in a simple linear regression model (step 2). I am interested in running this as one job, and not as 20K separate jobs (hence my draw toward REGENIE over other algorithms). The code that I have written for this is as follows:

regenie \
        --threads 32 \
        --step 2 \
        --bed genotypes/gsa_merge.no_fid \
        --remove to_remove \
        --bsize 200 \
        --chrList 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 \
        --phenoFile phenos \
        --covarFile covars \
        --covarCol Sex --covarCol Hispanic --covarCol Age --covarCol gsa_batch --covarCol msr_seer_instrument_name --covarCol PC{1:10} \
        --out results/regenie/${pheno_col} \
        --gz \
        --ignore-pred

When I run this, several errors pop up indicating fully empty columns (e.g. ERROR: all individuals have missing/invalid values for phenotype 'S-007-162_A0A075B6S9'.). So I added a --phenoExcludeList parameter that excludes every column that 1) has all values missing or 2) has fewer than 1 unique value.

However, after doing this preprocessing, different errors pop up:

 * no step 1 predictions given. Simple linear regression will be performed
   -residualizing and scaling phenotypes...ERROR: phenotype 'S-229-079_Q9Y6Z7' has sd=0.

When I look at the raw data, this phenotype doesn't have a SD of zero - it has an SD of 0.3. Is something happening when I am inputting all these 20K phenotypes, where some phenotypes are getting "standardized out"? Is there a way to control the standardization process? Thanks in advance.

joellembatchou commented 2 years ago

Hi,

The sd is computed after projecting out the covariates (i.e. taking residuals); the error message indicates that the residualized phenotype "S-229-079_Q9Y6Z7" has sd=0.
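
To see how a phenotype with a non-zero raw SD can end up with a residual SD of zero, here is a minimal Python sketch. It is not regenie's code: the function name `residual_sd`, the toy data, and the choice of an n - p denominator are all assumptions made for illustration. It projects an intercept and the covariate columns out of a phenotype and reports the SD of what is left.

```python
import math


def residual_sd(y, covariates, tol=1e-10):
    """SD of y after projecting out an intercept and the covariate columns.

    Hypothetical helper for illustration only, not regenie's implementation.
    y          : list of floats with no missing values
    covariates : list of columns, each a list of floats as long as y
    """
    n = len(y)
    columns = [[1.0] * n] + [list(c) for c in covariates]  # intercept first

    # Modified Gram-Schmidt: orthonormal basis for the covariate space.
    basis = []
    for col in columns:
        orig_norm = math.sqrt(sum(a * a for a in col))
        v = col[:]
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        # Skip all-zero columns and columns collinear with earlier ones.
        if orig_norm == 0.0 or norm <= tol * orig_norm:
            continue
        basis.append([a / norm for a in v])

    # Residual = y minus its projection onto the covariate space.
    r = list(y)
    for q in basis:
        dot = sum(a * b for a, b in zip(r, q))
        r = [a - dot * b for a, b in zip(r, q)]

    dof = n - len(basis)  # assumed denominator: n minus number of fitted columns
    if dof <= 0:
        return 0.0
    return math.sqrt(sum(a * a for a in r) / dof)


# Toy data: a binary batch covariate and two phenotypes.
batch = [0.0, 0.0, 1.0, 1.0]
pheno_constant_within_batch = [0.1, 0.1, 0.7, 0.7]  # fully explained by batch
pheno_varying_within_batch = [0.1, 0.3, 0.7, 0.5]   # not fully explained

print(residual_sd(pheno_constant_within_batch, [batch]))
print(residual_sd(pheno_varying_within_batch, [batch]))
```

The first phenotype varies across samples, yet it is constant within each batch, so nothing remains once the batch covariate is projected out. Running a check like this on each phenotype against the covariates used in the analysis is one way to find columns to add to the exclude list before the run.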