rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
187 stars 55 forks source link

Rare variants analysis #60

Closed saphir746 closed 3 years ago

saphir746 commented 3 years ago

Hello,

can you advice on the best way to perform rare variants (MAF < 1%) analysis in Regenie? Can this be supported via SPA / Frith correction? So far I have tried the following:

    plink2 --bed ukb_cal_ChrXX_rv.bed \
        --bim ukb_cal_ChrXX_rv.bim \
        --fam ukb_cal_ChrXX_rv.fam \
        --max-maf 0.01 --geno 0.1 --hwe 1e-15 \
        --mind 0.1 \
        --write-snplist --write-samples --no-id-header \
        --out qc_pass
    regenie \
        --step 1 \
        --bed ukb_cal_ChrXX_rv \
        --keep qc_pass2.id \
        --covarFile $inputs_covar \
        --covarColList "Age","Sex","gPC1","gPC2" \
        --phenoFile $inputs_pheno \
        --bt \
        --bsize 500 \
        --lowmem \
        --out ukb_step1_BT_rv

But I get this:

   block [1] : 500 snps  (25286ms) 
     -residualizing and scaling genotypes...!! Uh-oh, SNP rsZXY has low variance (=0).

I am running Regenie v. 1.0.6.7 inside a conda env

joellembatchou commented 3 years ago

Hi,

The error indicates that the variant rsZXY has no minor alleles (or very low minor allele count) in the sample analyzed. You can filter out such variants by specifying a minimum minor allele count filter in the PLINK2 command (e.g. --mac 100). This should avoid the low variance error in step 1.

Also note that we recommend to run step 1 of Regenie on common variants after applying appropriate QC filters as the aim of this step is to capture genome-wide polygenic effects (similar to what is done in LMM methods). Then, using these predictions, you can run step 2 (association testing) on another set of variants (e.g. rare variants).

If you are analyzing binary traits, it has been noted that standard association tests with rare variants can be inflated. To avoid this, we have implemented SPA & Firth tests. They can be specified by options --spa/--firth (see documentation for more details). Note that we also have an approximate Firth test which highly reduces the computational time of the standard Firth test and leads to very similar association results (use --firth --approx).

Cheers, Joelle

saphir746 commented 3 years ago

Hello Joelle,

thanks a lot for the detailed answer. Running step 1 on common variants + normal QC then step 2 on rare variants makes sense. How would you do that in practice though? Would you produce 2 sets of bed / bgen files for each step, with different QC filters?

    plink2 --bgen $bgen 'ref-unknown'\
      --sample $sam \
      --maf 0.01 --geno 0.1 --hwe 1e-15 \
      --mind 0.1 \
      --make-bed \
      --out ukb_cal_step1

    plink2 --bgen $bgen 'ref-unknown'\
      --sample $sam \
      --max-maf 0.01 --geno 0.1 --hwe 1e-15 \
      --mind 0.1 \
      --mac 100 \
      --make-bed \
      --out ukb_cal_step2

    regenie --step 1 \
        --bed ukb_cal_step1 \
        ...
        --out ukb_step1

   regenie --step 2 \
        --bed ukb_cal_step2 \
        ...
        --pred ukb_step1_pred.list \
        --out ukb_step2
joellembatchou commented 3 years ago

Hi,

Regenie can handle files in BGEN or PLINK bed format for step 1/2. Also, there is a --extract/--exclude option that works for step 1 and 2 so when applying the appropriate filters for step 1 and 2 in PLINK2, use option --write-snplist so you can get the list of variants that pass the filters and avoid making a new genotype file.

So based on the example you included above,

plink2 --bgen $bgen 'ref-unknown'\
      --sample $sam \
      --maf 0.01 --geno 0.1 --hwe 1e-15 \
      --mind 0.1 \
      --write-snplist \
      --out ukb_cal_step1

plink2 --bgen $bgen 'ref-unknown'\
      --sample $sam \
      --max-maf 0.01 --geno 0.1 --hwe 1e-15 \
      --mind 0.1 \
      --mac 100 \
      --write-snplist \
      --out ukb_cal_step2

regenie --step 1 \
        --bgen $bgen \
        --extract ukb_cal_step1.snplist \
        ...
        --out ukb_step1

regenie --step 2 \
        --bgen $bgen \
        --extract ukb_cal_step2.snplist \
        ...
        --pred ukb_step1_pred.list \
        --out ukb_step2

Note that for the PLINK2 command for step 1, I'd recommend also adding a --mac filter to ensure there are no variants with too low minor allele counts.

Cheers, Joelle

saphir746 commented 3 years ago

hello Joelle,

Thank you for the suggestions. With the following syntax, it worked:

    regenie --step 1 \
        --bed ukb_cal_ChrXXX \
        --extract ukb_cal_step1.snplist \
          .....
        --out ukb_step1_cont

   regenie --step 2 \
        --bgen $bgen \
        --sample $sam \
        --extract ukb_cal_step2.snplist  \
        --pred ukb_step1_cont_pred.list \
        ....
        --out ukb_step2

(Actually there was a problem with the processing of the bgen file, but thats a separate issue)