rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
173 stars 52 forks source link

Burden testing issue - input variants not in snplist or masks_report? #509

Open nikotzoumas opened 3 months ago

nikotzoumas commented 3 months ago

Hi Joelle,

I hope you're well. I'd be very grateful if you could please help me with a confusing issue in burden testing where variants from certain gene sets are:

I have tried:

Do you please know of situation when Regenie might include in its output variants for GeneX but not GeneY in the following scenario?

Thanks in advance for your time, Nik

The mask file has format:

M1 pathogenic
M2 tolerated
...

The sets file has format:

GeneX,4,109729982,4:109729982:G:C,...4:109731008:G:A,4:109731012:C:A,...
GeneY,10,122453658,10:122453658:G:A,...10:122454728:A:G,10:122454733:G:A,...
...

The annotations file has format (tab-delimited):

4:109731008:G:A GeneX pathogenic
4:109731012:C:A GeneX tolerated
10:122454728:A:G    GeneY pathogenic
10:122454733:G:A    GeneY tolerated
...

The PLINK .bim file has format (tab-delimited):

4   4:109731008:G:A 0   109731008   A   G
4   4:109731012:C:A 0   109731012   A   C
10  10:122454728:A:G    0   122454728   G   A
10  10:122454733:G:A    0   122454733   A   G
...

Log:

Start time: Thu Apr  4 16:54:11 2024

              |===========================|
              |      REGENIE v3.3.gz      |
              |===========================|

Copyright (c) 2020-2023 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Using Intel MKL with Eigen.

Log of output saved in file : test.log

Options in effect:
  --step 2 \
  --pred test_pred.list \
  --bed /mnt/project/gatk_qc_final_recoded \
  --phenoFile /mnt/project/regenie-files/pheno.txt \
  --covarFile /mnt/project/regenie-files/covar.txt \
  --phenoCol outcome \
  --covarColList  \
  --set-list /mnt/project/regenie-files/anno_sets.txt \
  --anno-file /mnt/project/regenie-files/anno_lof.txt \
  --mask-def /mnt/project/regenie-files/mask_lof.txt \
  --build-mask max \
  --bsize 200 \
  --check-burden-files \
  --write-mask-snplist \
  --bt \
  --firth \
  --approx \
  --firth-se \
  --pThresh 0.05 \
  --out test

Association testing mode (joint tests) with fast multithreading using OpenMP
 * bim              : [/mnt/project/wgs.bim] n_snps = 427423
 * fam              : [/mnt/project/wgs.fam] n_samples = 968906
 * bed              : [/mnt/project/wgs.bed]
 * phenotypes       : [/mnt/project/regenie-files/pheno.txt] n_pheno = 1
   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals with no missing data = 927260
 * covariates       : [/mnt/project/regenie-files/covar.txt] n_cov = 0 (+ intercept)
 * number of individuals used in analysis = 927260
 * case-control counts for each trait:
   - 'outcome': 32904 cases and 894356 controls
 * LOCO predictions : [test_pred.list]
   -file [/opt/notebooks/test_1.loco.gz] for phenotype 'outcome'
    + 12900 individuals with missing LOCO predictions will be ignored for the trait
 * annotations      : [/mnt/project/regenie-files/anno_lof.txt] 
   +number of annotations categories = 3
 * masks            : [/mnt/project/regenie-files/mask_lof.txt] n_masks = 2
 * aaf cutoffs      : [ 1 : 0.01 ] + singletons
 * set file         : [/mnt/project/regenie-files/anno_sets.txt] n_sets = 18
WARNING: Detected 18 sets with variants not in genetic data or annotation files.
WARNING: Detected 1 sets with only unknown variants (these are ignored).
     +report on burden input files written to [test_masks_report.txt]
 * # threads        : [35]
 * # tested sets    : [18]
 * max block size   : [200]
 * rule used to build masks : max
 * approximate memory usage : 4GB
 * writing list of variants for each mask in file [test_masks.snplist]
 * using minimum MAC of 5 (masks with lower MAC are ignored)
 * using fast Firth correction for logistic regression p-values less than 0.05
    - using back-correction to compute Firth SE
...
Chromosome 4 [1 sets in total]
   -reading loco predictions for the chromosome...done (1054ms) 
   -fitting null logistic regression on binary phenotypes...done (246ms) 
   -fitting null Firth logistic regression on binary phenotypes...done (100ms) 
 set [8/18] : GeneX - 59 variants...
     -reading in genotypes and building masks...done (362ms) 
     -computing association tests...done (272ms) 
...
Chromosome 10 [2 sets in total]
   -reading loco predictions for the chromosome...done (1450ms) 
   -fitting null logistic regression on binary phenotypes...done (224ms) 
   -fitting null Firth logistic regression on binary phenotypes...done (96ms) 
 set [16/18] : GeneY - 11 variants...
     -reading in genotypes and building masks...done (25ms) 
...

Association results stored separately for each trait in files : 
* [test_outcome.regenie]

Number of tests with Firth correction : 5
Number of failed tests : (0/5)
Number of ignored tests due to low MAC : 0

Elapsed time : 24.047s
End time: Thu Apr  4 16:54:35 2024

.snplist (truncated) no GeneY variants

GeneX.M1.0.01   4:109731008:G:A,...
GeneX.M2.0.01   4:109731012:C:A,...
...

masks_report (truncated): some but not all GeneY variants are in this file

## mask file: [/mnt/project/regenie-files/mask_lof.txt]
## list of unknown annnotations in mask file
->Detected 0 masks with unknown annotations.
->Detected 0 masks with only unknown annotations.
GeneX 4:109729982:G:C,4:109729983:C:T,...
GeneY 10:122453658:G:A,10:122453659:C:T,...
...

I cannot attempt the above on the latest versions of Regenie (v3.4+) as it appears my Jupyter bash environment may not have the necessary dependencies:

./regenie: /lib/x86_64-linux-gnu/libmvec.so.1: version `GLIBC_2.35' not found (required by ./regenie)
./regenie: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./regenie)
./regenie: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./regenie)
nikotzoumas commented 3 months ago

240408 Update: This was probably related to how REGENIE accesses genetic data within a DNAnexus project, as downloading the data within my JupyterLab instance resolved this issue (relevant to UK Biobank users).

Is it also possible that REGENIE is sensitive to internal inconsistencies in use of space or tab-delimitation within annotation or sets files, or the use of asterisks or dashes to denote indels within genotype, annotation, and sets files. I am not sure if this is the case please Joelle?

Perhaps future researchers would find an example annotations file on the software's Documentation to be useful, in addition to the other excellent resources there.

Many thanks, Nik

joellembatchou commented 1 month ago

Hi,

Just to clarify, you only downloaded the same input files locally & re-run the exact same command and obtained results for GeneY which previously was absent from sumstats file?

REGENIE handles space/tab separated input files for gene-based testing (and does not assume it has to be the same format across files). I don't expect for special characters in the variant name to cause an issue unless they don't match with the variant IDs in the input genotype file... The log indicates 11 variants were included in GeneY, is there any mask for that gene in the sumstats file? (it's hard to tell from the truncated log you included)