rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
182 stars 53 forks source link

Trouble with set inclusion/exclusion file format #548

Open cbostwick87 opened 3 weeks ago

cbostwick87 commented 3 weeks ago

Hello, I am trying to perform gene-based testing on UKB data (following the tutorial found here: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/using-regenie-to-generate-variant-masks), but I am having difficulty correctly specifying the set inclusion/exclusion file format (as given by this parameter: --extract-setlist). I am using the following (standard for the UKB) parameters:

      --set-list ukb23158_500k_OQFE.sets.txt.gz \
      --anno-file ukb23158_500k_OQFE.annotations.txt.gz \
      --mask-def regenie_custom.masks \
      --nauto 23 \
      --extract-setlist extract_gene_names.txt \
      --aaf-bins 0.01,0.001

The set list file (ukb23158_500k_OQFE.sets.txt.gz) has this format:

A1BG(ENSG00000121410)   19  58347026    19:58347026:C:G,19:58347026:C:T,19:58347030:C:G...
A1CF(ENSG00000148584)   10  50806736    10:50806736:G:A,10:50806738:G:A,10:50806740:C:T...

The annotation file (ukb23158_500k_OQFE.annotations.txt.gz) looks like this:

1:69134:A:G OR4F5(ENSG00000186092) missense(0/5)
1:69144:C:T OR4F5(ENSG00000186092) synonymous
1:69149:T:A OR4F5(ENSG00000186092) missense(>=1/5)
1:69217:G:A OR4F5(ENSG00000186092) missense(0/5)
1:69224:A:T OR4F5(ENSG00000186092) missense(>=1/5)

and the mask def file (regenie_custom.masks) looks like this:

M1 LoF
M2 LoF,missense(5/5)

My extract set list file (extract_gene_names.txt) is formatted like this:

ACVR2B(ENSG00000114739)
ADAMTS2(ENSG00000087116)
ALCAM(ENSG00000170017)
ANO5(ENSG00000171714)
BMP1(ENSG00000168487)

a single column of set/gene names corresponding to those in the set list file (as stated in the regenie documentation).

When I run regenie step 2 using these settings on DNAnexus, the job fails, and I get an error log that states:

WARNING: Detected 422 sets with variants not in genetic data or annotation files.
WARNING: Detected 18563 sets with only unknown variants (these are ignored).
+report on burden input files written to [assoc.c22_masks_report.txt]
-keeping only specified sets
ERROR: no set left to include in analysis.`

for each of the chromosomes. If I input the --extract-setlist option like this (instead of providing a file): --extract-setlist \"COL1A1(ENSG00000108821)\",\"COL1A2(ENSG00000164692)\" \ the job runs successfully and I get the expected output (association stats for the 2 COL1A genes).

I plan to run regenie with many different set lists, and I would prefer to keep them organized in separate files, instead of having to manually format them in this manner. My question is what am I doing wrong when I provide the extract_gene_names.txt file? I have tried many different formats including providing trailing commas on each line:

ACVR2B(ENSG00000114739),
ADAMTS2(ENSG00000087116),
ALCAM(ENSG00000170017),
ANO5(ENSG00000171714),
BMP1(ENSG00000168487)

including quotes (escaped with \ and also not using an escape \ character ):

"ACVR2B(ENSG00000114739)"
"ADAMTS2(ENSG00000087116)"
"ALCAM(ENSG00000170017)"
"ANO5(ENSG00000171714)"
"BMP1(ENSG00000168487)"

putting all the genes on the same line:

ACVR2B(ENSG00000114739),ADAMTS2(ENSG00000087116),ALCAM(ENSG00000170017),ANO5(ENSG00000171714),BMP1(ENSG00000168487)

as well as combinations of these formats (commas and quotes, quotes and escape \ on one line, quotes without escape \ on one line, etc.), but nothing seems to work. I also ensured that the line endings of the file are in UNIX (LF) format (I encountered an error previously using a file that had Windows (CRLF) end of line sequences.

Here is the complete regenie log in case that is helpful:

|=============================|
| REGENIE v3.1.1.gz |
|=============================|
Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Log of output saved in file : assoc.c22.log
Options in effect:
--step 2 \
--bed ukb23158_c22_b0_v1 \
--out assoc.c22 \
--phenoFile phenotype.txt \
--covarFile covariates.txt \
--bt \
--approx \
--firth-se \
--firth \
--extract step2_c22_snps_qc_pass.snplist \
--phenoCol PROTECTED \
--covarCol SEX \
--af-cc \
--test additive \
--pred step1_pred.list \
--bsize 200 \
--set-list ukb23158_500k_OQFE.sets.txt.gz \
--anno-file ukb23158_500k_OQFE.annotations.txt.gz \
--mask-def regenie_custom.masks \
--nauto 23 \
--extract-setlist extract_gene_names.txt \
--check-burden-files \
--aaf-bins 0.01,0.001 \
--pThresh 0.05 \
--minMAC 3 \
--threads 16 \
--gz
Association testing mode (joint tests) with fast multithreading using OpenMP
* bim : [ukb23158_c22_b0_v1.bim] n_snps = 613853
-keeping variants specified by --extract
-number of variants remaining in the analysis = 283275
* fam : [ukb23158_c22_b0_v1.fam] n_samples = 469835
* bed : [ukb23158_c22_b0_v1.bed]
* phenotypes : [phenotype.txt] n_pheno = 1
-dropping observations with missing values at any of the phenotypes
-number of phenotyped individuals with no missing data = 469728
* covariates : [covariates.txt] n_cov = 1
-number of individuals with covariate data = 469728
* number of individuals used in analysis = 469728
* case-control counts for each trait:
- 'PROTECTED': 158 cases and 469570 controls
* LOCO predictions : [step1_pred.list]
-file [/home/dnanexus/out/out/step1_1.loco.gz] for phenotype 'PROTECTED'
+ 383 individuals with missing LOCO predictions will be ignored for the trait
* annotations : [ukb23158_500k_OQFE.annotations.txt.gz]
+number of annotations categories = 6
* masks : [regenie_custom.masks] n_masks = 4
* aaf cutoffs : [ 2 : 0.001 0.01 ] + singletons
* set file : [ukb23158_500k_OQFE.sets.txt.gz] n_sets = 422
WARNING: Detected 422 sets with variants not in genetic data or annotation files.
WARNING: Detected 18563 sets with only unknown variants (these are ignored).
+report on burden input files written to [assoc.c22_masks_report.txt]
-keeping only specified sets
ERROR: no set left to include in analysis.

Can anyone point out what is wrong with my set list file? I would greatly appreciate any assistance! Thank you very much!

dvg-p4 commented 1 day ago

--extract-setlist extract_gene_names.txt \ should be --extract-sets extract_gene_names.txt \; the former expects a list on the command line, the latter expects a file.