rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
182 stars 53 forks source link

Step 2 result file is empty #392

Closed Antonio-Nappi closed 1 year ago

Antonio-Nappi commented 1 year ago

Hi to everyone, I want to use Regenie to test the association between rare variants and a pheno of interest. Currently I am using UKBB data. To perform the step 1 I have run the following command:

source ~/.bashrc\nconda activate regenie_env\n
regenie \
--step 1 \
--bed ~/calls_chrom_whole_filtered \
--covarFile ~/covs_qced_eur_tab_sep.tsv  \
--phenoFile ~/regenie_pheno.tsv \
--bsize 200 \
--out ~/Regenie_v2/calls_file_pLOF_4.03 \
--iid-only \
--force-step1 \
--lowmem

As genetic data I am using the calls data (I tried with the imputed one but it requires too much memory and time), as reported in the documentation I have performed the preprocess of the data.

The log file of this step is the following:

Start time: Sat Mar  4 15:32:01 2023

              |=============================|
              |      REGENIE v3.2.3.gz      |
              |=============================|

Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Using Intel MKL with Eigen.

Log of output saved in file : ~/Regenie_v2/calls_file_pLOF_4.03.log

Options in effect:
  --step 1 \
  --bed ~/calls_chrom_whole_filtered \
  --covarFile ~/covs_qced_eur_tab_sep.tsv \
  --phenoFile ~/regenie_pheno.tsv \
  --bsize 200 \
  --out ~/Regenie_v2/calls_file_pLOF_4.03 \
  --iid-only \
  --force-step1 \
  --lowmem

Fitting null model
 * bim              : [~/calls_chrom_whole_filtered.bim] n_snps = 584238
 * fam              : [~/calls_chrom_whole_filtered.fam] n_samples = 356725
 * bed              : [~/calls_chrom_whole_filtered.bed]
 * phenotypes       : [~/regenie_pheno.tsv] n_pheno = 1
   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals with no missing data = 355556
 * covariates       : [~/covs_qced_eur_tab_sep.tsv] n_cov = 27
   -number of individuals with covariate data = 356725
 * number of individuals used in analysis = 355556
 * number of observations for each trait:
   - '21001-0.0': 355556 observations
   -residualizing and scaling phenotypes...done (61ms) 
 * # threads        : [95]
 * block size       : [200]
 * # blocks         : [2935] for 584238 variants
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 40GB
 * writing level 0 predictions to disk
   -temporary files will have prefix [~/Regenie_v2/calls_file_pLOF_4.03_l0_Y]
   -approximate disk space needed : 40GB
 * setting memory...done
Chromosome 1
 block [1] : 200 snps  (435ms) 
   -residualizing and scaling genotypes...done (1155ms) 
   -calc working matrices...done (188ms) 
   -calc level 0 ridge...done (855ms) 
 block [2] : 200 snps  (308ms) 
.....

Level 1 ridge...
   -on phenotype 1 (21001-0.0)...done (1692828ms) 

Output
------
phenotype 1 (21001-0.0) : 
  0.01  : Rsq = 0.111594, MSE = 0.909643
  0.25  : Rsq = 0.116216, MSE = 0.884353<- min value
  0.5   : Rsq = 0.116024, MSE = 0.885864
  0.75  : Rsq = 0.115462, MSE = 0.887859
  0.99  : Rsq = 0.114795, MSE = 0.892405
  * making predictions...writing LOCO predictions...done (192614ms) 

List of blup files written to: [~/Regenie_v2/calls_file_pLOF_4.03_pred.list]

Elapsed time : 8248.35s
End time: Sat Mar  4 17:49:29 2023

Once finished I run the step 2 with the following command:

source ~/.bashrc\nconda activate regenie_env\n
    regenie \
      --step 2 \
      --pgen ~/pgen-v2/ukb23159_c4_b0_v1 \
      --covarFile ~/covs_qced_eur_tab_sep.tsv \
      --phenoFile ~/phenos/regenie_pheno.tsv \
      --pred /~/Regenie_v2/calls_file_pLOF_4.03_pred.list \
      --anno-file ~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.regenie.pann \
      --set-list ~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.set_list \
      --mask-def ~/pgen-filtered-maf0.01-eur/mask_definition.txt \
      --bsize 200 \
      --out Regenie_v2/calls_file_pLOF_4.03_step_2 \
      --aaf-bins 0.01,0.005 \
      --debug \
      --minMAC 0.5 \
      --check-burden-files \
      --strict-check-burden \

and the log file is the following:

Start time: Sun Mar  5 16:14:37 2023

              |=============================|
              |      REGENIE v3.2.3.gz      |
              |=============================|

Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Using Intel MKL with Eigen.

Log of output saved in file : Regenie_v2/calls_file_pLOF_4.03_step_2_c4_ref_first.log

Options in effect:
  --step 2 \
  --pgen ~/pgen-v2/ukb23159_c4_b0_v1 \
  --covarFile ~/covs_qced_eur_tab_sep.tsv \
  --phenoFile ~/regenie_pheno.tsv \
  --pred ~/Regenie_v2/calls_file_pLOF_4.03_pred.list \
  --anno-file ~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.regenie.pann \
  --set-list~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.set_list \
  --mask-def ~/pgen-filtered-maf0.01-eur/mask_definition.txt \
  --bsize 200 \
  --out Regenie_v2/calls_file_pLOF_4.03_step_2_c4_ref_first \
  --aaf-bins 0.01,0.005 \
  --debug \
  --minMAC 0.5 \
  --check-burden-files \
  --strict-check-burden \
  --ref-first

Association testing mode (joint tests) with fast multithreading using OpenMP
 * pvar             : [~/pgen-v2/ukb23159_c4_b0_v1.pvar] n_snps = 1088878
 * psam             : [~/pgen-v2/ukb23159_c4_b0_v1.psam] n_samples = 469835
 * pgen             : [~/pgen-v2/ukb23159_c4_b0_v1.pgen] 
 * phenotypes       : [~/regenie_pheno.tsv] n_pheno = 1
   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals with no missing data = 467886
 * covariates       : [~/covs_qced_eur_tab_sep.tsv] n_cov = 27
   -number of individuals with covariate data = 343487
 * number of individuals used in analysis = 342365
 * number of observations for each trait:
   - '21001-0.0': 342365 observations
 * LOCO predictions : [~/Regenie_v2/calls_file_pLOF_4.03_pred.list]
   -file [~/Regenie_v2/calls_file_pLOF_4.03_1.loco] for phenotype '21001-0.0'
   -residualizing and scaling phenotypes...done (10ms) 
 * annotations      : [~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.regenie.pann] 
   +number of annotations categories = 4
 * masks            : [~/pgen-filtered-maf0.01-eur/mask_definition.txt] n_masks = 1
 * aaf cutoffs      : [ 2 : 0.005 0.01 ] + singletons
 * set file         : [~/pgen-v2-filtered-maf0.01-eur/ukb23159_c4_b0_v1.set_list] n_sets = 720
     +report on burden input files written to [Regenie_v2/calls_file_pLOF_4.03_step_2_c4_ref_first_masks_report.txt]
 * # threads        : [95]
 * # tested sets    : [720]
 * max block size   : [200]
 * rule used to build masks : max
 * approximate memory usage : 2GB
 * rng seed : 1883760746
 * using minimum MAC of 0.5 (masks with lower MAC are ignored)

Chromosome 4 [720 sets in total]
   -reading loco predictions for the chromosome...done (273ms) 
 set [1/720] : ZNF595 - 26 variants...1 chunks
     -reading in genotypes and building masks...(1)memory usage=721MB...(2)memory usage=732MB...WARNING: 3/3 masks fail MAC filter and will be skipped...(3)memory usage=761MB...(4)memory usage=761MB...done (165ms) 

The same error message is reported for all the variants annotated. How could I fix it?

joellembatchou commented 1 year ago

Hi,

Can you run with --write-mask-snplist to see which variants are going into your masks for that first set (ZNF595) [use --nb 1 to only run first set)? Also please run with most recent REGENIE release.

Cheers, Joelle