rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
187 stars 55 forks source link

Regenie V3 SKAT/ACAT Error #252

Closed GHawkes93 closed 2 years ago

GHawkes93 commented 2 years ago

Hi,

Thanks for updating REGENIE again - we've really appreciated the software you've made available in our department.

I've found that when running any SKAT/ACAT test, I get the following error:

Chromosome 1 [4293 sets in total] -reading loco predictions for the chromosome...done (462ms) set [1/4293] : 349 variants... -reading in genotypes, computing gene-based tests and building masks...regenie_v3.0.gz_x86_64_Centos7_mkl: ./external_libs/eigen3/Eigen/src/Core/PlainObjectBase.h:312: void Eigen::PlainObjectBase::resize(Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::Index = long int]: Assertion `((SizeAtCompileTime == Dynamic && (MaxSizeAtCompileTime==Dynamic || size<=MaxSizeAtCompileTime)) || SizeAtCompileTime == size) && size>=0' failed. Aborted

I've tested this with both the Linux and Centos pre-built binaries, and am using the following input

./regenie_v3.0.gz_x86_64Linux \ --step 2 \ --bgen ukb${chr}.bgen \ --sample ukb_${chr}.sample \ --phenoFile Phenotype_TSV \ --covarFile Covariate_TSV \ --bsize 400 \ --catCovarList ${catvars} \ --maxCatLevels 30 \ --exclude ukb23149_450k_OQFE.90pct10dp_qc_variants.txt \ --apply-rint \ --vc-tests skat,skato,skato-acat,acatv,acato,acato-full \ --vc-maxAAF 0.01 \ --joint minp,acat,nnls \ --anno-file annotations_chr${chr}.txt \ --set-list set_list_chr${chr}.txt \ --mask-def masks \ --pred ${pheno}_Step1_pred.list \ --check-burden-files \ --out ${output}/${pheno}_Step2_Chr${chr}

I've found that the joint test command works if I remove the --vc flags, but the --vc flags do not work when the joint flag is removed. I've also tried running only one --vc test at a time and this did not resolve things.

Thanks for all of your work!

Just as a general point, I've found that the pre-built binaries don't work out of the box on UKBB's DNA Nexus platform - I had to run the following commands before it would run:

sudo apt-get install gcc-7 g++-7 sudo apt-get install gfortran-7

Best wishes, Gareth

joellembatchou commented 2 years ago

Hi Gareth,

Seems I am unable to replicate this error (specifying both --vc-tests and --joint)... Can you include the full log of the run? Also, just as a rough check, can you try skipping that first set using --starting-block 2 and see if the error still occurs?

Indeed, this version requires having GFortran installed as we use Fortran libraries for computing SKATO p-values (I have now made it explicit on the Release page).

Cheers, Joelle

GHawkes93 commented 2 years ago

Thanks Joelle.

I tried what you suggested, and did get a slightly different error:

############

Chromosome 2 [2773 sets in total] -reading loco predictions for the chromosome...done (277ms) set [10/2773] : 80 variants... -reading in genotypes, computing gene-based tests and building masks...regenie_v3.0.gz_x86_64_Linux: ./external_libs/eigen3/Eigen/src/Core/Redux.h:413: typename Eigen::internal::traits::Scalar Eigen::DenseBase::redux(const Func&) const [with BinaryOp = Eigen::internal::scalar_max_op<double, double>; Derived = Eigen::CwiseUnaryOp<Eigen::internal::scalar_abs_op, const Eigen::Matrix<double, -1, -1> >; typename Eigen::internal::traits::Scalar = double]: Assertion `this->rows()>0 && this->cols()>0 && "you are using an empty matrix"' failed. Aborted

######## Here's a full log file: interestingly the error doesn't make it to the log. ####################

Start time: Sat Mar 12 17:25:52 2022

          |===========================|
          |      REGENIE v3.0.gz      |
          |===========================|

Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini. Distributed under the MIT License. Compiled with Boost Iostream library.

Log of output saved in file : bmi_regenie_burden_dnanexus_2022-03-12/bmi_Step2_Chr2.log

Options in effect: --step 2 \ --bgen ukb_2.bgen \ --sample ukb_2.sample \ --phenoFile Phenotype_TSV \ --vc-maxAAF 0.01 \ --joint minp,acat,nnls \ --mask-def masks \ --pred bmi_Step1_pred.list \ -number of individuals with covariate data = 419854

Association testing mode (joint tests) with fast multithreading using OpenMP

Chromosome 2 [2773 sets in total] -reading loco predictions for the chromosome...done (276ms) set [2/2773] : 525 variants... -reading in genotypes, computing gene-based tests and building masks...

#################################################

Thanks for all your help!

Cheers, Gareth

GHawkes93 commented 2 years ago

Just doing some exploration - the error seems to be tied to the --minMAC. I had it set to 1, but if I increase it some sets don't fail. Not entirely sure how yet!

Edit; It definitely seems to be --minMAC, although I still get the error with the default of 5. The higher I set it, the less likely a set is to error.

joellembatchou commented 2 years ago

Could you try running with --debug? It will give a more verbose output to stderr but will help better identify where the error is coming from.

GHawkes93 commented 2 years ago

Thanks Joelle - I've included the full log below.

It looks as though, for the skat test, one of the masks is empty ("#in mask = 0"). However, we've checked the genotype file generated by --write-mask and this doesn't seem to be true...

We did try removing all singletons as an experiment, and had the same error. So we're a bit confused!

####################################

Start time: Tue Mar 15 09:11:38 2022
              |==========================|
              |        REGENIE v3.0      |
              |==========================|
Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Log of output saved in file : shbg_r114w_raw_sin_regenie_burden_dnanexus_2022-03-15/shbg_Step2_Chr20.log
Options in effect:
  --step 2 \
  --bgen ukb_20.bgen \
  --sample ukb_20.sample \
  --phenoFile Phenotype_TSV \
  --covarFile Covariate_TSV \
  --bsize 400 \
  --minMAC 1 \
  --catCovarList sex,centre,wes_batch \
  --maxCatLevels 30 \
  --exclude ukb23149_450k_OQFE.90pct10dp_qc_variants.txt \
  --apply-rint \
  --vc-tests skat,skato,skato-acat,acatv,acato,acato-full \
  --vc-maxAAF 0.01 \
  --joint minp,acat,nnls \
  --debug \
  --anno-file annotations_chr20.txt \
  --set-list set_list_chr20.txt \
  --mask-def masks \
  --pred shbg_r114w_Step1_pred.list \
  --aaf-bins 0.001,0.0001 \
  --check-burden-files \
  --out shbg_raw_sin_regenie_burden_dnanexus_2022-03-15/shbg_Step2_Chr20
Association testing mode (joint tests) with fast multithreading using OpenMP
 * bgen             : [ukb_20.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 454756 named samples and 677089 variants with 8-bit encoding.
   -index bgi file [ukb_20.bgen.bgi]
   -removing variants specified by --exclude
   -number of variants remaining in the analysis = 550780
   -sample file: ukb_20.sample 
* phenotypes       : [Phenotype_TSV] n_pheno = 1
   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals with no missing data = 363403
 * covariates       : [Covariate_TSV] n_cov = 45
   -number of individuals with covariate data = 419854
 * number of individuals used in analysis = 363403
   -applying RINT to all phenotypes
 * LOCO predictions : [shbg_Step1_pred.list]
   -file [/home/dnanexus/shbg_Step1_1.loco] for phenotype 'shbg'
   -residualizing and scaling phenotypes...done (76ms) 
 * annotations      : [annotations_chr20.txt] 
   +number of annotations categories = 13
 * masks            : [masks] n_masks = 19
WARNING: Detected 7 masks with unknown annotations.
 * aaf cutoffs      : [ 3 : 0.0001 0.001 0.01 ] + singletons
 * set file         : [set_list_chr20.txt] n_sets = 1163
WARNING: Detected 989 sets with variants not in genetic data or annotation files.
     +report on burden input files written to [shbg_raw_sin_regenie_burden_dnanexus_2022-03-15/shbg_Step2_Chr20_masks_report.txt]
 * # threads        : [15]
 * # tested sets    : [1163]
 * max block size   : [400]
 * rule used to build masks : max
 * computing gene-based tests for each set of variants included in a mask (rho=[0,0.01,0.04,0.09,0.16,0.25,0.5,1])
  -variants with MAC <= 10 are collapsed into a mask
  -weights are obtained from Beta(MAF,1,25)
 * approximate memory usage : 2GB
 * using minimum MAC of 1 (masks with lower MAC are ignored)
Chromosome 20 [1163 sets in total]
   -reading loco predictions for the chromosome...done (279ms) 
 set [1/1163] : 301 variants...1 chunks
set #4; rare_mask [mu,nZ,w,w_a] = [0.000368737,134,24.8896,0.114194]
set #8; rare_mask [mu,nZ,w,w_a] = [0.000162354,59,24.9513,0.0505343]
set #12; rare_mask [mu,nZ,w,w_a] = [0.000126581,46,24.9621,0.0394341]
set #20; rare_mask [mu,nZ,w,w_a] = [6.60424e-05,24,24.9802,0.0206049]
set #24; rare_mask [mu,nZ,w,w_a] = [0.000828282,301,24.7527,0.253637]
set #28; rare_mask [mu,nZ,w,w_a] = [0.000883317,321,24.7363,0.270126]
set #32; rare_mask [mu,nZ,w,w_a] = [0.000231148,84,24.9307,0.0718259]
set #36; rare_mask [mu,nZ,w,w_a] = [0.000231148,84,24.9307,0.0718259]
set #40; rare_mask [mu,nZ,w,w_a] = [4.40283e-05,16,24.9868,0.013744]
set #44; rare_mask [mu,nZ,w,w_a] = [4.40283e-05,16,24.9868,0.013744]
set #48; rare_mask [mu,nZ,w,w_a] = [0.000393503,143,24.8822,0.12179]
set #52; rare_mask [mu,nZ,w,w_a] = [0.000393503,143,24.8822,0.12179]
set #56; rare_mask [mu,nZ,w,w_a] = [0.000206382,75,24.9382,0.0641692]
set #60; rare_mask [mu,nZ,w,w_a] = [0.000206382,75,24.9382,0.0641692]
set #64; rare_mask [mu,nZ,w,w_a] = [0.00035773,130,24.8929,0.110815]
set #68; rare_mask [mu,nZ,w,w_a] = [0.00035773,130,24.8929,0.110815]
set #72; rare_mask [mu,nZ,w,w_a] = [0.000170609,62,24.9489,0.0530931]
set #76; rare_mask [mu,nZ,w,w_a] = [0.000170609,62,24.9489,0.0530931]
Q_SKAT for all masks:
          0           0           0  1.6202e+06           0           0           0     24223.4           0           0           0     7062.74           0           0           0           0           0           0           0     37.6004           0           0           0 1.00525e+06           0           0           0  1.0151e+06           0           0           0      644617           0           0           0      644617           0           0           0      556768           0           0           0      556768           0           0           0      590011           0           0           0      590011           0           0           0      517106           0           0           0      517106           0           0           0      609077           0           0           0      609077           0           0           0      529293           0           0           0      529293
Q_BURDEN for all masks:
          0           0           0 4.70948e+06           0           0           0       24804           0           0           0     7376.16           0           0           0           0           0           0           0     15.8105           0           0           0     1823.06           0           0           0     545.949           0           0           0      756493           0           0           0      756493           0           0           0      467509           0           0           0      467509           0           0           0 1.05539e+06           0           0           0 1.05539e+06           0           0           0      708217           0           0           0      708217           0           0           0      913558           0           0           0      913558           0           0           0      592743           0           0           0      592743
#in mask=23
SV log10p:
 0.0452755   0.364048    1.73605   0.466274  0.0265962  0.0495139   0.253057   0.840524   0.212322   0.088827   0.215281   0.046342 0.00187424   0.207645    1.23746   0.441253   0.660409   0.090184   0.526401   0.190597  0.0368002   0.181863  0.0298006
Wsq:
0.00945218    1.22137  0.0197479  0.0419982  0.0111701   0.690306  0.0180341   0.418117  0.0163177   0.013744 0.00945218  0.0257459  0.0240331  0.0522404  0.0624665  0.0786241  0.0368709  0.0505344  0.0334492  0.0163177  0.0180329  0.0334491   0.114194
L:
6869.15  7197.6 8380.83 10235.7 11857.7   12294 13100.9 13820.4 15465.4 17979.6 20170.4   24302 25707.1 28664.3 33058.2 37344.9 42640.8 52607.2 72986.8  152837  376734  684795
[muQ, scFac]=1.66905e+06 0.860811
tau=     523651      540218      589917      672750      788716      937815 1.35198e+06 2.17865e+06
Q:
 1.6202e+06 1.65109e+06 1.74377e+06 1.89824e+06 2.11449e+06 2.39252e+06 3.16484e+06 4.70639e+06
skat-acat logp=0.254632 0.263425 0.290334 0.336663 0.403534  0.48992 0.692087 0.848649
acato logp=0.133924 0.254632 0.263425 0.290334 0.336663 0.403534  0.48992 0.692087 0.848649
Qmin=3.63195e+06 3.63189e+06  3.6337e+06 3.64256e+06 3.66503e+06 3.70845e+06 3.91918e+06 4.70639e+06
logp=0.254632 0.263425 0.290334 0.336663 0.403534  0.48992 0.692087 0.848649
Niter=1071;integral=0.769134Abs.error=5.88113e-05;rel.error=0.00012207;fail=0/0
g(0.2)=0.736862 g(0.4)=0.467703 g(0.6)=0.342523 g(0.8)=0.265837 
SKATO p=0.230866 (minP=0.141694; Bonf=1.13355)
#in mask=2
SV log10p:
0.00651554   0.380177
Wsq:
0.0137495 0.0505343
L:
15704.4
[muQ, scFac]=15704.4 0.711681
tau=31001.3 31158.3 31629.2 32414.2 33513.2 34926.1 38850.9 46684.9
Q:
24223.4 24229.2 24246.6 24275.7 24316.3 24368.6 24513.7 24803.4
skat-acat logp=0.265361 0.265429 0.265736 0.266591 0.268507 0.272191 0.289084 0.332709
acato logp=0.032355 0.265361 0.265429 0.265736 0.266591 0.268507 0.272191 0.289084 0.332709
Qmin=29941.8   29941 29929.8 29881.7   29754 29493.7 28318.3 24956.9
logp=0.265361 0.265429 0.265736 0.266591 0.268507 0.272191 0.289084 0.332709
Niter=1155;integral=0.406543Abs.error=1.13005e-05;rel.error=0.00012207;fail=0/0
g(0.2)=0.609359 g(0.4)=0.36171 g(0.6)=0 g(0.8)=0 
SKATO p=0.593457 (minP=0.464827; Bonf=3.71862)
#in mask=2
SV log10p:
0.00651554   0.207853
Wsq:
0.0137495 0.0394341
L:
14812.5
[muQ, scFac]=14812.5 0.788383
tau=23828.7 23976.8 24421.1 25161.5 26198.2 27531.1 31233.4 38623.4
Q:
7062.74 7065.88 7075.28 7090.95 7112.89  7141.1 7219.45 7375.84
skat-acat logp=0.0909892 0.0910337 0.0912203 0.0917117 0.0927923 0.0949142  0.106586   0.18239
acato logp=0.0242047 0.0909892 0.0910337 0.0912203 0.0917117 0.0927923 0.0949142  0.106586   0.18239
Qmin=13826.7   13826 13816.1 13773.1 13657.8 13417.1 12220.8 7627.67
logp=0.0909892 0.0910337 0.0912203 0.0917117 0.0927923 0.0949142  0.106586   0.18239
Niter=1281;integral=0.221861Abs.error=1.25942e-05;rel.error=0.00012207;fail=0/0
g(0.2)=0 g(0.4)=0 g(0.6)=0 g(0.8)=0 
SKATO p=0.778139 (minP=0.657068; Bonf=5.25654)
#in mask=0
SV log10p:
Wsq:
regenie_v3.0.gz_x86_64_Linux: ./external_libs/eigen3/Eigen/src/Core/Redux.h:413: typename Eigen::internal::traits<T>::Scalar Eigen::DenseBase<Derived>::redux(const Func&) const [with BinaryOp = Eigen::internal::scalar_max_op<double, double>; Derived = Eigen::CwiseUnaryOp<Eigen::internal::scalar_abs_op<double>, const Eigen::Matrix<double, -1, -1> >; typename Eigen::internal::traits<T>::Scalar = double]: Assertion `this->rows()>0 && this->cols()>0 && "you are using an empty matrix"' failed.
run_step2_continuous_xargs.sh: line 40:  6961 Aborted                 ./regenie_v3.0.gz_x86_64_Linux --step 2 --bgen ukb_${chr}.bgen --sample ukb_${chr}.sample --phenoFile Phenotype_TSV --covarFile Covariate_TSV --bsize 400 --minMAC=1 --catCovarList ${catvars} --maxCatLevels 30 --exclude ukb23149_450k_OQFE.90pct10dp_qc_variants.txt --apply-rint --vc-tests skat,skato,skato-acat,acatv,acato,acato-full --vc-maxAAF 0.01 --joint minp,acat,nnls --debug --anno-file annotations_chr${chr}.txt --set-list set_list_chr${chr}.txt --mask-def masks --pred ${pheno}_Step1_pred.list --aaf-bins 0.001,0.0001 --check-burden-files --out ${output}/${pheno}_Step2_Chr${chr}
joellembatchou commented 2 years ago

Hi Gareth,

I've just pushed in update (bec102fa8b66e4a8ad1116061bfadd3377055838) which will address this error (it will skip variant sets where there are no variants included). Let me know if you still encounter errors.

In the genotype file from --write-mask, are all 76 masks for the set in the file (19 mask categories & 4 AAF cutoff [including singletons])? (Masks with no variants will not be included in this file)

Cheers, Joelle

GHawkes93 commented 2 years ago

Thanks very much Joelle, that seems to have done the trick!!

I do occasionally see "ERROR: computing NNLS weight(i =6) & component(j = 3 out of 10) failed; pnorm(V11inv)*pnorm(V220) returned negative value (-9.05998e-08)", but that looks like a very small problem, and it doesn't seem to cause the session to stop.

Shall I still check the genotype file again?

Cheers, Gareth

joellembatchou commented 2 years ago

Hi Gareth,

Great to hear this addressed your issue. The error from NNLS test means it failed to get the p-value (so it would just print NA for the NNLS test result. You can check the PLINK BIM file for the written masks to see if the 76 masks are present for the set that gave the "# in mask = 0" [like grep -c "^SETNAME_" <(awk '{print $2}' PLINK_BIM_FILE)].