rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Assertion `chrStrToInt(chromosome, params->nChrom) == chrom' fails in Step 1 #73

seppinho closed this issue 3 years ago

seppinho commented 3 years ago

Hi, I'm running regenie on the UKBB data using the latest Docker image and have prepared the genotypes as described in the tutorial. This is the failed assertion I'm currently seeing:

block [594] : regenie: src/Geno.cpp:1054: void readChunkFromBGENFileToG(int, int, uint32_t, std::vector<snp>&, param*, geno_block*, filter*, const Eigen::Ref<const Eigen::Matrix<bool, -1, -1> >&, const Eigen::Ref<const Eigen::Matrix<double, -1, -1> >&, mstream&): Assertion `chrStrToInt(chromosome, params->nChrom) == chrom' failed.
Options in effect:
  --step 1 \
  --bgen /docker/imputed/XXX.bgen \
  --extract /docker/imputed/XXX.snplist \
  --keep /docker/imputed/XXX.id \
  --phenoFile /docker/phenotype/XXX.txt \
  --covarFile /docker/phenotype/XXX.txt \
  --bsize 1000 \
  --lowmem \
  --out /docker/ukb_step1_bgen \
  --covarColList <...> \
  --phenoColList <...> \
  --sample /docker/imputed/XXX.sample

Any advice would be highly appreciated!

joellembatchou commented 3 years ago

Hi,

Which Regenie version is it? 1.0.7?

If there is an accompanying .bgi file, in the more recent versions it will use the bgi file automatically to get variant information (it should be clear from the log). The error indicates there is some discrepancy regarding the variant chromosome information that's read from the BGEN (or .bgi) file.

seppinho commented 3 years ago

Thanks for the reply. I used v1.0.6.9. FYI, I've now created the bgi index before running Step 1. Are there any checks I should run on the bgen/bgi file before I re-run Step 1 with the regenie Docker version 1.0.7? Thanks again for your help! Sebastian

joellembatchou commented 3 years ago

Hi Sebastian,

No additional checks are needed and you can use the same command as above. Let me know if the issue still persists.

Cheers, Joelle

seppinho commented 3 years ago

Hi again, the assertion also fails with the latest Docker version (1.0.7). It is reproducible: it always fails on block 594.

Could it also be a memory issue? (The node only has 60 GB of main memory.)

This is the regenie output, just in case it helps:

Fitting null model
 * bgen             : [/docker/imputed/merged_filtered.bgen]
   -summary : bgen file (v1.2 layout, compressed) with 487409 named samples and 7734420 variants.
   -index bgi file [/docker/imputed/merged_filtered.bgen.bgi]
   -keeping only variants specified in [/docker/imputed/xxx.snplist]
     +number of variants remaining in the analysis = 7734420
   -sample file: /docker/imputed/xxx.sample
   -keeping only individuals specified in [/docker/imputed/xxx.id]
     +number of genotyped individuals to keep in the analysis = 487409
 * phenotypes       : [/docker/phenotype/xxx.txt] n_pheno = 2
   -keeping and mean-imputing missing observations (done for each trait)
   -number of phenotyped individuals = 371460
 * covariates       : [/docker/phenotype/xxx.txt] n_cov = 13
   -number of individuals with covariate data = 487282
 * number of individuals used in analysis = 371460
   -residualizing and scaling phenotypes...done (8ms) 
 * # threads        : [32]
 * block size       : [1000]
 * # blocks         : [7745]
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 142GB
 * writing level 0 predictions to disk
   -temporary files will have prefix [/docker/ukb_step1_bgen_l0_Y]
   -approximate disk space needed : 282GB
 * setting memory...done

Chromosome 1
 block [1] : 1000 snps  (42883ms) 
 ...
 block [593] : 1000 snps  (43964ms) 
   -residualizing and scaling genotypes...done (5537ms) 
   -calc working matrices...done (8463ms) 
   -calc level 0 ridge...done (13215ms) 
 block [594] : regenie: src/Geno.cpp:1054: void readChunkFromBGENFileToG(int, int, uint32_t, std::vector<snp>&, param*, geno_block*, filter*, const Eigen::Ref<const Eigen::Matrix<bool, -1, -1> >&, const Eigen::Ref<const Eigen::Matrix<double, -1, -1> >&, mstream&): Assertion `chrStrToInt(chromosome, params->nChrom) == chrom' failed.

Thanks again
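
A side note on the log above: it reports 7745 blocks for 7,734,420 variants at a block size of 1000, while a single division over all variants would give 7735. One possible reading, which is an assumption here and not confirmed in this thread, is that blocks are formed separately within each chromosome, so each chromosome can contribute one partial block. The sketch below shows that arithmetic; the function name and the per-chromosome counts are hypothetical.

```python
def count_blocks(variants_per_chrom, block_size):
    """Total block count when each chromosome is split into blocks on its own.

    Assumption for illustration: blocks never span two chromosomes.
    """
    total = 0
    for n_variants in variants_per_chrom.values():
        # Integer ceiling division: a partial block still counts as one block.
        total += (n_variants + block_size - 1) // block_size
    return total


# Hypothetical counts, not taken from the log above.
example_counts = {1: 2500, 2: 1000, 3: 1}
print(count_blocks(example_counts, 1000))  # 3 + 1 + 1 = 5
```

With 22 autosomes, this scheme would yield between 7735 and 7756 blocks for the variant total in the log, which is consistent with the reported 7745.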

seppinho commented 3 years ago

I looked into the chrStrToInt method, and it looks like I might have forgotten to filter for autosomes. Could that result in the failed assertion? One more thing regarding this issue: since the UKBB imputed datasets are in bgen format, I used cat-bgen to merge all the data into one bgen file directly (to avoid the additional disk space needed for a plink conversion) and then used plink to filter the data (see below). The regenie UKBB analysis starts with the UKBB data in plink format. Would you recommend converting to plink first?

plink2 --bgen ukb_imp_v3.bgen ref-first --chr 1-22 --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out merged_filtered --sample XXX.sample --export bgen-1.2 &

Thx. Sebastian
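
The autosome question above could be checked before re-running Step 1 with a small script along these lines. This is only a sketch: it assumes the chromosome labels have already been exported, in file order, into a Python list by some other tool, and the helper `chrom_label_to_int` is a hypothetical stand-in that does not reproduce regenie's `chrStrToInt`.

```python
def chrom_label_to_int(label, n_autosomes=22):
    """Map a chromosome label to an integer code, or None if it is not an autosome.

    Hypothetical helper for illustration only; not regenie's chrStrToInt.
    """
    text = label.strip()
    if text.lower().startswith("chr"):
        text = text[3:]
    if text.isascii() and text.isdigit():
        value = int(text)  # "01" and "1" both map to 1
        if 1 <= value <= n_autosomes:
            return value
    return None  # X, Y, XY, MT and anything else are treated as non-autosomal


def find_suspect_variants(labels, n_autosomes=22):
    """Return (index, label, reason) for labels that are non-autosomal or out of order."""
    problems = []
    highest_seen = 0
    for index, label in enumerate(labels):
        code = chrom_label_to_int(label, n_autosomes)
        if code is None:
            problems.append((index, label, "not an autosome"))
            continue
        if code < highest_seen:
            problems.append((index, label, "chromosome order goes backwards"))
        highest_seen = max(highest_seen, code)
    return problems


# Hypothetical labels, not taken from the files discussed in this thread.
for problem in find_suspect_variants(["1", "01", "chr2", "X", "1"]):
    print(problem)
```

An empty result would suggest the labels themselves are not the problem under these assumptions; it says nothing about how regenie reads them from the BGEN or .bgi file.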

rgcgithub commented 3 years ago

Hi Sebastian,

It looks to me as if you are passing a set of 7.7M imputed SNPs to REGENIE Step 1. The REGENIE workflow has two steps. In Step 1, a whole genome regression model is fit to a set of common SNPs. For the UKBB data we recommend using the array genotypes (say 500,000 SNPs or so). The output of this step (essentially a set of polygenic risk scores using the LOCO scheme) is passed to Step 2, along with the set of SNPs you want to test, which for UKBB will be the imputed genotypes. I'd recommend switching to this workflow and then seeing whether you get the same issue.

We have a page here https://rgcgithub.github.io/regenie/recommendations/ that aims to describe the workflow using the array and imputed files distributed by UKBB.

Best wishes, Jonathan
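
The two-step workflow described above can be sketched as a pair of command lines, assembled here in Python without running anything. All file names and prefixes are hypothetical placeholders, and only a minimal set of options is shown (trait type and test options are left out); the recommendations page linked above remains the authoritative description.

```python
import shlex


def step1_args(bed_prefix, snplist, keep_ids, pheno, covar, out_prefix, bsize=1000):
    """Step 1: fit the whole genome regression model on the array genotypes."""
    return [
        "regenie", "--step", "1",
        "--bed", bed_prefix,        # array genotypes in plink bed/bim/fam format
        "--extract", snplist,       # variants passing QC
        "--keep", keep_ids,         # samples passing QC
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--bsize", str(bsize),
        "--lowmem",
        "--out", out_prefix,
    ]


def step2_args(bgen, sample, pheno, covar, step1_out_prefix, out_prefix, bsize=400):
    """Step 2: test the imputed variants using the Step 1 predictions."""
    return [
        "regenie", "--step", "2",
        "--bgen", bgen,             # imputed genotypes to test
        "--sample", sample,
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--pred", step1_out_prefix + "_pred.list",  # list file written by Step 1
        "--bsize", str(bsize),
        "--out", out_prefix,
    ]


# Hypothetical paths for illustration only.
print(shlex.join(step1_args("array_all_chrs", "qc_pass.snplist", "qc_pass.id",
                            "pheno.txt", "covar.txt", "ukb_step1")))
print(shlex.join(step2_args("imputed_chr1.bgen", "imputed_chr1.sample",
                            "pheno.txt", "covar.txt", "ukb_step1", "ukb_step2_chr1")))
```

The point of the split is that the 7.7M imputed variants only appear in the second command, while the first one sees the much smaller array set.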

seppinho commented 3 years ago

Hi Jonathan, hi Joelle, Awesome. Thank you so much for your feedback. I wrongly assumed that the genotyped markers needed for step 1 are extracted from the imputed data. I'll re-run step 1 using the array data and will report back. Thanks again!!

seppinho commented 3 years ago

Just wanted to report back that using the array genotypes instead of the imputed data solved my issue in Step 1. Thanks again for your help! Sebastian