Closed seppinho closed 3 years ago
Hi,
Which Regenie version is it? 1.0.7?
If there is an accompanying .bgi file, in the more recent versions it will use the bgi file automatically to get variant information (it should be clear from the log). The error indicates there is some discrepancy regarding the variant chromosome information that's read from the BGEN (or .bgi) file.
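If the index needs to be created manually, the bgenix tool that ships with the bgen library can build it. A sketch, reusing the path from the log below for illustration:

```shell
# Create merged_filtered.bgen.bgi next to the BGEN file (path illustrative)
bgenix -g /docker/imputed/merged_filtered.bgen -index
```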
Thanks for the reply. I used v1.0.6.9. FYI, I've now created the bgi index before running Step 1. Are there any checks I should run on the bgen/bgi files before I re-run Step 1 with the regenie Docker image for 1.0.7? Thanks again for your help! Sebastian
Hi Sebastian,
No additional checks are needed and you can use the same command as above. Let me know if the issue still persists.
Cheers, Joelle
Hi again, So the assertion fails again with the latest Docker version (1.0.7), and it is reproducible: it always fails on block 594.
Could it also be a memory issue? (node only has 60 GB main memory)
This is the regenie output, just in case it helps:
Fitting null model
* bgen : [/docker/imputed/merged_filtered.bgen]
-summary : bgen file (v1.2 layout, compressed) with 487409 named samples and 7734420 variants.
-index bgi file [/docker/imputed/merged_filtered.bgen.bgi]
-keeping only variants specified in [/docker/imputed/xxx.snplist]
+number of variants remaining in the analysis = 7734420
-sample file: /docker/imputed/xxx.sample
-keeping only individuals specified in [/docker/imputed/xxx.id]
+number of genotyped individuals to keep in the analysis = 487409
* phenotypes : [/docker/phenotype/xxx.txt] n_pheno = 2
-keeping and mean-imputing missing observations (done for each trait)
-number of phenotyped individuals = 371460
* covariates : [/docker/phenotype/xxx.txt] n_cov = 13
-number of individuals with covariate data = 487282
* number of individuals used in analysis = 371460
-residualizing and scaling phenotypes...done (8ms)
* # threads : [32]
* block size : [1000]
* # blocks : [7745]
* # CV folds : [5]
* ridge data_l0 : [5 : 0.01 0.25 0.5 0.75 0.99 ]
* ridge data_l1 : [5 : 0.01 0.25 0.5 0.75 0.99 ]
* approximate memory usage : 142GB
* writing level 0 predictions to disk
-temporary files will have prefix [/docker/ukb_step1_bgen_l0_Y]
-approximate disk space needed : 282GB
* setting memory...done
Chromosome 1
block [1] : 1000 snps (42883ms)
...
block [593] : 1000 snps (43964ms)
-residualizing and scaling genotypes...done (5537ms)
-calc working matrices...done (8463ms)
-calc level 0 ridge...done (13215ms)
block [594] : regenie: src/Geno.cpp:1054: void readChunkFromBGENFileToG(int, int, uint32_t, std::vector<snp>&, param*, geno_block*, filter*, const Eigen::Ref<const Eigen::Matrix<bool, -1, -1> >&, const Eigen::Ref<const Eigen::Matrix<double, -1, -1> >&, mstream&): Assertion `chrStrToInt(chromosome, params->nChrom) == chrom' failed.
Thanks again
I looked into the chrStrToInt method, and it looks like I might have forgotten to filter for autosomes. Could that result in the failed assertion?
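For illustration, here is a rough shell sketch of how an autosome-only chromosome check might behave when non-autosomal codes slip through; this is a hypothetical analog, not regenie's actual chrStrToInt code:

```shell
# Hypothetical sketch of an autosome-only chromosome parser, loosely
# mirroring what a check like chrStrToInt might do (NOT regenie's code).
chr_to_int() {
  local chr="${1#chr}"                    # drop an optional "chr" prefix
  case "$chr" in
    [1-9]|1[0-9]|2[0-2]) echo "$chr" ;;   # autosomes 1-22 map to themselves
    *)                   echo "-1" ;;     # X/XY/MT/etc.: no autosome code
  esac
}

chr_to_int 7      # prints 7
chr_to_int chr21  # prints 21
chr_to_int X      # prints -1 -> an autosome-only assertion would trip here
```

A variant on chromosome X, XY, or MT would fall into the last branch, which is consistent with the idea that unfiltered non-autosomes could trigger a chromosome-mismatch assertion.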
One more thing regarding this issue: since the UKBB imputed datasets are in bgen format, I used cat-bgen to merge all the data into one bgen file directly (avoiding the additional disk space a plink conversion would need) and then used plink to filter the data (see below). The regenie UKBB analysis starts with the UKBB data in plink format. Would you recommend converting to plink first?
plink2 --bgen ukb_imp_v3.bgen ref-first --chr 1-22 --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out merged_filtered --sample XXX.sample --export bgen-1.2 &
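For context, the merge step mentioned above might look roughly like this (file names are illustrative; cat-bgen ships with the bgen library):

```shell
# Concatenate per-chromosome BGEN files into one merged file
# (file names are illustrative placeholders)
cat-bgen -g ukb_imp_chr1_v3.bgen ukb_imp_chr2_v3.bgen -og ukb_imp_v3.bgen
```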
Thx. Sebastian
Hi Sebastian,
It looks to me as if you are passing a set of 7.7M imputed SNPs to REGENIE Step 1. The REGENIE workflow has two steps. In Step 1, a whole-genome regression model is fit to a set of common SNPs; for the UKBB data we recommend using the array genotypes (around 500,000 SNPs). The output of this step (essentially a set of polygenic risk scores computed with the LOCO scheme) is passed to Step 2, along with the set of SNPs you want to test, which for UKBB will be the imputed genotypes. I'd recommend switching to this workflow and then seeing whether you get the same issues.
We have a page here https://rgcgithub.github.io/regenie/recommendations/ that aims to describe the workflow using the array and imputed files distributed by UKBB.
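As a rough sketch of the two-step workflow (file names are illustrative placeholders; the options follow the regenie documentation, but check the recommendations page for the exact invocation):

```shell
# Step 1: fit the whole-genome regression model on QC'd array genotypes
regenie \
  --step 1 \
  --bed ukb_array_genotypes \
  --extract qc_pass.snplist \
  --phenoFile phenotypes.txt \
  --covarFile covariates.txt \
  --bsize 1000 \
  --lowmem \
  --out step1_out

# Step 2: test the imputed variants, using the Step 1 LOCO predictions
regenie \
  --step 2 \
  --bgen ukb_imputed.bgen \
  --sample ukb_imputed.sample \
  --phenoFile phenotypes.txt \
  --covarFile covariates.txt \
  --bsize 400 \
  --pred step1_out_pred.list \
  --out step2_out
```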
Best wishes, Jonathan
Hi Jonathan, hi Joelle, Awesome, thank you so much for your feedback. I wrongly assumed that the genotyped markers needed for Step 1 are extracted from the imputed data. I'll re-run Step 1 using the array data and report back. Thanks again!!
Just wanted to report back that using the array genotypes instead of the imputed data solved my issue in Step 1. Thanks again for your help! Sebastian
Hi, I'm running regenie on the UKBB data using the latest Docker image and prepared the genotypes as described in the tutorial. This is the failed assertion I'm currently seeing:
Any advice would be highly appreciated!