rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Unable to run regenie with Genomics England imputed genotypes on the UKB-RAP platform #491

Closed kauralasoo closed 9 months ago

kauralasoo commented 9 months ago

I hope this issue is not completely inappropriate to report here.

We've now been trying for a couple of days to run regenie on the UKB-RAP platform with Genomics England imputed genotypes (data field 21008) with no success. The same command works without issues with the old WTCHG imputed genotypes (data field 22828).

This command with WTCHG imputation works just fine:

regenie --step 2 --bgen ukb22828_c20_b0_v3.bgen --sample ukb22828_c20_b0_v3.sample \
--ref-first --phenoFile UKBB_creatinine_300k.tsv --covarFile UKBB_creatinine_300k.tsv \
--phenoCol Creatinine --qt --covarCol Total_BCAA --covarCol Lactate \
--chr 20 --out assoc_GEL_imputation.20 --bsize 200 --ignore-pred \
--pThresh 0.05 --minMAC 20 --minINFO 0.6 --apply-rint --threads 8 --gz

This is the regenie log:

Start time: Fri Jan 26 17:34:40 2024
|===========================|
| REGENIE v3.4.gz |
|===========================|
Copyright (c) 2020-2023 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Compiled with HTSlib.
Using Intel MKL with Eigen.
Log of output saved in file : assoc_GEL_imputation.20.log
Options in effect:
--step 2 \
--bgen ukb22828_c20_b0_v3.bgen \
--sample ukb22828_c20_b0_v3.sample \
--ref-first \
--phenoFile UKBB_creatinine_300k.tsv \
--covarFile UKBB_creatinine_300k.tsv \
--phenoCol Creatinine \
--qt \
--covarCol Total_BCAA \
--covarCol Lactate \
--chr 20 \
--out assoc_GEL_imputation.20 \
--bsize 200 \
--ignore-pred \
--pThresh 0.05 \
--minMAC 20 \
--minINFO 0.6 \
--apply-rint \
--threads 8 \
--gz
Association testing mode with fast multithreading using OpenMP
* bgen : [ukb22828_c20_b0_v3.bgen]
-summary : bgen file (v1.2 layout, zlib compressed) with 487409 anonymous samples and 2082571 variants with 8-bit encoding.
-index bgi file [ukb22828_c20_b0_v3.bgen.bgi]
-sample file: ukb22828_c20_b0_v3.sample
* phenotypes : [UKBB_creatinine_300k.tsv] n_pheno = 1
-dropping observations with missing values at any of the phenotypes
-number of phenotyped individuals with no missing data = 271500
* covariates : [UKBB_creatinine_300k.tsv] n_cov = 2
-number of individuals with covariate data = 271500
* number of individuals used in analysis = 271500
-applying RINT to all phenotypes
* number of observations for each trait:
- 'Creatinine': 271500 observations
* no step 1 predictions given. Simple linear regression will be performed
-residualizing and scaling phenotypes...done (2ms)
* # threads : [8]
* block size : [200]
* # blocks : [10413]
* approximate memory usage : 2GB
* using minimum MAC of 20 (variants with lower MAC are ignored)
* using minimum imputation info score of 0.6 (variants with lower info score are ignored)
* user specified to test only on select chromosomes
Chromosome 20 [10413 blocks in total]
block [1/10413] : done (1609ms)
....

But switching the bgen file to the Genomics England imputation and changing the chromosome filter to 'chr20' fails with a BGenError:

regenie --step 2 --bgen ukb21008_c20_b0_v1.bgen --sample ukb21008_c20_b0_v1.sample \
--ref-first --phenoFile UKBB_creatinine_300k.tsv --covarFile UKBB_creatinine_300k.tsv \
--phenoCol Creatinine --qt --covarCol Total_BCAA --covarCol Lactate \
--chr chr20 --out assoc_GEL_imputation.20 --bsize 200 --ignore-pred \
--pThresh 0.05 --minMAC 20 --minINFO 0.6 --apply-rint --threads 8 --gz

This is the regenie log:

Start time: Fri Jan 26 15:27:19 2024
|===========================|
| REGENIE v3.4.gz |
|===========================|
Copyright (c) 2020-2023 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.
Compiled with HTSlib.
Using Intel MKL with Eigen.
Log of output saved in file : assoc_GEL_imputation.20.log
Options in effect:
--step 2 \
--bgen ukb21008_c20_b0_v1.bgen \
--sample ukb21008_c20_b0_v1.sample \
--ref-first \
--phenoFile UKBB_creatinine_300k.tsv \
--covarFile UKBB_creatinine_300k.tsv \
--phenoCol Creatinine \
--qt \
--covarCol Total_BCAA \
--covarCol Lactate \
--chr chr20 \
--out assoc_GEL_imputation.20 \
--bsize 200 \
--ignore-pred \
--pThresh 0.05 \
--minMAC 20 \
--minINFO 0.6 \
--apply-rint \
--threads 8 \
--gz
Association testing mode with fast multithreading using OpenMP
* bgen : [ukb21008_c20_b0_v1.bgen]
-summary : bgen file (v1.2 layout, zstd compressed) with 488315 anonymous samples and 7899620 variants with 8-bit encoding.
-index bgi file [ukb21008_c20_b0_v1.bgen.bgi]
ERROR: BGenError

We are using the SwissArmyKnife v4.9.1 executable on the UKB-RAP and running regenie from this Docker image:

ghcr.io/rgcgithub/regenie/regenie:v3.4.gz

The cloud instance type is mem3_ssd1_v2_x8.

Has anyone else encountered the same problem?

joellembatchou commented 9 months ago

Hi,

Can you try moving the bgi file 'ukb21008_c20_b0_v1.bgen.bgi' to another folder and re-running REGENIE (so REGENIE does not use it)? The error from the log seems to point to an issue when reading the index BGI file and matching it with the genotype file.
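
One way to try this suggestion is sketched below. It assumes the BGEN file and its index sit in a writable working directory; `hide_bgi` and the `bgi_disabled` folder name are hypothetical and are not part of regenie or the UKB-RAP tooling.

```shell
# Hypothetical helper: move the .bgi index that sits next to a BGEN file
# into a sibling folder, so that regenie no longer finds it beside the BGEN.
hide_bgi() {
  bgi="$1.bgi"
  stash="$(dirname "$1")/bgi_disabled"   # assumed folder name
  if [ -f "$bgi" ]; then
    mkdir -p "$stash" && mv "$bgi" "$stash/"
  fi
}

# Does nothing if the index is not present in the current directory.
hide_bgi ukb21008_c20_b0_v1.bgen
```

After this, the step 2 command can be re-run unchanged. If the files are only available through a read-only mount, this sketch does not apply, because the index cannot be moved there.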

kauralasoo commented 9 months ago

Thanks! Indeed, it looks like the .bgi files for the GEL imputation are broken on the UKB-RAP platform. Running regenie without the .bgi index and chromosome filter works!
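
For anyone who wants a first sanity check on a suspect index, here is a minimal sketch. It relies on .bgi files written by bgenix being SQLite databases, which begin with the string "SQLite format 3". The helper name `looks_like_sqlite` is hypothetical. Passing this check does not show that the index matches the BGEN file, and the thread does not establish how the GEL index files are actually broken.

```shell
# Hypothetical helper: succeed only if the file starts with the SQLite magic string.
looks_like_sqlite() {
  [ "$(head -c 15 "$1" 2>/dev/null | tr -d '\000')" = "SQLite format 3" ]
}

if looks_like_sqlite ukb21008_c20_b0_v1.bgen.bgi; then
  echo "index has a SQLite header"
else
  echo "index is missing or does not look like a SQLite file"
fi
```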