rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Strange INFO scores with sex chromosomes #333

Closed Ojami closed 1 year ago

Ojami commented 2 years ago

Hi Joelle,

In the documentation, it has been suggested that users should turn to pgen/bed files for analysing sex chromosomes:

To include X chromosome genotypes in step 1 and/or step 2, males should be coded as diploid so that their genotypes are 0/2 (this is done automatically for BED and PGEN file formats with haploid genotypes). Chromosome values of 23 (for human analyses), X, Y, XY, PAR1 and PAR2 are all acceptable and will be collapsed into a single chromosome.

I'm using the UK Biobank imputation BGEN files, and realized that REGENIE works just fine with BGEN files too (are males already coded as diploid?). The stats are the same as when I use a PGEN file (generated from the BGEN file), except for the INFO score, which is > 1 for PGEN.

PGEN:

CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P EXTRA
23 2699555 rs311165 C A 0.414361 1.21599 347126 ADD 0.00155687 0.00198735 0.613699 0.363113 NA

BGEN:

CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P EXTRA
23 2699555 rs311165 C A 0.414361 0.753854 347126 ADD 0.00155694 0.00198734 0.613764 0.363137 NA

So, my questions are: 1- Why are the INFO scores different? This INFO score also differs from the one originally released by UKBB based on QCTOOL (the difference between REGENIE and QCTOOL INFO scores is much smaller for autosomal chromosomes). 2- What's the difference between using BGEN and PGEN here? Aren't they the same, at least for UKBB (it seems males are already coded as diploid)?

Thanks! Oveis

joellembatchou commented 2 years ago

Hi Oveis,

1- For PGEN format, REGENIE uses the Mach Rsq INFO score, whereas for BGEN format it uses the IMPUTE INFO score (see here for details). The IMPUTE INFO score requires genotype probabilities, which is why we don't use it for PGEN, which only contains dosages. 2- There is no difference between the file formats as long as you have the males coded as 0/2 for non-PAR X.

Cheers, Joelle
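
To illustrate the two scores mentioned above, here is a minimal sketch using the textbook definitions: Mach Rsq as the observed dosage variance divided by the binomial expectation 2p(1-p), and the IMPUTE INFO score as one minus the ratio of the remaining genotype uncertainty to that same expectation. The function names and the toy data are hypothetical, and REGENIE's actual implementation may differ in detail. The example also shows one plausible reason (an assumption, not confirmed in this thread) why Mach Rsq can exceed 1 on non-PAR X: males coded 0/2 have twice the dosage variance the diploid model expects.

```python
from statistics import fmean, pvariance


def mach_rsq(dosages):
    """Observed dosage variance divided by the binomial expectation 2p(1-p)."""
    p = fmean(dosages) / 2.0  # allele frequency estimated from the mean dosage
    return pvariance(dosages) / (2.0 * p * (1.0 - p))


def impute_info(probs):
    """IMPUTE-style INFO from per-sample (P(g=0), P(g=1), P(g=2)) triples."""
    n = len(probs)
    e = [p1 + 2.0 * p2 for _, p1, p2 in probs]  # expected dosage per sample
    f = [p1 + 4.0 * p2 for _, p1, p2 in probs]  # expected squared dosage
    theta = sum(e) / (2.0 * n)                  # allele frequency estimate
    uncertainty = sum(fi - ei * ei for fi, ei in zip(f, e))
    return 1.0 - uncertainty / (2.0 * n * theta * (1.0 - theta))


# Toy data (hypothetical): perfectly imputed genotypes, allele frequency 0.5.
female_probs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
male_probs = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]  # coded 0/2

all_probs = female_probs + male_probs
all_dosages = [p1 + 2.0 * p2 for _, p1, p2 in all_probs]
female_dosages = [p1 + 2.0 * p2 for _, p1, p2 in female_probs]

rsq_females = mach_rsq(female_dosages)  # diploid samples only
rsq_all = mach_rsq(all_dosages)         # females plus 0/2-coded males
info_all = impute_info(all_probs)       # probabilities are all certain

print(rsq_females, rsq_all, info_all)
```

In this toy setup the IMPUTE-style score stays at 1 because every genotype probability is certain, while Mach Rsq rises above 1 once the 0/2-coded males are included. An INFO value above 1 from PGEN input on chromosome X is therefore consistent with the difference in metrics rather than with a difference in the underlying data.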