odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
66 stars 9 forks source link

Phasing chromosome X is wrong when males are encoded as haploids in input VCF/BCF #99

Open dtaliun opened 5 months ago

dtaliun commented 5 months ago

Hi,

Phasing results for chromosome X non-PAR regions are wrong when providing input VCF/BCF files with haploid male genotypes represented as 0 and 1 instead of 0/0 and 1/1.

The issue is in the lines 89-91 in the genotype_reader::readGenotypes():

for(int32_t i = 0 ; i < 2 * n_main_samples ; i += 2) {
    bool a0 = (bcf_gt_allele(main_buffer[i+0])==1);
    bool a1 = (bcf_gt_allele(main_buffer[i+1])==1);
    bool mi = (main_buffer[i+0] == bcf_gt_missing || main_buffer[i+1] == bcf_gt_missing);

When the genotype is haploid, the second value (i.e. a1) will hold bcf_int32_vector_end, denoting the end of genotype. Thus, current code will transform GT = '1' to GT = '1/0', which is heterozygous. All heterozygous calls will be replaced by missing values, which will completely invalidate phasing results.

I suggest the following change to support both haploid male gemotype coding (i.e. GT=0 and 1; and GT = 0/0 and 1/1) in input VCF/BCF:

for(int32_t i = 0 ; i < 2 * n_main_samples ; i += 2) {
    bool a0 = (bcf_gt_allele(main_buffer[i+0])==1);
    bool a1 = (main_buffer[i + 1] == bcf_int32_vector_end) ? a0 : (bcf_gt_allele(main_buffer[i+1])==1);
        bool mi = (main_buffer[i + 1] == bcf_int32_vector_end) ? (main_buffer[i+0] == bcf_gt_missing) : (main_buffer[i+0] == bcf_gt_missing || main_buffer[i+1] == bcf_gt_missing);

P.S. I caught this when I got an error similar to Issue #33 . Then, I noticed that phase_common phased many of haploid samples as hets.