odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

Bug for nonPAR chrX: heterozygous haploids in the phased output. #100

Open dtaliun opened 2 months ago

dtaliun commented 2 months ago

Hi,

When phasing non-PAR X, if present in the input VCF/BCF, the heterozygous haploids are not correctly set to missing. Here is the problematic code:

for (uint32_t v = 0 ; v < n_variants ; v++) {
    if (VAR_GET_AMB(MOD2(v), Variants[DIV2(v)])) {
            VAR_SET_MIS(MOD2(v), Variants[DIV2(v)]);
            nreset ++;
        }
    }

The first 2 bits are 10 for het, and it is picked up by VAR_GET_AMB. Next, VAR_SET_MIS sets the last two bits to 11. However, code 11 is not recognized as the code for missing genotype by the VAR_GET_MIS (in fact, it is recognized by VAR_GET_SCA only). The problem is that only VAR_GET_MIS is used later, for example, to check if the missing values should be imputed. As a result, the heterozygous haploids are being propagated to the final phased output.

One option to resolve this is to clear the first 2 bits before setting them to code 01 (missing) e.g.:

#define VAR_CLEAR_FLAG(e, v)    ((v)&=(~(3<<((e)<<2))))

for (uint32_t v = 0 ; v < n_variants ; v++) {
    if (VAR_GET_AMB(MOD2(v), Variants[DIV2(v)])) {
                    VAR_CLEAR_FLAG(MOD2(v), Variants[DIV2(v)]);
            VAR_SET_MIS(MOD2(v), Variants[DIV2(v)]);
            nreset ++;
        }
    }

Note for the users who read this issue: If you are sure that you don't have haploids (i.e. males) on non-PAR chrX with het values, then everything should be fine. Use bcftools +fixploidy and bcftools +check-ploidy plugins to check your data. If your haploids in your input VCF/BCF are coded according to the spec with single GT value (i.e. 0 instead of 0/0 and 1 instead of 1/1), the see issue #99 .

dtaliun commented 2 months ago

Following up on this issue. Looking at your scripts/documentation, I can't find any information if a check for the presence of het genotypes in non-PAR regions in males was done before phasing UK Biobank data. I.e. do you think this issue may have had an effect in the UK Biobank phasing of chrX nonPAR? Thanks!