Open dtaliun opened 2 months ago
Following up on this issue. Looking at your scripts/documentation, I can't find any information if a check for the presence of het genotypes in non-PAR regions in males was done before phasing UK Biobank data. I.e. do you think this issue may have had an effect in the UK Biobank phasing of chrX nonPAR? Thanks!
Hi,
When phasing non-PAR X, if present in the input VCF/BCF, the heterozygous haploids are not correctly set to missing. Here is the problematic code:
The first 2 bits are 10 for het, and it is picked up by
VAR_GET_AMB
. Next,VAR_SET_MIS
sets the last two bits to 11. However, code 11 is not recognized as the code for missing genotype by theVAR_GET_MIS
(in fact, it is recognized byVAR_GET_SCA
only). The problem is that onlyVAR_GET_MIS
is used later, for example, to check if the missing values should be imputed. As a result, the heterozygous haploids are being propagated to the final phased output.One option to resolve this is to clear the first 2 bits before setting them to code 01 (missing) e.g.:
Note for the users who read this issue: If you are sure that you don't have haploids (i.e. males) on non-PAR chrX with het values, then everything should be fine. Use
bcftools +fixploidy
andbcftools +check-ploidy
plugins to check your data. If your haploids in your input VCF/BCF are coded according to the spec with single GT value (i.e. 0 instead of 0/0 and 1 instead of 1/1), the see issue #99 .