samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

bcftools norm (join biallelics) turns phased into unphased GTs #2090

Closed tobiasrausch closed 5 months ago

tobiasrausch commented 5 months ago

Hi,

For phased input genotypes, bcftools norm -m +any ... produces unphased and incorrect genotypes. Here is a minimal example to reproduce the bug:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096
chr1    215871  ID1 GGAATAGGCTCGAATGGTATGGAATGCAATGGAATGGACTCGAATGGAATAGAACA    G   .   .   .   GT  1|0
chr1    215871  ID2 G   A   .   .   .   GT  0|0
chr1    215871  ID3 GGAATAGGCTCGAATGGTATGGAATGCAATGGAATGGACTCGAATGGAATAGAACA    G   .   .   .   GT  1|0

This yields a GT of 1/1 with the latest bcftools code. bcftools version 1.13 produces the correct output:

chr1    215871  ID1;ID2;ID3 GGAATAGGCTCGAATGGTATGGAATGCAATGGAATGGACTCGAATGGAATAGAACA    G,AGAATAGGCTCGAATGGTATGGAATGCAATGGAATGGACTCGAATGGAATAGAACA  .   .   .   GT  1|0

Thanks for looking into this!

Best, Tobias

pd3 commented 5 months ago

Thanks for the issue.

The problem is caused by the deletions ID1 and ID3 being identical. The program is not expecting this type of duplication. Assuming they are different, it can avoid the phase conflict either by discarding the variant or by breaking the phase, hence outputs ID1;ID3 1/1.

If the two variants were phased on input as ID1 1|0 and ID3 0|1, the program would preserve the phase and output ID1;ID3 1|1.

Now the question is: are the variants really identical, or is just an unfortunate example? If they are not supposed to be identical, then the program is already doing the right thing. However, if the example really meant them to be identical, then that raises a question where is the right place to fix this - at the side of the VCF producer, or is it worth improving the norm algorithm to handle cases like this.

tobiasrausch commented 5 months ago

Thanks for the explanation. Indeed most of the unpased GTs occur at phase conflicts and I agree, breaking the phase is probably better than discarding variants. For the duplicates, you already have the -D option and

bcftools norm -D -m +any ...

produces the correct result. I am sorry that's all fine then.