samtools / htslib

C library for high-throughput sequencing data formats

bcftools corrupts duplicate GT format fields #1733

Closed anthakki closed 3 months ago

anthakki commented 5 months ago

Running bcftools (here filter, but other commands seem to be affected as well) on a VCF with duplicate GT (genotype) FORMAT fields seems to change every GT value except the last one. It looks like ./., 0/0, 0/1, and 1/1 get converted to 0,0, 2,2, 2,4, and 4,4, respectively. I'm not 100% sure whether duplicate GT fields are legal, but I would expect an error rather than invalid data. Non-GT fields don't seem to have the problem. I'm using bcftools 1.19, but this can also be reproduced with bcftools 1.12.
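The converted values look like the raw integers that BCF uses to store genotypes internally, where each allele is encoded as (allele index + 1) << 1, with the low bit marking phasing and a missing allele treated as index -1. The sketch below is only an illustration of that encoding, not code from htslib or bcftools; encode_gt is a hypothetical helper name.

```python
import re

def encode_gt(gt):
    """Return the raw integers BCF would store for a VCF GT string.

    Hypothetical helper for illustration; it mirrors the documented
    encoding (allele_index + 1) << 1 | phased_bit.
    """
    encoded = []
    # Each match is (separator before the allele, allele text).
    for sep, allele in re.findall(r"([/|]?)([^/|]+)", gt):
        phased = 1 if sep == "|" else 0
        # A missing allele "." is treated as index -1, so it encodes to 0
        # when unphased.
        index = -1 if allele == "." else int(allele)
        encoded.append(((index + 1) << 1) | phased)
    return encoded

for gt in ("./.", "0/0", "0/1", "1/1"):
    print(gt, "->", ",".join(str(v) for v in encode_gt(gt)))
```

If this reading is right, the earlier GT copies are being written out as their undecoded integer vectors rather than being converted back to allele strings.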

A minimized test case follows. I would expect the payload of the output to match that of the input.

$ cat foo.vcf 
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=X,Number=1,Type=Integer>
##contig=<ID=chr1>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  A
chr1    1   .   A   C   .   .   .   GT:X:GT:X   0/1:9:0/1:9
$ bcftools filter foo.vcf | sed '/^#/d'
chr1    1   .   A   C   .   PASS    .   GT:X:GT:X   2,4:9:0/1:9

pd3 commented 5 months ago

Duplicate tags are not allowed. I am not sure if it is explicitly stated in the VCF specification, but that was the intention.

The parsing is done in htslib; ideally it should give a warning and drop the duplicate fields. Obviously, the easiest solution is to avoid producing invalid VCFs :)
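Until the parser handles this, one workaround is to clean the records before they reach bcftools. The following is a minimal sketch, assuming tab-separated VCF data lines and assuming the first occurrence of each FORMAT key is the one to keep (the thread does not say which copy should win). drop_duplicate_format_keys is a hypothetical helper, not part of htslib or bcftools.

```python
def drop_duplicate_format_keys(line):
    """Keep only the first occurrence of each FORMAT key in a VCF line.

    Hypothetical helper; header lines and lines without FORMAT/sample
    columns are returned unchanged (minus the trailing newline).
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) < 10:
        return line

    keys = fields[8].split(":")
    seen = set()
    keep = []  # positions of the first occurrence of each key
    for i, key in enumerate(keys):
        if key not in seen:
            seen.add(key)
            keep.append(i)
    if len(keep) == len(keys):
        return line  # nothing duplicated

    fields[8] = ":".join(keys[i] for i in keep)
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        # Samples may omit trailing values, so skip positions past the end.
        fields[col] = ":".join(values[i] for i in keep if i < len(values))
    return "\t".join(fields)


record = "chr1\t1\t.\tA\tC\t.\t.\t.\tGT:X:GT:X\t0/1:9:0/1:9"
print(drop_duplicate_format_keys(record))
```

Applied to the record from the test case, this leaves GT:X as the FORMAT column and 0/1:9 as the sample column.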