samtools / bcftools

This is the official development repository for BCFtools. Installation instructions and other documentation are available at http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

bcftools annotate can output an INFO field with unquoted semicolons #2202

Closed jkmatila closed 4 months ago

jkmatila commented 4 months ago

bcftools annotate can output an INFO field value containing unquoted semicolons (;). When the file is parsed, the part after the semicolon is interpreted as another INFO field. If that part also contains a comma, bcftools view cannot read the resulting file and fails with an error.

Steps to reproduce:

A minimal VCF file to annotate:

$ cat repro.vcf
##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  repro
chr20   33791101    .   GC  G   .   .   .   GT  0/1

Annotations file, containing an annotation value that contains a semicolon:

$ cat annots.txt
chr20   33791101    GC  G   ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47

A header line to use for the new annotation:

$ cat header.txt
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">

Annotating the VCF file:

$ bgzip annots.txt
$ tabix -s 1 -b 2 -e 2 annots.txt.gz
$ bcftools annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf > out.vcf

The result is a VCF file in which the INFO field separator (;) appears unquoted inside the FOO value:

$ cat out.vcf
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  repro
chr20   33791101    .   GC  G   .   .   FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47 GT  0/1

bcftools view does not accept this file: it parses the part after the semicolon as another INFO field and tries to create a dummy header line for it, which fails because of the embedded comma:

$ bcftools view out.vcf
[W::vcf_parse_info] INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##INFO=<ID=ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47,Number=1,Type=String,Description=\"Dummy\">"
[E::vcf_parse_info] Could not add dummy header for INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' at chr20:33791101
Error: VCF parse error
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
##bcftools_viewVersion=1.20+htslib-1.20
##bcftools_viewCommand=view out.vcf; Date=Fri May 31 10:15:56 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  repro
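
The split behind this error can be mimicked outside bcftools. The sketch below is only an illustration: split_info is a hypothetical helper that cuts an INFO string on ';' with tr, and it is not htslib's actual parser.

```shell
# Hypothetical helper: split an INFO string on ';', one entry per line,
# roughly the way a VCF reader separates INFO fields.
split_info() {
    printf '%s\n' "$1" | tr ';' '\n'
}

# The INFO column from out.vcf above
split_info 'FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47'
```

The second line printed has no '=', so a reader treats the whole string, comma included, as the ID of an undefined INFO field.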

Additional information

VCF v4.3 spec, Section 1.2 says:

Some characters have a special meaning when they appear (such as field delimiters ‘;’ in INFO or ‘:’ FORMAT fields), and for any other meaning they must be represented with the capitalized percent encoding; [...]
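
On an affected version (1.20 here), one possible workaround is to percent-encode the annotation column before running bgzip and tabix. The sketch below assumes a tab-separated annotations file laid out like annots.txt; encode_info_col is a hypothetical helper, not part of bcftools. It encodes only '%', ';' and '=' and leaves commas untouched, which is an assumption about how the values are meant to be read.

```shell
# Hypothetical helper (not part of bcftools): percent-encode '%', ';' and '='
# in one tab-separated column, given as a 1-based index in $1.
encode_info_col() {
    awk -v col="$1" 'BEGIN { FS = OFS = "\t" }
    {
        gsub(/%/, "%25", $col)     # encode the escape character first
        gsub(/;/, "%3B", $col)     # INFO field separator
        gsub(/[=]/, "%3D", $col)   # INFO key/value separator
        print
    }'
}

# Demo on a made-up line in the same layout as annots.txt
printf 'chr20\t33791101\tGC\tG\tx:c.1delC,x:p.K1R;y:c.2delC\n' | encode_info_col 5
```

For the real file, the function would be fed annots.txt on standard input with column 5 selected, and its output compressed and indexed as in the steps above. Downstream tools then see %3B in the FOO value and must decode it themselves.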

bcftools version

$ bcftools version
bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Files used in the steps to reproduce

repro.zip

pd3 commented 4 months ago

The program now makes sure characters with special meaning are encoded.