samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Escaped doublequotes in INFO descriptions result in invalid VCF file #1661

Open bartcharbon opened 1 year ago

bartcharbon commented 1 year ago

Edit 14/03: verified that this also occurs in version 3.0.4

Description of the issue:

When I add a header line whose description contains escaped double quotes, the escaping backslash is sometimes dropped, resulting in an invalid VCF file.

Your environment:

Steps to reproduce

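// Start from the input file's header (annotator is my own class)
VCFHeader newHeader = annotator.annotateHeader(vcfFileReader.getFileHeader());

// Add a FORMAT line whose description starts and ends with a double quote
newHeader.addMetaDataLine(new VCFFormatHeaderLine("TEST", VCFHeaderLineCount.A, VCFHeaderLineType.String, "\"TEST\""));

writer.writeHeader(newHeader);
//... write variants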

Expected behaviour

A VCF file is written with a FORMAT header line in which both quotes are escaped: ##FORMAT=<ID=TEST,Number=A,Type=String,Description="\"TEST\"">

Actual behaviour

A VCF file is written with a FORMAT header line in which only the closing quote is escaped: ##FORMAT=<ID=TEST,Number=A,Type=String,Description=""TEST\"">

The backslash for the first escaped double quote is missing.

bartcharbon commented 1 year ago

Addition: this seems to be happening only for escaped quotes at the very start of the description

cmnbroad commented 1 year ago

Thanks for the bug report. Looks like the internal representation is correct ("""TEST""), but it gets serialized as ""TEST\"" by VCFHeaderLine.escapeQuotes.
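
For anyone following along, here is a minimal standalone sketch of how an escaping step could produce exactly this symptom. It is not copied from htsjdk: the class and both method names are hypothetical, and the first regex is only an assumption about the kind of pattern that would behave this way. A pattern that needs a non-backslash character in front of the quote can never match a quote at index 0, which would fit the observation that only a quote at the very start of the description is affected. The second method shows one possible alternative using a negative lookbehind.

```java
// Hypothetical sketch, not the htsjdk implementation.
class EscapeQuotesSketch {

    // Assumed buggy shape: regex ([^\\])" needs a character before the quote,
    // so a quote at the very start of the string is never matched.
    static String escapeNeedsPrecedingChar(String value) {
        return value.replaceAll("([^\\\\])\"", "$1\\\\\"");
    }

    // Possible alternative: regex (?<!\\)" matches any quote that is not
    // already preceded by a backslash, including one at index 0.
    static String escapeWithLookbehind(String value) {
        return value.replaceAll("(?<!\\\\)\"", "\\\\\"");
    }

    public static void main(String[] args) {
        String description = "\"TEST\""; // in memory: "TEST" including the quotes

        // prints Description=""TEST\""
        System.out.println("Description=\"" + escapeNeedsPrecedingChar(description) + "\"");

        // prints Description="\"TEST\""
        System.out.println("Description=\"" + escapeWithLookbehind(description) + "\"");
    }
}
```

Both variants leave a quote that is already preceded by a backslash untouched, so the only difference in this sketch is the handling of a quote with no character in front of it.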