samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

BCF Character/String type MISSING/EOV encoding #618

Open andersleung opened 2 years ago

andersleung commented 2 years ago

In BCF, the spec gives no MISSING or EOV encoding for the Character/String type. htslib and GenomicsDB define MISSING and EOV for String/Character as 0x07 and 0x00 respectively. However, htslib seems to apply this in one direction only: it converts 0x07 to . when converting BCF to VCF, but does not convert . to 0x07 when writing VCF as BCF.

My question is how a VCF record with missing Characters and missing Strings is encoded in BCF. If the spec follows htslib, I think a missing Character should be defined as a length-1 String whose only byte is 0x07, and a missing String, being an entirely missing vector of Character, would be [0x07,0x00,0x00,...] because of https://github.com/samtools/hts-specs/pull/617.
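To make the proposed encoding concrete, here is a minimal Python sketch. It assumes the 0x07 (MISSING) and 0x00 (EOV) values reported above for htslib and GenomicsDB. The function names and the `width` parameter are hypothetical and are not part of htslib or the spec. It covers only the payload bytes, not the BCF type descriptor that precedes them.

```python
# Values reported in this issue for htslib/GenomicsDB; the spec does not define them.
STR_MISSING = 0x07
STR_EOV = 0x00


def encode_char(value):
    """Encode one Character; None stands for a VCF '.' (missing)."""
    if value is None:
        return bytes([STR_MISSING])
    raw = value.encode("ascii")
    if len(raw) != 1:
        raise ValueError("a Character must be exactly one ASCII byte")
    return raw


def encode_string(value, width):
    """Encode a String as a Character vector padded with EOV to `width` bytes."""
    payload = bytes([STR_MISSING]) if value is None else value.encode("ascii")
    if len(payload) > width:
        raise ValueError("value does not fit in the requested width")
    return payload + bytes([STR_EOV]) * (width - len(payload))


def decode_string(raw):
    """Inverse of encode_string: cut at the first EOV, map a lone MISSING to '.'."""
    payload = raw.split(bytes([STR_EOV]), 1)[0]
    if payload == bytes([STR_MISSING]):
        return "."
    return payload.decode("ascii")
```

Under these assumptions, `encode_string(None, 4)` yields the bytes 0x07 0x00 0x00 0x00, and `decode_string` maps them back to `.`, so the round trip is symmetric in both directions.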

As a separate issue, it is not well defined what the Character type in VCF means. In BCF, Character is one 7-bit ASCII byte, but in VCF, which is UTF-8 encoded, Character could be a byte, a Unicode code point, or a grapheme.
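The three readings can disagree on the same text. The sketch below is illustrative only: it uses an "e" followed by a combining acute accent, which renders as a single grapheme. The helper name `fits_bcf_character` is hypothetical. Python's standard library has no grapheme segmenter, so the grapheme count is stated in a comment rather than computed.

```python
import unicodedata

# "e" + U+0301 COMBINING ACUTE ACCENT: displayed as one grapheme.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

byte_count_decomposed = len(decomposed.encode("utf-8"))  # UTF-8 bytes
codepoint_count_decomposed = len(decomposed)             # Unicode code points
byte_count_composed = len(composed.encode("utf-8"))
codepoint_count_composed = len(composed)


def fits_bcf_character(text):
    """True if `text` is exactly one 7-bit ASCII byte, the BCF reading of Character."""
    raw = text.encode("utf-8")
    return len(raw) == 1 and raw[0] < 0x80
```

Here the decomposed form is 3 bytes and 2 code points, while the composed form is 2 bytes and 1 code point. Neither fits in a single 7-bit ASCII byte, so neither could be stored as a BCF Character under the byte reading.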

h-2 commented 2 years ago

I second this.

The specification of (partly) empty vectors is really imprecise. See also #593.