In BCF, the spec gives no MISSING or EOV encoding for the Character/String type. htslib and GenomicsDB define MISSING and EOV for String/Character as `0x07` and `0x00` respectively. However, htslib only seems to convert `0x07` to `.` when converting BCF to VCF; it does not convert `.` to `0x07` when writing VCF as BCF.
My question is how a VCF record with missing Characters and missing Strings is encoded in BCF. If the spec follows htslib, I think a missing Character should be defined as a length-1 String whose only byte is `0x07`, and a missing String, being an entirely missing vector of Character, would be `[0x07,0x00,0x00,...]` because of https://github.com/samtools/hts-specs/pull/617.
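To make the proposal concrete, here is a minimal sketch of the encoding I have in mind. It only produces the payload bytes (no BCF type descriptor), `encode_string_payload` and its `width` parameter are hypothetical names of my own, and the `0x07`/`0x00` values are the htslib conventions mentioned above, not anything the spec currently states.

```python
MISSING = 0x07  # htslib's byte for a missing Character/String (assumed, see above)
EOV = 0x00      # htslib's end-of-vector / padding byte (assumed, see above)


def encode_string_payload(value, width):
    """Hypothetical helper: encode one String as `width` payload bytes.

    A missing value (None or ".") becomes a single MISSING byte;
    any remaining room is padded with EOV bytes.
    """
    if value is None or value == ".":
        raw = bytes([MISSING])
    else:
        # BCF Characters are single 7-bit ASCII bytes, so reject anything else.
        raw = value.encode("ascii")
    if len(raw) > width:
        raise ValueError("value does not fit in the requested width")
    return raw + bytes([EOV]) * (width - len(raw))


missing_char = encode_string_payload(".", 1)    # one byte: 0x07
missing_string = encode_string_payload(".", 4)  # 0x07 followed by 0x00 padding
present_string = encode_string_payload("AB", 4)
```

Under this sketch a missing Character and a missing String differ only in how much `0x00` padding follows the leading `0x07`.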
As a separate issue, it's not well defined what the Character type in VCF means. In BCF, a Character is one 7-bit ASCII byte, but in VCF, which is UTF-8 encoded, a Character could be a byte, a Unicode code point, or a grapheme.
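A small illustration of why the three readings disagree. The example string is my own choice, not taken from any real VCF file: an `e` followed by a combining acute accent, which renders as a single user-perceived character.

```python
# 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme on screen.
s = "e\u0301"

byte_count = len(s.encode("utf-8"))  # UTF-8 bytes
codepoint_count = len(s)             # Unicode code points
fits_bcf_character = s.isascii() and len(s) == 1  # one 7-bit ASCII byte?

print(byte_count, codepoint_count, fits_bcf_character)
```

So the same VCF value is three "Characters" if counted in bytes, two if counted in code points, and one if counted in graphemes, and none of those readings fits in a single BCF Character.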