Open pd3 opened 7 years ago
Mostly a duplicate of #138.
In order to get this ticket and #138 resolved, could any of you please confirm that the only change required in the spec is to state that the maximum header size is 4 bytes? Thanks!
To document the existing format, the spec needs to say the things that the quick reference document says: that there is a BCF\2\2 magic string (BCF\2\1 in the 4.1 and 4.2 specs) followed by a 4-byte uint32_t
length followed by the NUL
-terminated VCF header text. In particular, it needs to mention the length field and the NUL termination, neither of which it currently mentions. The best way to do this would be to copy the Field/magic/l_text/text rows from the table in the quick reference, so that this is described via a similar table to BCFv2_qref or BAM in SAMv1 §4.2.
The quick reference document (which is currently the only documentation of this outwith the implementations) says that the VCF header text is always NUL-terminated and that _ltext includes the NUL padding character. (This differs from BAM, where the NUL is optional.) In practice, HTSlib has always written a NUL terminator here and it appears HTSlib has been able to accept non-NUL-terminated BCF headers since v1.4 (see samtools/htslib@37c85e4b9737abc88871daf724552959b612e7cb). I haven't researched what other implementations require w.r.t. NUL termination. I would suggest the compatible route is to stick with the existing documentation from the qref and require the NUL terminator character.
Whether and how to extend this (perhaps in a future VCF version) to allow more than 4Gb of headers is a separate question.
I added the "enhancement" label. This issue was not intended as a duplicate of #138, which only comments on the missing l_text field in the specification, but asks if the new version of BCF shouldn't extend this field to 8 bytes.
The BCF specification does not explicitly state what is the maximum header size, but htslib implementation assumes a 4 byte integer https://github.com/samtools/htslib/blob/develop/vcf.c#L909-L913
This is a problem for assemblies, with millions of contigs the header size can easily exceed the limit (https://github.com/samtools/bcftools/issues/672).
It would be good explicitly state this limit in the specification. Also in the next version of BCF it might be worthwhile to reserve 8 bytes for the header length.