samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
643 stars 174 forks source link

The maximum BCF header size #236

Open pd3 opened 7 years ago

pd3 commented 7 years ago

The BCF specification does not explicitly state what is the maximum header size, but htslib implementation assumes a 4 byte integer https://github.com/samtools/htslib/blob/develop/vcf.c#L909-L913

This is a problem for assemblies, with millions of contigs the header size can easily exceed the limit (https://github.com/samtools/bcftools/issues/672).

It would be good explicitly state this limit in the specification. Also in the next version of BCF it might be worthwhile to reserve 8 bytes for the header length.

jmarshall commented 7 years ago

Mostly a duplicate of #138.

cyenyxe commented 6 years ago

In order to get this ticket and #138 resolved, could any of you please confirm that the only change required in the spec is to state that the maximum header size is 4 bytes? Thanks!

jmarshall commented 6 years ago

To document the existing format, the spec needs to say the things that the quick reference document says: that there is a BCF\2\2 magic string (BCF\2\1 in the 4.1 and 4.2 specs) followed by a 4-byte uint32_t length followed by the NUL-terminated VCF header text. In particular, it needs to mention the length field and the NUL termination, neither of which it currently mentions. The best way to do this would be to copy the Field/magic/l_text/text rows from the table in the quick reference, so that this is described via a similar table to BCFv2_qref or BAM in SAMv1 §4.2.

The quick reference document (which is currently the only documentation of this outwith the implementations) says that the VCF header text is always NUL-terminated and that _ltext includes the NUL padding character. (This differs from BAM, where the NUL is optional.) In practice, HTSlib has always written a NUL terminator here and it appears HTSlib has been able to accept non-NUL-terminated BCF headers since v1.4 (see samtools/htslib@37c85e4b9737abc88871daf724552959b612e7cb). I haven't researched what other implementations require w.r.t. NUL termination. I would suggest the compatible route is to stick with the existing documentation from the qref and require the NUL terminator character.

Whether and how to extend this (perhaps in a future VCF version) to allow more than 4Gb of headers is a separate question.

pd3 commented 5 years ago

I added the "enhancement" label. This issue was not intended as a duplicate of #138, which only comments on the missing l_text field in the specification, but asks if the new version of BCF shouldn't extend this field to 8 bytes.