samtools / htslib

C library for high-throughput sequencing data formats

Fix indexing bug by flushing BCF bgzf stream after header write #1742

Closed daviesrob closed 4 months ago

daviesrob commented 4 months ago

bcf_idx_init() calls bgzf_tell() to get the starting index offset. This was OK when single-threaded but broke with multiple threads, because bgzf_tell() lies about the file offset unless bgzf_flush() has been called first. SAM.gz, BAM and VCF.gz all made that flush call already, but BCF didn't, leading to an incorrect first index entry when combining multiple threads with indexing on the fly. Fix by adding the missing bgzf_flush() after writing the header.
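To illustrate why the flush matters, here is a minimal toy model in C. It is not htslib code: `toy_writer`, `toy_write`, `toy_flush`, `toy_tell` and `first_record_offset` are hypothetical names, and the model assumes only that bytes handed to worker threads have no known file position until the queue is drained. The real BGZF internals are more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy writer, NOT htslib's BGZF struct. */
typedef struct {
    uint64_t committed; /* bytes that already have a position in the file */
    size_t   pending;   /* bytes queued for worker threads, not yet placed */
} toy_writer;

/* Queue n bytes. In this model the file position is not updated yet. */
static void toy_write(toy_writer *w, size_t n)
{
    w->pending += n;
}

/* Drain the queue so every queued byte gets a known file position. */
static void toy_flush(toy_writer *w)
{
    w->committed += w->pending;
    w->pending = 0;
}

/* Report the position. Only trustworthy when nothing is pending. */
static uint64_t toy_tell(const toy_writer *w)
{
    return w->committed;
}

/* Write a header of header_len bytes, then return the offset that an
 * index would record for the first record. */
static uint64_t first_record_offset(size_t header_len, int flush_after_header)
{
    toy_writer w = {0, 0};

    toy_write(&w, header_len);
    if (flush_after_header)
        toy_flush(&w);      /* the step that was missing for BCF */
    return toy_tell(&w);
}
```

In this model, `first_record_offset(300, 0)` returns 0, which points at the header rather than the first record, while `first_record_offset(300, 1)` returns 300.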

As a side benefit, the BCF variant records will now start in a fresh BGZF block, instead of being mixed in with part of the BCF header.

test/index.bcf.csi has to be replaced due to the extra flush adding one more block to the (uncompressed) index.bcf file that gets generated by the test harness.

Fixes #1740