pd3 closed this issue 1 year ago
I traced the problem down to https://github.com/samtools/htslib/blob/a59bc5256ee4f539ae47edd575dc1340356d922a/tbx.c#L211-L217
The tbx_readrec() function returns the length of the string on success, whereas the corresponding bcf_readrec() returns 0. The iterator then returns this value via hts_itr_next():
https://github.com/samtools/htslib/blob/a59bc5256ee4f539ae47edd575dc1340356d922a/hts.c#L3909-L3917
Arguably the iterator should behave consistently across all data types.
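To illustrate what consistent behaviour could look like, here is a minimal, self-contained sketch. It does not use htslib: toy_src, text_readrec, binary_readrec and toy_itr_next are hypothetical names invented for this example, and the convention "0 on success, negative on end of data or error" is an assumption about what a unified iterator contract might be, not a statement of what htslib does or should do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reader callback: stores the next record in *out.
 * Returns >= 0 on success, -1 at end of data, < -1 on error. */
typedef int (*readrec_fn)(void *src, const char **out);

/* Toy data source: an array of text lines plus a cursor. */
typedef struct {
    const char **lines;
    size_t n, pos;
} toy_src;

/* Text-style reader: returns the line length on success, mirroring the
 * behaviour described above for tbx_readrec(). */
int text_readrec(void *src, const char **out)
{
    toy_src *s = src;
    if (s->pos >= s->n) return -1;
    *out = s->lines[s->pos++];
    return (int) strlen(*out);
}

/* Binary-style reader: returns 0 on success, mirroring the behaviour
 * described above for bcf_readrec(). */
int binary_readrec(void *src, const char **out)
{
    toy_src *s = src;
    if (s->pos >= s->n) return -1;
    *out = s->lines[s->pos++];
    return 0;
}

/* Iterator step that hides the difference: any non-negative value from
 * the reader is reported as 0; negative values pass through unchanged. */
int toy_itr_next(readrec_fn readrec, void *src, const char **out)
{
    assert(readrec != NULL && src != NULL && out != NULL);
    int ret = readrec(src, out);
    return ret >= 0 ? 0 : ret;
}
```

With this wrapper, a caller that checks for a return value of exactly 0 behaves the same for both readers, and a caller that checks for a non-negative value is unaffected.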
Continuing with this, it seems that bcf_sr_seek is the wrong solution for the task: bcftools consensus wants to query a region similarly to bcf_sr_set_regions, but on the fly. Instead, the function attempts to seek to a contig and, if the contig is not present, enters a streaming-like mode based on the order of contigs in the header (BCF) or tabix index. In this mode, repeated seeks to a non-existent location lead to different outcomes from the subsequent bcf_sr_next_line calls.
A desired fix should (1) make repeated seek+next_line calls consistent and (2) provide an on-the-fly alternative to bcf_sr_set_regions.
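The two requirements can be sketched with a toy in-memory reader. This is not htslib code and not a proposed patch: toy_rec, toy_reader, toy_seek and toy_next are hypothetical names, and the linear scan stands in for an index lookup. The only point is the contract: each seek rebuilds the query state from scratch, so repeating a seek to a missing contig always gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory record: contig name and position. */
typedef struct { const char *contig; long pos; } toy_rec;

/* Hypothetical reader over records grouped by contig and sorted by
 * position within each contig. */
typedef struct {
    const toy_rec *recs;
    size_t n;
    size_t cur;          /* index of the next candidate record */
    const char *contig;  /* contig of the active query, NULL if none */
} toy_reader;

/* Start a query at (contig, pos). Every call resets the cursor first,
 * so the outcome never depends on earlier seeks.
 * Returns 0 if the contig exists, -1 otherwise. */
int toy_seek(toy_reader *r, const char *contig, long pos)
{
    assert(r != NULL && contig != NULL);
    r->contig = NULL;
    r->cur = r->n;
    int found = 0;
    for (size_t i = 0; i < r->n; i++) {
        if (strcmp(r->recs[i].contig, contig) != 0) continue;
        found = 1;
        if (r->recs[i].pos >= pos) { r->cur = i; break; }
    }
    if (!found) return -1;
    r->contig = contig;
    return 0;
}

/* Return the next record of the active query, or NULL when the query is
 * exhausted or the last seek failed. */
const toy_rec *toy_next(toy_reader *r)
{
    assert(r != NULL);
    if (r->contig == NULL || r->cur >= r->n) return NULL;
    const toy_rec *rec = &r->recs[r->cur];
    if (strcmp(rec->contig, r->contig) != 0) return NULL;
    r->cur++;
    return rec;
}
```

In this sketch a failed seek leaves the reader in a well-defined "no active query" state rather than falling back to streaming, which is one possible way to satisfy (1); the seek itself doubles as the on-the-fly region query asked for in (2).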
When bcf_sr_seek() is called on a non-existent region, the return status differs between VCF and BCF, and the internal state seems to change in a way that makes a subsequent call of bcf_sr_next_line fail. Consider the following example of the VCF/BCF file:
and the code
Running on VCF vs BCF we get
This difference is the inherent cause of https://github.com/samtools/bcftools/issues/1918