samtools / htslib

C library for high-throughput sequencing data formats
Other
809 stars 446 forks source link

hts_itr_query fails on empty BCFs with gapped tid blocks #1534

Closed pd3 closed 1 year ago

pd3 commented 1 year ago

This is related to https://github.com/samtools/htslib/issues/1533 and concerns BCFs with edited headers and missing data records. Consider this example

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr22\t1\t.\tA\tC\t.\t.\t.' > test.vcf
$ bcftools view test.vcf -o test.bcf
$ bcftools index test.bcf

Note the header line contains the field IDX=1 which makes it behave as if BCF was edited and the first chromosome with IDX=0 was removed.

Prior to the commit d64e710 this command fails with

$ bcftools isec -n 2 test.bcf test.bcf
bcftools: vcf.c:2223: bcf_hdr_seqnames: Assertion `tid<m' failed.
Aborted

with the fix applied, it works

$ bcftools-d64e710 isec -C test.bcf test.bcf
chr22   1   A   C   11

However, if there are no data records, hts_itr_query returns an error and the program fails

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > test2.vcf
$ bcftools view test2.vcf -o test2.bcf
$ bcftools index test2.bcf
$ bcftools isec -n 2 test2.bcf test2.bcf
[E::_reader_seek] Could not seek: chr22:1-17592186044415
bcftools: synced_bcf_reader.c:532: _reader_seek: Assertion `0' failed.
Aborted

This problem does not appear when the chromosome tid block has no gaps, i.e. starts with IDX=0

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=0>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > test3.vcf
$ bcftools view test3.vcf -o test3.bcf
$ bcftools index test3.bcf
$ bcftools isec -n 2 test3.bcf test3.bcf
daviesrob commented 1 year ago

The difference is due to the index on the file with no data records claiming that there is only one reference, while the one with data records says there are two. In the no-data case, trying to look up the index entry causes hts_itr_query() to return NULL thanks to this check on tid triggering this assertion in _reader_seek().

The initial count of the number of references is made in idx_calc_n_lvls_ids(), which does not check the headers for IDX= values. Once you start adding data records, the problem if fixed up by hts_idx_push() which expands the index if it gets a tid value outside the expected range.

I'm not sure if this counts as an indexing error, or if hts_itr_query() should be more generous when it's given a tid with no corresponding index entry.

As a side note: the idx->bidx[tid] == NULL test in hts_itr_query() only works on refs with no data because idx_read_core() makes bidx[] entries for everything, even if there's nothing in the index for a given reference. The same wouldn't be true for an index you've just built though - for them you only have a bidx entry if some data existed. At the moment trying to look up a reference with no data on an index you've just built would fail.