samtools / htslib

C library for high-throughput sequencing data formats
Other
803 stars 446 forks source link

Check interval start to avoid overflowing bin numbers #1774

Closed jmarshall closed 5 months ago

jmarshall commented 5 months ago

As noted on https://github.com/samtools/samtools/pull/2032 (which includes test queries exhibiting the problem):

Additionally with this data file samtools goes into an infinite(?) loop when given [a particular location / query interval]: […]

This can be traced to the do … while loop in hts_itr_query(), which loops perhaps forever through negative bin numbers. Possibly this code should check for a beg value massively beyond the maximum position covered by the index's bins and sidestep most of its processing.

This PR checks start positions of query intervals against the maximum position representable in the index's geometry, to avoid negative bin numbers and the resulting infinite loops in the do...while loop. It's effectively the beg equivalent of #1595.

I considered implementing the “massively beyond” bit by e.g. adding 256 as various parts of the code do for somewhat different reasons. However tracing the guarded code in hts_itr_query() and hts_itr_multi_bam() (and looking at the checks for data getting into the index at all) suggests that the result would be a finished / unchanged iterator for beg immediately past maxpos anyway — so having the bound exactly at maxpos is fine.

Introduces hts_bin_maxpos() and hts_idx_maxpos(), and uses them wherever the maxpos calculation appears. I've left the latter private, at least for now.

This also changes the existing end checks to <= as end is exclusive -- note it is used as end-1 in the code guarded by the checks. In practice this probably won't make much difference to anything.

jkbonfield commented 5 months ago

Thanks John.