samtools / htslib

C library for high-throughput sequencing data formats
Other
783 stars 447 forks source link

Using `bgzf_read_small()` breaks subsequent `bgzf_useek()` calls #1798

Open jmarshall opened 3 days ago

jmarshall commented 3 days ago

PR #1772 added a simplified inline version of bgzf_read() that uses inline code when the request can be satisfied directly from the buffer, otherwise punts to the real bgzf_read(). Very laudable. However I encountered seek problems when using it.

I rederived the inline function by starting with the code of the full bgzf_read(), assuming the invariant that length < fp‑>block_length - fp‑>block_offset, and simplifying the code accordingly. I ended up with a function that is fairly similar to bgzf_read_small() as added to htslib/bgzf.h but with some additions:

static inline ssize_t bgzf_read_small(BGZF *fp, void *data, size_t length) {
     // A block length of 0 implies current block isn't loaded (see
     // bgzf_seek_common).  That gives negative available so careful on types
     if ((ssize_t)length < fp->block_length - fp->block_offset) {
         // Short cut the common and easy mode
         memcpy((uint8_t *)data,
                (uint8_t *)fp->uncompressed_block + fp->block_offset,
                length);
         fp->block_offset += length;
+
+        // For raw gzip streams this avoids short reads.
+        if (fp->block_offset == fp->block_length) {
+            fp->block_address = bgzf_htell(fp);
+            fp->block_offset = fp->block_length = 0;
+        }
+
+        fp->uncompressed_address += length;
+
         return length;
     } else {
         return bgzf_read(fp, data, length);
     }
 }

Because bgzf_read_small() does not currently update fp‑>uncompressed_address, subsequent bgzf_useek() calls may jump to the wrong location. And probably other functions use fp‑>uncompressed_address too and are affected.

The if block looks harder to deal with, because it calls bgzf_htell() which is private within bgzf.c. However it turns out that the invariant implies that this if will never be true, so in fact I should have simplified it away to nothing too. Phew.

So bgzf_read_small() just needs fp->uncompressed_address += length; added to it to make it equivalent to bgzf_read().

(I have not analysed bgzf_write_small() to see if it has any similar infelicities.)

daviesrob commented 2 days ago

I'd checked that the fp->block_address update wasn't needed while going over the PR, but evidently missed fp->uncompressed_address. What did you do to spot this? As the test harness currently passes with the incorrect change, it could do with extending to cover this problem.

jkbonfield commented 5 hours ago

Oops. I was sure I'd got both of those incremented, but I confess I had many alternative versions while benchmarking and I guess I lost something along the way. Sorry.

I'll try and get a test that fails before applying the fix. Thanks for reporting it.

jmarshall commented 5 hours ago

I spotted this while changing fai_retrieve() from bgzf_useek+bgzf_getcxLOTS to bgzf_useek + bgzf_read_smallxSOME — see PR #1799. As that PR mixes bgzf_useek and bgzf_read_small, with it applied there are already faidx test cases that fail indirectly due to this problem. But it would be good to test it directly too, rather than indirectly. The sequence of events is basically:

bgzf_read_small(…, buf1, 5);  // read five characters
bgzf_useek(…, 10); // seek to offset 10
bgzf_read_small(…, buf2, 3); // read another few characters; this one could be bgzf_read()
assert(buf2 is the three characters at offset 10…12);

This direct test case will fail because it will have read three characters from a different offset.

jkbonfield commented 2 hours ago

I was already working on something similar in test_bgzf.c which I've now added to verify I could trigger the bug, and then also applied the trivial one line fix. Thanks.