samtools / htslib

C library for high-throughput sequencing data formats
Other
784 stars 447 forks source link

Fix arithmetic overflow in load_ref_portion() on very long refs #1738

Closed daviesrob closed 4 months ago

daviesrob commented 4 months ago

While the Mistletoe reference has been chopped into segments of less than 2^31 bases, they were still long enough to cause an overflow in the load_ref_portion() len calculation. This was due to the line endings taking the total over INT_MAX. Fix be changing the data type of start and end to hts_pos_t so the entire calculation is done on 64-bit values.

Callers to load_ref_portion() are also updated where necessary to use hts_pos_t, making long references more likely to work should they ever be supported in CRAM. All callers of callers were already using 64-bit values due to earlier upgrades.

There's a potential complication over calculating the MD5 checksums as the exported hts_md5_update() function takes an unsigned long for the length. On 64-bit platforms with 32-bit unsigned long (i.e. Windows) it is necessary to add a loop if the reference is a very long one. On platforms with 64-bit long a single call is still used, and the loop should be optimised out.

Fixes #1734 (CRAM load_ref_portion() fails on some Mistletoe references)

jkbonfield commented 4 months ago

If the reference portion is longer than 32-bit then we're seriously looking at rediculously shallow data! Having a dependency on 2GB+ worth of data for a single CRAM slice is going to be an I/O bottleneck. We'd probably be best just giving up and doing referenceless encoding and accepting the compression degredation as a good trade-off for speed.

I thought I'd already done this somewhere already. Probably in embedded references at a guess.

daviesrob commented 4 months ago

The test file I have is fairly shallow data, and the saving from having a reference is significant:

$ ls -1sh mistletoe/
total 163G
45G drVisAlbu1.merge.embed1.cram
34G drVisAlbu1.merge.embed2.cram
25G drVisAlbu1.merge.fasta_refs.cram
25G drVisAlbu1.merge.md5_refs.cram
37G drVisAlbu1.merge.no_ref.cram