Closed daviesrob closed 4 months ago
If the reference portion is longer than 32-bit then we're seriously looking at rediculously shallow data! Having a dependency on 2GB+ worth of data for a single CRAM slice is going to be an I/O bottleneck. We'd probably be best just giving up and doing referenceless encoding and accepting the compression degredation as a good trade-off for speed.
I thought I'd already done this somewhere already. Probably in embedded references at a guess.
The test file I have is fairly shallow data, and the saving from having a reference is significant:
$ ls -1sh mistletoe/
total 163G
45G drVisAlbu1.merge.embed1.cram
34G drVisAlbu1.merge.embed2.cram
25G drVisAlbu1.merge.fasta_refs.cram
25G drVisAlbu1.merge.md5_refs.cram
37G drVisAlbu1.merge.no_ref.cram
While the Mistletoe reference has been chopped into segments of less than 2^31 bases, they were still long enough to cause an overflow in the load_ref_portion()
len
calculation. This was due to the line endings taking the total over INT_MAX. Fix be changing the data type ofstart
andend
tohts_pos_t
so the entire calculation is done on 64-bit values.Callers to load_ref_portion() are also updated where necessary to use
hts_pos_t
, making long references more likely to work should they ever be supported in CRAM. All callers of callers were already using 64-bit values due to earlier upgrades.There's a potential complication over calculating the MD5 checksums as the exported hts_md5_update() function takes an unsigned long for the length. On 64-bit platforms with 32-bit unsigned long (i.e. Windows) it is necessary to add a loop if the reference is a very long one. On platforms with 64-bit long a single call is still used, and the loop should be optimised out.
Fixes #1734 (CRAM load_ref_portion() fails on some Mistletoe references)