samtools / htslib

C library for high-throughput sequencing data formats

Improve support for very long input lines (> 2Gbyte) #1542

Closed daviesrob closed 1 year ago

daviesrob commented 1 year ago

Some changes to make reading and writing very long lines work better. It is a partial fix for #1539: the overflow mentioned there is dealt with, but more work will be needed to make very long VCF records readable.

This is mostly useful for tabix, which doesn't do much interpretation of its input. With these changes it will happily index and return long lines if it has enough memory.

This does not change the maximum size of a SAM record, as that is limited by bam1_t::l_data, which is an int. The situation for VCF is a bit more complicated, and it may be possible to get very slightly over 2Gbytes, as various limits apply to different parts of the record. The size of the sample data, which is likely to be the biggest part, is currently restricted to 2Gbytes by this check, and by the size of bcf_fmt_t::p_off, which is 31 bits. It might be possible to work around the p_off limit by abusing the ability of bcf_fmt_t to store edited data in a separate buffer, but doing so would be a bit hairy and would need a lot of thought and testing.

jmarshall commented 1 year ago

Re the fd_read()/fd_write() change: as documented in struct hFILE_backend in hfile_internal.h, it is expected that these functions might not read or write their entire buffers.

For fd_write(), it is always called via hFILE_backend::write(). There are two invocations of that in the hfile.c front-end code, and both of them implement such loops to ensure all data is written.

For fd_read(), it is always called via hFILE_backend::read(). There are three invocations of that: the one in hread2() implements a loop to ensure the bulk of a large request is read; the one in refill_buffer() does not have a loop, but records the number of bytes read, and its callers loop as necessary; the final one is in multipart_read(), which is itself an hFILE_backend::read() method that will be called again in a loop as necessary by the front-end functions.

So it should not be necessary to have this sort of loop inside the back-end functions. Did you encounter a bug without this commit? If so, that would seem to signal a bug in the front-end code instead.

Moreover, as mentioned in the refill_buffer() doc string, it only calls the back-end reader once and may not completely refill the buffer. It's implied :smile: that the back-end function is only going to do a system call once too. This was intentional, written this way with interactively typing a SAM file into a terminal in mind. In that scenario, IIRC each read(2) system call would return one line after the user pressed Enter. Doing only one read(2) system call for each sam_read1() call means that sam_read1() does not block when reading from an interactive terminal, which seemed desirable at the time. (Similar considerations might apply for reading from the network.)

(I don't recall whether this was described in the commit messages of the time.)

daviesrob commented 1 year ago

Yes, you're right. It does work without the fd_read()/fd_write() commit. I possibly thought they were a problem because I hadn't fixed the vcf_write() type at that point, so the whole process still wasn't working and I maybe didn't go far enough in the debugger.

I'll remove that commit.