Closed: daviesrob closed this 1 year ago
Re the `fd_read()`/`fd_write()` change: as documented in `struct hFILE_backend` in `hfile_internal.h`, it is expected that these functions might not read or write their entire buffers.
`fd_write()` is always called via `hFILE_backend::write()`. There are two invocations of that in the hfile.c front-end code, and both of them implement such loops to ensure all data is written.
`fd_read()` is always called via `hFILE_backend::read()`. There are three invocations of that: the one in `hread2()` implements a loop to ensure the bulk of a large request is read; the one in `refill_buffer()` does not have a loop but records the number of bytes read, and its callers loop as necessary; the final one is in `multipart_read()`, which is itself an `hFILE_backend::read()` method that will be called again in a loop as necessary by the front-end functions.
So it should not be necessary to have this sort of loop inside the back-end functions. Did you encounter a bug without this commit? If so, that would seem to signal a bug in the front-end code instead.
Moreover, as mentioned in the `refill_buffer()` doc string, it only calls the back-end reader once and may not completely refill the buffer. It's implied :smile: that the back-end function is only going to do a system call once too. This was intentional, written this way with interactively typing a SAM file into a terminal in mind. In that scenario, IIRC, each `read(2)` system call would return one line after the user pressed Enter. Doing only one `read(2)` system call for each `sam_read1()` call means that `sam_read1()` does not block when reading from an interactive terminal, which seemed desirable at the time. (Similar considerations might apply for reading from the network.)
(I don't recall whether this was described in the commit messages of the time.)
Yes, you're right. It does work without the `fd_read()`/`fd_write()` commit. I possibly thought they were a problem because I hadn't fixed the `vcf_write()` type at that point, so the whole process still wasn't working, and I maybe didn't go far enough in the debugger.
I'll remove that commit.
Some changes to make reading and writing very long lines work better. It is a partial fix for #1539 - the overflow mentioned there is dealt with, but more work will be needed to make very long VCF records readable.

* Fix `bgzf_getline()` to avoid integer overflow.
* Update `tbx_parse1()` and `vcf_write()` so they can handle long lines (albeit only on 64-bit platforms).
* Loop on `read()` and `write()` in `fd_read()`/`fd_write()` to ensure they really have processed the expected amount of data. This is especially important on Linux, which has a limit on the amount of data that will be read or written in a single call.

This is mostly useful for `tabix`, which doesn't do much interpretation of its input. With these changes it will happily index and return long lines if it has enough memory.

This does not change the maximum size of a SAM record, as that is limited by `bam1_t::l_data`, which is an `int`. The situation for VCF is a bit more complicated, and it may be possible to get very slightly over 2 Gbytes, as various limits apply to different parts of the record. The size of the sample data, which is likely to be the biggest part, is currently restricted to 2 Gbytes by this check, and by the size of `bcf_fmt_t::p_off`, which is 31 bits. It might be possible to work around the `p_off` limit by abusing the ability of `bcf_fmt_t` to store edited data in a separate buffer - but doing so would be a bit hairy and would need a lot of thought and testing.