Closed tfwillems closed 7 years ago
This comes about from 5d114ebd by @jmarshall, maybe he'd like to comment on this to double check my comments are valid.
I'm unsure precisely of what sam_read1 is meant to return as there's actually no documentation on this whatsoever (I looked and found nothing)! I can't find anything that uses this return value either bar checking <= 0. However looking at the old code I guess we can surmise the intention was to return the size of the uncompressed data on disk, in which case it is correct.
The extra nul bytes (added to make CIGAR aligned on a 4-byte boundary and to work around bugs caused by aggressive auto-vectorisation causing unaligned accesses) aren't "read" so the return value is still correct, is it not?
What you appear to be wanting is for it to return the size of the object instead. This isn't either of the two alternatives (current vs your mod) as it also includes other internal fields too.
Do you have code which has been broken by this commit? If so I'd be curious to see what it is.
Hi James,
Thanks for the detailed reply! I definitely agree that the precise documentation regarding the return value is unclear (other than < 0 is a failure), but I thought the total number of read bytes was a reasonable assumption.
Unless I'm misunderstanding the bgzf_read function (whose 3rd argument specifies the number of bytes we aim to read), I think we can simply sum up the third arguments to this function to determine the total number of read bytes. This led to my comments above, which suggest that the total read memory is4 + block_len + c->l_extranul
bytes. Of course I'll happily defer to you on this, but that was my naive intuition.
I encountered this issue because I was attempting to determine the memory offset in the BAM file that preceded the most recently read alignment The curr_off field in hts_itr_t iterators nicely provides the current offset after reading the last alignment, but I was unable to find an easy way to determine the offset before reading the last alignment.
My approach for doing so was to keep track of i) a
, the return value of hts_itr_next(), which returns the value of bam_read1() for a successful read from a BAM file and ii) b
, the current offset from an hts_itr_t itr iterator using itr->curr_off . For 99.9% of my tests, it seemed that b-a
resulted in the correct offset I'm looking for. However, for a small subset of cases, using b-a
led to an error in bgzf_read and I thought this might be a potential explanation.
Is there a simpler way of accomplishing the above aim using the current api?
Thanks, Thomas
On second thought, I didn't notice that c->l_qname
changes value right before the last call to bgzf_read(), so I think my argument is no longer valid. As a result, there likely isn't a bug, so apologies for my oversight.
I would still be interested to hear your thoughts on how to determine the offset within a BAM file for the beginning of the most recently read alignment (and why the above may occasionally fail).
Best, Thomas
Indeed the amount of data read hasn't changed and nor has the read value of block_len, so the return value (block_len+4) is the same as before. All the other stuff is just obfuscation. with regards to return values. :-)
Regarding the offset, it's not a part of the code I know yet, although I suspect I'll be getting aquainted soon due to bugs reported there.
I'm not sure exactly what you're trying to do here, but have you considered using bgzf_tell()
? If you look at the source for bam_index()
you'll see (https://github.com/samtools/htslib/blob/7bfa010fd/sam.c#L480) that it uses bgzf_tell()
to get the positions to store in the index.
Getting the file position of a BAM read is complicated as the file is made up of a series of compressed blocks. You need both the location where the block starts in the file, and the offset of the record in the uncompressed version of that block. bgzf_tell()
returns these as a single value that can be passed
back to bgzf_seek() in order to reposition the file pointer at the same location.
Thanks for the suggestion. I indirectly was using bgzf_tell, as the hts_itr_t type stores the current value of this function in curr_off after reading every new alignment. However, I believe this only tells you the current offset after reading an alignment, and I was interested in computing the offset right before reading the alignment.
If I understand correctly, when the bam_read1 function successfully reads a BAM record, it should return the number of bytes read from the bgzf file. However, I believe that the return value is occasionally incorrect when the length of the query name is not divisible by 4.
Currently, there are four calls to bgzf_read in the bam_read1 function in sam.c:
In total, this should read
4 + 32 + c->l_qname + b->l_data - c->l_qname = 36 + b->l_data
bytes.As
b->l_data = block_len - 32 + c->l_extranul
, the total number of read bytes should be:However, upon success, the function currently returns
4 + block_len
. As a result, I believe the return value is incorrect ifc->l_extranul != 0
, which occurs whenc->l_qname%4 != 0
The fix is relatively simple in terms of having the function return
4 + block_len + c->l_extranul
.Best, Thomas