Add an inline version of bgzf_read for small reads.

jkbonfield commented 2 months ago

The bgzf_read function is long and not something that's likely to get inlined even if it was in the header.

However most of the time our calls request a small amount of data and they fit within the buffer we've read, so we offer a static inline to do the memcpy when we can, falling back to the long function when we cannot.

In terms of CPU time it's not much difference, but the key thing is that it's often CPU time saved in a main thread given the bulk of the decode is often threaded. An example of test_view -B -@16

develop:

real    0m48.158s
user    6m2.901s
sys     0m28.134s

real    0m48.730s
user    6m3.707s
sys     0m28.473s

real    0m48.653s
user    6m5.215s
sys     0m28.637s

This PR:

real    0m41.731s
user    5m59.780s
sys     0m30.393s

real    0m41.945s
user    6m0.367s
sys     0m30.426s

So we can see it's a consistent win when threading, potentially 10-15% faster throughput depending on work loads.

jkbonfield commented 2 months ago

/usr/bin/time was reporting ~810% CPU utilisation for develop and ~940% for this PR, demonstrating the reduced time in main thread.

Commenting out the sanity checks in bam_read1 for if (c->n_cigar > 0) onwards increases that CPU utilisation to 1060% (1120% if I get rid of the tag2cigar call), leading to 37s real time. This shows that there is room for improvement, as these sort of checks ought to be executed in the multi-threaded decode instead. We thread the bgzf_read and not the bam_read.

This is why multi-threaded decoding of SAM and SAM.gz is sometimes more performant than BAM, despite the string->integer conversion overheads, as SAM parsing is entirely threadable.

daviesrob commented 3 weeks ago

Added a couple of extra tests...

samtools / htslib

Add an inline version of bgzf_read for small reads. #1772