Closed jkbonfield closed 11 months ago
Rebased and a massive squashathon.
Retested dev vs this on a couple platforms with 3 compilers and 3 datasets.
One thing that needs testing is other systems, eg MacOS. There is some assumption on the C library being efficient, particularly things like strchr
. It's true with glibc, but not sure if this is universally true.
Merged including the separate commit that reverts the vcf_parse_info()
changes - I think it's useful to keep the history there should we want to revisit that area later.
This is an amalgamation of many small optimisations which jointly can have a significant effect.
I've benchmarked it on data that is very heavy on INFO fields, very heavy on FORMAT fields (many samples), and more "normal" single sample data. Also tested it with the system gcc (7), gcc 13, and clang 13. All with default optimisation levels (-O2).
With gcc7 it's 26 to 75% quicker, gcc13 is 23-53% quicker and clang13 is 12-57% quicker. Depending on how modern the host is and whether it's INFO or FORMAT dominated.
We may wish to test these same files on a Mac as the speed could well be very dependent on the efficiency of the C library. Ask me for details, or use your own files.
Small tests, but representative of a variety of data types. Obviously it'll depend on what contents there are in the files.
The changes are fairly non-invasive and there is no fundamental change of algorithm here. Just careful optimisation of what is there.
Larger scale rewrites may offer more potential, such as merging the find-max + alloc + fill-buffer nature of FORMAT handling. It may be better to store integer fields in a 4-byte fixed width instead of the smallest width, avoiding the max calculation stage. The extra memory isn't that bad I'd think. Strings are a bit trickier, but if there's genuinely differences in string length then it's likely that storing packed strings with an index into the buffer is smaller than expanding to their maximum size. Similar for other arrays. These may well have a minimal impact on memory usage (maybe even shrink it in some cases), while paving the way for a single-pass parser. It's a lot more work to do though so it's parked for now.
Even better would be to simply go with VCF API v2 that doesn't do lots of parsing in the first place, and has a more API driven query that processes just-in-time and the smallest amount we need. We spend a lot of time in hashing for example, which is irrelevant if we don't actually manipulate that key. Similar to #1081. That had a huge huge speed up (3-4x) for some operations, but was essentially vetoed if I recall. However if we're breaking the API then it's clearly the way to go. (That PR didn't break the API though provided people used the proper functions instead of just grubbing around internally without first calling unpack).
For now this is many commits as it can make it easier to review. In particular the first commit puts blocks around the main function (which speeds it up by itself) and moves most veriables to become block-local, while the second is simply lifting those blocks into smaller functions for better code structuring (and no other changes). From then on it's more piecemeal tweaks.
A summary of changes:
vcf_parse_format
into 7 separate functions, for ease of understanding.bcf_hdr_nsamples
). Shouldn't really change anything, but it's good practice and it's possible the compilers can't optimise as well as we'd think due to potential aliasing.kputc
inbcf_enc_vint
andbcf_enc_size
bcf_enc_vint
array handling to encourage SIMD generation by the compilersstrchr
ormemchr
(or for writingmemset
). These are typically implemented in 32-bit or 64-bit words and considerably faster than the naive approaches we do. This is the bit we may wish to test on other OSes, as I have my own alternatives if needs be (see commit https://github.com/samtools/htslib/pull/1644/commits/1db45435c5ff167e3795c429d6a2f019f85ac0b8)strcmp(t, "GT")
with manual check. No effect on modern compilers, but old compilers do actually call strcmp here. Similarlystrcmp(p, ".")
and strcmp(key, "END") are rewritten withmemcmp
as the compilers normally inline that for short fixed strings. (We know it'll be nul terminated already due to kstrtok.)I've tested it via the bcftools test harness too, which initially found a bug I'd missed.
TODO: VCF writing speed. That'll be a new PR, but I can already see it's got lots of inefficient use of
kputc
everywhere.