Speed up the VCF parser

jkbonfield commented 12 months ago

This is an amalgamation of many small optimisations which jointly can have a significant effect.

I've benchmarked it on data that is very heavy on INFO fields, very heavy on FORMAT fields (many samples), and more "normal" single sample data. Also tested it with the system gcc (7), gcc 13, and clang 13. All with default optimisation levels (-O2).

With gcc7 it's 26 to 75% quicker, gcc13 is 23-53% quicker and clang13 is 12-57% quicker. Depending on how modern the host is and whether it's INFO or FORMAT dominated.

A chunk of GIAB truth set.  Single sample:

        Seconds for dev / PR
        seq4d           seq4-head1
gcc7    12.62 / 7.17    10.13 / 6.49
gcc13   10.65 / 6.96     8.65 / 6.29
clang13 11.87 / 7.55     8.61 / 6.49

gnomad-g1k.10k.vcf (many-samples):

        seq4d           seq4-head1
gcc7    10.62 / 8.46    8.29 / 6.58
gcc13   11.11 / 8.80    8.21 / 6.68
clang13 11.91 / 9.23    8.14 / 6.94

gnomad.sites.50k.vcf - no samples but very heavy in INFO field:

        seq4d           seq4-head1
gcc7    8.27 / 4.83     5.91 / 3.84
gcc13   6.97 / 4.76     4.99 / 3.84
clang13 6.43 / 5.06     4.49 / 4.01

We may wish to test these same files on a Mac as the speed could well be very dependent on the efficiency of the C library. Ask me for details, or use your own files.

Small tests, but representative of a variety of data types. Obviously it'll depend on what contents there are in the files.

The changes are fairly non-invasive and there is no fundamental change of algorithm here. Just careful optimisation of what is there.

Larger scale rewrites may offer more potential, such as merging the find-max + alloc + fill-buffer nature of FORMAT handling. It may be better to store integer fields in a 4-byte fixed width instead of the smallest width, avoiding the max calculation stage. The extra memory isn't that bad I'd think. Strings are a bit trickier, but if there's genuinely differences in string length then it's likely that storing packed strings with an index into the buffer is smaller than expanding to their maximum size. Similar for other arrays. These may well have a minimal impact on memory usage (maybe even shrink it in some cases), while paving the way for a single-pass parser. It's a lot more work to do though so it's parked for now.

Even better would be to simply go with VCF API v2 that doesn't do lots of parsing in the first place, and has a more API driven query that processes just-in-time and the smallest amount we need. We spend a lot of time in hashing for example, which is irrelevant if we don't actually manipulate that key. Similar to #1081. That had a huge huge speed up (3-4x) for some operations, but was essentially vetoed if I recall. However if we're breaking the API then it's clearly the way to go. (That PR didn't break the API though provided people used the proper functions instead of just grubbing around internally without first calling unpack).

For now this is many commits as it can make it easier to review. In particular the first commit puts blocks around the main function (which speeds it up by itself) and moves most veriables to become block-local, while the second is simply lifting those blocks into smaller functions for better code structuring (and no other changes). From then on it's more piecemeal tweaks.

A summary of changes:

Split up vcf_parse_format into 7 separate functions, for ease of understanding.
Hints for const and caching of lookups (eg bcf_hdr_nsamples). Shouldn't really change anything, but it's good practice and it's possible the compilers can't optimise as well as we'd think due to potential aliasing.
Removal of str2uint calls for simple single-digit GTs
Remove many calls of kputc in bcf_enc_vint and bcf_enc_size
Unrolled bcf_enc_vint array handling to encourage SIMD generation by the compilers
Replace manual byte-by-byte loops with strchr or memchr (or for writing memset). These are typically implemented in 32-bit or 64-bit words and considerably faster than the naive approaches we do. This is the bit we may wish to test on other OSes, as I have my own alternatives if needs be (see commit https://github.com/samtools/htslib/pull/1644/commits/1db45435c5ff167e3795c429d6a2f019f85ac0b8)
Replace strcmp(t, "GT") with manual check. No effect on modern compilers, but old compilers do actually call strcmp here. Similarly strcmp(p, ".") and strcmp(key, "END") are rewritten with memcmp as the compilers normally inline that for short fixed strings. (We know it'll be nul terminated already due to kstrtok.)
Speed up max value parsing via lookup table and/or range checking.
Plus more comments and documentation (it could do with a lot more still)

I've tested it via the bcftools test harness too, which initially found a bug I'd missed.

TODO: VCF writing speed. That'll be a new PR, but I can already see it's got lots of inefficient use of kputc everywhere.

jkbonfield commented 11 months ago

Rebased and a massive squashathon.

Retested dev vs this on a couple platforms with 3 compilers and 3 datasets.

One thing that needs testing is other systems, eg MacOS. There is some assumption on the C library being efficient, particularly things like strchr. It's true with glibc, but not sure if this is universally true.

daviesrob commented 11 months ago

Merged including the separate commit that reverts the vcf_parse_info() changes - I think it's useful to keep the history there should we want to revisit that area later.

samtools / htslib

Speed up the VCF parser #1644