samtools / htslib

C library for high-throughput sequencing data formats

A minor speed improvement to VCF parsing by 5-9% #1513

Closed · jkbonfield closed this 1 year ago

jkbonfield commented 1 year ago

Tested on 1000 Genomes data (see https://github.com/brentp/vcf-bench), this gives the following timings over 3 separate trials (a rough sketch of the kind of read loop being timed is at the end of this comment):

OLD

clang13 -O2 31.60 31.61 31.56
gcc12   -O2 29.32 29.36 29.33
gcc12   -O3 30.90 30.87 30.84

NEW

clang13 -O2 29.30 29.30 29.62
gcc12   -O2 27.89 27.96 27.92
gcc12   -O3 28.25 28.19 28.20

The profile of NEW2 shows:

    51.41%  read     libhts.so            [.] vcf_parse_format
    21.31%  read     libhts.so            [.] bgzf_getline
     9.50%  read     libhts.so            [.] bcf_enc_vint
     9.42%  read     libdeflate.so.0      [.] deflate_decompress_default
     3.22%  read     libdeflate.so.0      [.] crc32_pclmul_avx

So that's 7.3%, 4.8% and 8.6% faster for the three compiler tests, or roughly double that within vcf_parse_format itself, given that function accounts for ~50% of the total CPU.
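Spelling out the "double" for the clang13 -O2 case (a rough check, assuming essentially all of the saving comes from vcf_parse_format at ~50% of the total CPU):

```
total:    31.60 -> 29.30, i.e. 2.30 saved, or ~7.3% of the whole run
function: old vcf_parse_format cost ~= 0.5 * 31.60 ~= 15.8,
          so 2.30 / 15.8 ~= 14.6% faster within the function itself
```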

As a bonus, the code is a little less convoluted too.
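For context, the benchmark is essentially just decoding every record. A minimal sketch of such a read loop using the public htslib API (this is an approximation, not the actual vcf-bench source, whose binary shows up as `read` in the profile above):

```c
/* Minimal VCF read loop, approximating what the benchmark times.
 * Sketch only: uses the public htslib API, not the vcf-bench code.
 * Build (paths may differ): gcc -O2 readvcf.c -lhts -o readvcf */
#include <stdio.h>
#include <stdlib.h>
#include <htslib/vcf.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s in.vcf.gz\n", argv[0]);
        return EXIT_FAILURE;
    }

    htsFile *fp = hts_open(argv[1], "r");   // open VCF/BCF, any compression
    if (!fp) return EXIT_FAILURE;

    bcf_hdr_t *hdr = bcf_hdr_read(fp);      // parse the header
    bcf1_t *rec = bcf_init();
    long nrec = 0;

    /* For text VCF input, bcf_read() goes through vcf_parse(), which is
     * where vcf_parse_format() shows up in the profile. */
    while (bcf_read(fp, hdr, rec) == 0)
        nrec++;

    printf("%ld records\n", nrec);

    bcf_destroy(rec);
    bcf_hdr_destroy(hdr);
    hts_close(fp);
    return EXIT_SUCCESS;
}
```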

pd3 commented 1 year ago

At first glance it is not clear why the new code is faster. Can you please add some explanation of why that is?

jkbonfield commented 1 year ago

I know. I wouldn't have thought to try anything here were it not for the profile showing this to be the dominant bit of code, so I just experimented with different ways of writing it. It's not particularly well considered, which is why I tried a few compilers: to check that I'm not (at least not obviously) over-optimising for one case.

If I had to hand-wave an explanation, I'd say it's the flow control: the old code jumped out of a for/switch combination with a goto, which may hinder some of the compiler's standard optimisation analysis. That's pure speculation though.
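Purely to illustrate the control-flow shape being described (this is not the actual vcf_parse_format() code), compare a scanner that leaves a for/switch via goto with one whose exit test lives in the loop condition:

```c
/* Hypothetical example only: NOT the real htslib code, just the two
 * control-flow shapes discussed above. */

/* OLD-style shape: a switch inside an endless for loop, exited via goto. */
static const char *scan_old(const char *p) {
    for (;;) {
        switch (*p) {
        case '\0': case '\t': case ':':
            goto done;          /* jump out of both the switch and the loop */
        default:
            p++;
        }
    }
done:
    return p;
}

/* NEW-style shape: the terminators are tested directly in the loop
 * condition, a simpler structure for the optimiser to analyse. */
static const char *scan_new(const char *p) {
    while (*p != '\0' && *p != '\t' && *p != ':')
        p++;
    return p;
}
```

Whether that structural difference actually explains the measured gap is, as said above, speculation.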

daviesrob commented 1 year ago

Unfortunately, this may have been tuned for the wrong architecture. On our older machines it is about 5% faster on my test file, but on the newer ones it's about a second slower. It's possible that more adjustment may get it quicker on both, but for real gains we probably need to make more radical changes to format parsing.