Closed jkbonfield closed 1 year ago
At first glance it is not clear why the new code is faster. Can you please add some explanation of why that is?
I know. I wouldn't have thought to try different things were it not for a profile showing this to be the dominant bit of code. So I just randomly tried different ways of writing the code. Not particularly well considered, but that's why I tried a few compilers to see that I'm not (at least not obviously) over-optimising for one case.
If I had to hand-wavingly explain it, I'd say it's because of the flow control with a goto outside of a for-switch combination, which may perhaps harm some of the standard optimisation analysis. That's pure speculation though.
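For readers unfamiliar with the pattern, here is a minimal sketch of the kind of control flow being discussed. It is not the actual parsing code from this PR: the function names and the field-counting task are hypothetical, chosen only to show a `goto` leaving a for-switch next to an equivalent plain loop. Whether either form is faster depends on the compiler and machine, as noted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: count ':'-separated keys in a FORMAT-like
 * column, stopping at a tab or the end of the string.
 * Version A uses a for-switch with a goto that leaves both constructs. */
static int count_fields_goto(const char *s)
{
    assert(s != NULL);
    int n = 1;
    for (const char *p = s; ; p++) {
        switch (*p) {
        case ':':
            n++;
            break;          /* leaves the switch only */
        case '\t':
        case '\0':
            goto done;      /* leaves the switch and the loop */
        default:
            break;
        }
    }
done:
    return n;
}

/* Version B: same result, but the exit test lives in the loop
 * condition, so there is no jump out of a nested switch. */
static int count_fields_plain(const char *s)
{
    assert(s != NULL);
    int n = 1;
    for (const char *p = s; *p != '\0' && *p != '\t'; p++)
        if (*p == ':')
            n++;
    return n;
}
```

Both functions return the same count for any input; only the shape of the control flow differs, which is the sort of thing worth checking across several compilers.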
Unfortunately, this may have been tuned for the wrong architecture. On our older machines it is about 5% faster on my test file, but on the newer ones it's about a second slower. It's possible that more adjustment may get it quicker on both, but for real gains we probably need to make more radical changes to format parsing.
Tested on 1000 Genomes data (see https://github.com/brentp/vcf-bench), this gives the following over 3 separate trials:
OLD
NEW
The profile of NEW2 shows:
So that's 7.3, 4.8 and 8.6% faster for the 3 compiler tests, or about double that for this specific function (given it's ~50% of the total CPU).
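To spell out the "about double" estimate: this assumes the percentages are reductions in total run time $r$, that the function accounts for a fraction $f \approx 0.5$ of that time, and that all of the saving comes from the function itself. The symbols $r$, $f$ and $t$ are introduced here only for illustration.

$$
\frac{\Delta t_{\text{func}}}{t_{\text{func}}} = \frac{r \, t_{\text{total}}}{f \, t_{\text{total}}} = \frac{r}{f},
\qquad
\frac{7.3\%}{0.5} = 14.6\%, \quad
\frac{4.8\%}{0.5} = 9.6\%, \quad
\frac{8.6\%}{0.5} = 17.2\%.
$$

These are rough figures, since the 50% share is itself approximate.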
As a bonus, the code is a little less convoluted too.