Closed jkbonfield closed 1 year ago
AMD Zen4 benchmarks:
Disabling mitigations basically has no impact here, so master-on and master-off doesn't matter. What's important is the difference between master branch and this branch.
Nothing major basically, but it's good to see we haven't harmed non-Intel systems in this work.
This has been hanging around long enough that it's being problematic to test the newer fuzzing fixes, so I have a branch which merges this, various bits from master and other outstanding PRs, plus a new commit to fix another fuzz testing bug:
https://github.com/jkbonfield/htscodecs/tree/sim3-tidy-fuzz2
The last commit there fixes the bug AVX512 gather mem bounds bug we discussed in the office, but it's not viable to do as a PR to master as it requires this PR first. It's messy, but I'm hoping once all the other PRs get merged I can replace this PR with the new branch.
Also incorporates the old #92.
Intel Downfall / GDS security flaw caused the issue of an updated microcode to mitigate the problem. Unfortunately the consequence of this is that SIMD gather instructions are now considerably slower, particularly AVX2. So much so that it turns out saving the SIMD register to memory, using traditional scalar code to do the memory fetches one at a time and then loading those scalar fetches back into the SIMD register, is considerably faster than the AVX2 gather microcode instruction. The mind boggles, but there we are. As it turns out I was already exploring these sort of things for AMD Zen4 too as that has (had) a far slower gather instruction than Intel.
We could do on-the-fly measurements and work out how far the gather is, so we know whether to use a simulated one or a real one, but it's muddled somewhat as the scalar code consists of many instructions which can get reordered by the compiler, so how efficient it is depends on the context. Similarly with a single instruction, the cost of it depends on how many other instructions we can do while waiting on the result - whether we can hide the instruction latency basically.
For now, I've disabled real gathers entirely, but they can be reenabled using
-DUSE_GATHER
cpp define.The plot below shows the speed for an Intel CascadeLake CPU with the microcode patch applied, booted as-is and with the "mitigations=off" linux kernel parameter. (That turns off spectre and meltdown too, but I've also benchmarked the simulated code with/without mitigations and demonstrated it's not impacted.)
Some comments:
With that final mitigation in place, the performance hit by using simulated gathers over actual gathers on an Intel machine without the new microcode (master=off) is in the 10-20% bracket. That's probably acceptable, given the performance loss of not changing this code could be 65% (2.2x slower for AVX2 decode).
Also included below is the same system compiled using gcc 13.
I'll produce some AMD Zen4 (Genoa) plots soon.