Downfall mitigation via simulated gathers

jkbonfield commented 1 year ago

Also incorporates the old #92.

Intel Downfall / GDS security flaw caused the issue of an updated microcode to mitigate the problem. Unfortunately the consequence of this is that SIMD gather instructions are now considerably slower, particularly AVX2. So much so that it turns out saving the SIMD register to memory, using traditional scalar code to do the memory fetches one at a time and then loading those scalar fetches back into the SIMD register, is considerably faster than the AVX2 gather microcode instruction. The mind boggles, but there we are. As it turns out I was already exploring these sort of things for AMD Zen4 too as that has (had) a far slower gather instruction than Intel.

We could do on-the-fly measurements and work out how far the gather is, so we know whether to use a simulated one or a real one, but it's muddled somewhat as the scalar code consists of many instructions which can get reordered by the compiler, so how efficient it is depends on the context. Similarly with a single instruction, the cost of it depends on how many other instructions we can do while waiting on the result - whether we can hide the instruction latency basically.

For now, I've disabled real gathers entirely, but they can be reenabled using -DUSE_GATHER cpp define.

The plot below shows the speed for an Intel CascadeLake CPU with the microcode patch applied, booted as-is and with the "mitigations=off" linux kernel parameter. (That turns off spectre and meltdown too, but I've also benchmarked the simulated code with/without mitigations and demonstrated it's not impacted.)

Some comments:

All values are rates in MB/s, so higher is better. Blue is encode, using the left Y axis. Green is decode, using the right Y axis.
master-on and master-off are mitigations=on and mitigations=off with the master branch, using actual SIMD gathers. SimG-on is simulated gathers, with mitigations=on (ie Downfall bug fixed).
We see for AVX2 encode speed is unaffected, but with AVX512 we do get an encode performance hit. Both have decode performance hits, but AVX2 more so.
This PR (SimG-on) rescues most of the speed lost for AVX2.
This PR rescues some of the speed for AVX512, but it's quite challenging, particularly AVX512 order-1 encode.
If AVX512 is slower than AVX2, we don't have to use it. We already had logic for that with AMD, so this has been extended. Hence AVX512 slow encodes are mostly mitigated by using the AVX2 encoder.

With that final mitigation in place, the performance hit by using simulated gathers over actual gathers on an Intel machine without the new microcode (master=off) is in the 10-20% bracket. That's probably acceptable, given the performance loss of not changing this code could be 65% (2.2x slower for AVX2 decode).

Also included below is the same system compiled using gcc 13.

I'll produce some AMD Zen4 (Genoa) plots soon.

jkbonfield commented 1 year ago

AMD Zen4 benchmarks:

Disabling mitigations basically has no impact here, so master-on and master-off doesn't matter. What's important is the difference between master branch and this branch.

AVX2 order-0: no real change.
AVX2 order-1: slight gain in encode speed, slight loss in decode speed. Nothing large.
AVX512 order-0: small encode speed lost (unused), and even smaller decode speed gain. Note however the AVX2 encoder is faster and will be used anyway.
AVX512 order-1: small gain to both encode (but still unused) and decode speeds.

Nothing major basically, but it's good to see we haven't harmed non-Intel systems in this work.

jkbonfield commented 1 year ago

This has been hanging around long enough that it's being problematic to test the newer fuzzing fixes, so I have a branch which merges this, various bits from master and other outstanding PRs, plus a new commit to fix another fuzz testing bug:

https://github.com/jkbonfield/htscodecs/tree/sim3-tidy-fuzz2

The last commit there fixes the bug AVX512 gather mem bounds bug we discussed in the office, but it's not viable to do as a PR to master as it requires this PR first. It's messy, but I'm hoping once all the other PRs get merged I can replace this PR with the new branch.

samtools / htscodecs

Downfall mitigation via simulated gathers #96