Closed by jkbonfield 1 year ago
Changing tactic from yesterday's meeting.
In light of the Downfall mitigation, I've decided it's probably easiest to fold this into the other PR, which will be incoming at some stage. I need to test both together to be sure that what's optimal there is still optimal after this gets merged.
Spoiler for Downfall microcode changes: it totally hammers AVX2 and AVX512 speeds, by varying degrees (up to 3x slower in some cases). Some of it can be mitigated by replacing real gathers with scalar equivalents, but we're still typically 10-30% slower. A big improvement, but we may need two variants :/
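To illustrate what "replacing real gathers with scalar equivalents" can look like, here is a minimal sketch. It is not the code from this PR: the helper names (`gather8_scalar`, `gather8_hw`, `gather8_emulated`) are hypothetical, and it assumes a table of 32-bit values indexed by eight 32-bit lanes. The AVX2 parts only compile when the compiler defines `__AVX2__` (e.g. with `-mavx2`).

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: plain scalar loads, out[i] = table[idx[i]].
 * No gather instruction is issued, so the Downfall microcode penalty
 * on gathers does not apply to this path. */
static inline void gather8_scalar(const uint32_t *table,
                                  const uint32_t idx[8],
                                  uint32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i]];
}

#if defined(__AVX2__)
/* The "real" gather: one vpgatherdd instruction, scale of 4 bytes. */
static inline __m256i gather8_hw(const uint32_t *table, __m256i idx) {
    return _mm256_i32gather_epi32((const int *)table, idx, 4);
}

/* Scalar-emulated gather with the same vector interface: spill the
 * indices, do eight scalar loads, then reload the results as a vector. */
static inline __m256i gather8_emulated(const uint32_t *table, __m256i idx) {
    uint32_t i[8], v[8];
    _mm256_storeu_si256((__m256i *)i, idx);
    gather8_scalar(table, i, v);
    return _mm256_loadu_si256((const __m256i *)v);
}
#endif
```

Because `gather8_hw` and `gather8_emulated` share a signature, the choice between them could be made per build or per CPU, which is one way the "two variants" idea might be structured. Which one is faster on a given machine would need benchmarking.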
Closing in favour of a pending Downfall patch, which will incorporate these changes too, as I needed to benchmark them in conjunction with the reordered code.
The main speed increase here is to the AVX512 implementation, specifically improving gathers on systems with long delays, but there have also been some tweaks to the AVX2 encoder.
The impetus for this patch was coping better on AMD Zen4. With these changes it's no longer true that the AVX2 decoder outperforms the AVX512 one (although the AVX2 encoder still does, as I haven't investigated that much yet). The exception is clang (especially clang13), where the AVX2 decoder runs very fast, far faster than the code gcc produces. This puts it ahead of the AVX512 code under clang.
I've yet to work out how clang is getting this speed, and how we can exploit that for other compilers. It's likely instruction reordering or choosing different intrinsic implementations.