With AVX512 we can use the embedded broadcast option to replicate constants from memory, so each constant only needs to be stored once as a scalar rather than as a full-width vector. This reduces the .data size quite a lot. All the constants are at least 32 bits in size, so they're expanded "for free" during the load without lengthening the critical path.
I've switched all the AVX512 routines to scalar broadcast, plus a couple of minor optimizations: keeping shuffle masks in registers instead of reloading them each time, using VPTERNLOG for 3-way bitwise logic, and generally trying to use instructions with shorter encodings.
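For reviewers less familiar with VPTERNLOG: its imm8 is an 8-entry truth table indexed by one bit from each of the three operands. The sketch below is a scalar C model, not code from this patch. The helper names (`ternlog_imm8`, `ternlog32`, `xor3`) are hypothetical and exist only for illustration. It shows how the immediate for a 3-way XOR comes out as 0x96, which is why one VPTERNLOG can replace two XOR instructions.

```c
#include <assert.h>
#include <stdint.h>

/* A three-input boolean function on single bits (each argument is 0 or 1).
 * Hypothetical helper type, for illustration only. */
typedef unsigned (*bool3_fn)(unsigned a, unsigned b, unsigned c);

/* Build a VPTERNLOG imm8: bit ((a << 2) | (b << 1) | c) of the immediate
 * holds f(a, b, c), where a is the destination operand's bit and b, c are
 * the bits of the first and second source operands. */
static uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1u;
        unsigned b = (idx >> 1) & 1u;
        unsigned c = idx & 1u;
        imm |= (uint8_t)((f(a, b, c) & 1u) << idx);
    }
    return imm;
}

/* Scalar model of what VPTERNLOGD computes for one 32-bit lane:
 * each result bit is looked up in imm using the three input bits. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (unsigned bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << bit;
    }
    return r;
}

/* The 3-way XOR used as the example function. */
static unsigned xor3(unsigned a, unsigned b, unsigned c)
{
    return a ^ b ^ c;
}

/* Sanity check: the derived immediate reproduces a ^ b ^ c on a full lane. */
static void ternlog_selfcheck(void)
{
    uint8_t imm = ternlog_imm8(xor3);
    assert(ternlog32(0xDEADBEEFu, 0x01234567u, 0x89ABCDEFu, imm)
           == (0xDEADBEEFu ^ 0x01234567u ^ 0x89ABCDEFu));
}
```

Calling `ternlog_imm8(xor3)` yields 0x96. Other 3-input functions can be derived the same way by swapping in a different callback.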
Tested only on Rocket Lake under Linux; passes "make test" and test_checks.sh on my machine.