Closed animetosho closed 2 years ago
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
Looks good to me! I wonder if at this point we shouldn't be using a vtable-like approach and do dynamic dispatch depending on the cpu arch...
(as a side note, I optimized this on a Threadripper CPU, and you certainly can feel the slowness of BMI2 there...)
Dynamic CPU dispatch would indeed be nice. It's mostly SSE vs AVX+BMI2, with the PEXT exclusion being specific to some AMD CPUs.
BMI2 is generally fine on AMD CPUs - it's just that PDEP/PEXT instructions were microcoded before Zen3, so you need to be careful with those. BMI2 does provide better variable shifts, though AMD CPUs have generally been good at that, even without BMI2.
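To make the dispatch idea above concrete, here is a minimal sketch of how a runtime selection could look. It is not code from this pull request: `CpuInfo`, `Kernel`, `HasSlowPext`, `SelectKernel` and `DetectCpu` are hypothetical names, and the "AMD with family below 0x19 (Zen3)" test is my assumption for how "PEXT was microcoded before Zen3" could be turned into a CPUID check. Detection relies on the GCC/Clang `<cpuid.h>` helpers and `__builtin_cpu_supports`; on other compilers or architectures the sketch simply falls back to the SSE path.

```cpp
#include <cstdint>
#include <cstring>
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical summary of the CPU properties the dispatch cares about.
struct CpuInfo {
  char vendor[13];  // CPUID vendor string, NUL-terminated
  unsigned family;  // display family (base + extended)
  bool avx2;
  bool bmi2;
};

// Hypothetical kernel identifiers; a real build would map each one to a
// table of function pointers (the "vtable-like" part).
enum class Kernel { kSse, kAvx2, kAvx2Pext };

// Assumption: AMD parts before family 0x19 (Zen3) have microcoded PDEP/PEXT.
inline bool HasSlowPext(const CpuInfo& cpu) {
  return std::strcmp(cpu.vendor, "AuthenticAMD") == 0 && cpu.family < 0x19;
}

inline Kernel SelectKernel(const CpuInfo& cpu) {
  if (!cpu.avx2 || !cpu.bmi2) return Kernel::kSse;
  return HasSlowPext(cpu) ? Kernel::kAvx2 : Kernel::kAvx2Pext;
}

inline CpuInfo DetectCpu() {
  CpuInfo cpu = {};  // zeroed: empty vendor, no features, selects kSse
#ifdef SKETCH_HAVE_CPUID
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
    // The vendor string is stored in EBX, EDX, ECX order.
    std::memcpy(cpu.vendor + 0, &ebx, 4);
    std::memcpy(cpu.vendor + 4, &edx, 4);
    std::memcpy(cpu.vendor + 8, &ecx, 4);
  }
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    unsigned base = (eax >> 8) & 0xF;
    cpu.family = base == 0xF ? base + ((eax >> 20) & 0xFF) : base;
  }
  cpu.avx2 = __builtin_cpu_supports("avx2") != 0;
  cpu.bmi2 = __builtin_cpu_supports("bmi2") != 0;
#endif
  return cpu;
}
```

Calling `SelectKernel(DetectCpu())` once at startup and caching the result would keep the check out of the hot loop. Keeping the decision in a pure function, separate from detection, also makes the PEXT exclusion easy to test without the affected hardware.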
Second batch of optimizations. These shouldn't affect the output in any way.
Most of this is an implementation of WriteBits using the PEXT instruction, which is disabled on CPUs with a slow implementation of it (AMD CPUs before Zen3).
Comparison on a 12700K:
CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).
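For readers unfamiliar with why PEXT helps when writing variable-length codes, the following is a rough sketch of the general idea, not the WriteBits implementation from this pull request. It assumes four codes sitting in separate 16-bit lanes of a 64-bit word, each with its own bit length, and uses PEXT to squeeze out the unused bits so the codes end up contiguous, LSB-first. `PextPortable`, `Pext`, `Packed` and `PackFourCodes` are hypothetical names; `_pext_u64` is the real BMI2 intrinsic and is only used when the compiler defines `__BMI2__`.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Portable reference for PEXT: gathers the bits of `src` selected by `mask`
// into the low bits of the result, preserving their order.
inline uint64_t PextPortable(uint64_t src, uint64_t mask) {
  uint64_t out = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    uint64_t lowest = mask & (~mask + 1);  // isolate lowest set mask bit
    if (src & lowest) out |= bit;
    mask &= mask - 1;                      // clear that mask bit
  }
  return out;
}

inline uint64_t Pext(uint64_t src, uint64_t mask) {
#if defined(__BMI2__)
  return _pext_u64(src, mask);
#else
  return PextPortable(src, mask);
#endif
}

struct Packed {
  uint64_t bits;   // packed codes, first code in the lowest bits
  unsigned nbits;  // total number of valid bits
};

// Hypothetical helper: packs four codes of up to 16 bits each.
inline Packed PackFourCodes(const uint16_t codes[4], const uint8_t lengths[4]) {
  uint64_t lanes = 0, mask = 0;
  unsigned total = 0;
  for (int i = 0; i < 4; ++i) {
    assert(lengths[i] <= 16);
    uint64_t lane_mask = (uint64_t{1} << lengths[i]) - 1;
    lanes |= (uint64_t{codes[i]} & lane_mask) << (16 * i);
    mask |= lane_mask << (16 * i);
    total += lengths[i];
  }
  // A single PEXT removes the gaps between the lanes.
  return {Pext(lanes, mask), total};
}
```

In this sketch, one PEXT replaces a chain of shifts and ORs whose shift amounts depend on the preceding code lengths. That only pays off where PEXT is a fast single instruction, which is why a slow-PEXT exclusion like the one described above matters.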