Closed Shark64 closed 8 months ago
Closing this PR, as there is no further activity. @Shark64, feel free to reopen it if you want to keep going.
OK, should I open a new pull request with only the constant broadcast, or can I keep the minor optimizations, such as using a base register instead of RIP-relative addressing to make instructions shorter? Thanks!
I suggest separate PRs, or a single PR with multiple commits (especially if there are dependencies).
Hi. I've tried to make a minimal version of my patch for using the AVX512 embedded broadcast feature. I've kept only three changes: the embedded broadcast, SIMD instructions to update the data pointers in place of the unrolled scalar code, and loop alignment to maximize uop-cache utilization. In SM3_MB I've also switched two macros to use VPTERNLOG instead of a sequence of separate logical instructions. The patch passes `make tests` on my PC (Linux, Rocket Lake CPU). Sorry for the new pull request, but I haven't found a way to edit the other pull request in git so that it applies only part of my changes.
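For readers unfamiliar with the VPTERNLOG change mentioned above, here is a rough sketch of how a three-input logical sequence collapses into one instruction with an 8-bit truth-table immediate. It is plain C that emulates a single 32-bit lane, not the assembly from the patch. The helper names (`ternlog32`, `imm_xor3`, `imm_maj`, `imm_choose`) are made up for illustration, and the three functions shown are the standard SM3 boolean forms; which two macros the patch actually touches is not stated here, so treat the choice as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table masks for the three VPTERNLOG inputs. Applying a boolean
 * expression to these constants yields the imm8 for that expression. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* Scalar emulation of VPTERNLOGD on one 32-bit lane (hypothetical helper).
 * For every bit position, the bits of a, b, c form a 3-bit index into imm. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << bit;
    }
    return r;
}

/* a ^ b ^ c: two XORs become one VPTERNLOG. */
static uint8_t imm_xor3(void)
{
    return (uint8_t)(TL_A ^ TL_B ^ TL_C);
}

/* Majority, (a & b) | (a & c) | (b & c): five logical ops become one. */
static uint8_t imm_maj(void)
{
    return (uint8_t)((TL_A & TL_B) | (TL_A & TL_C) | (TL_B & TL_C));
}

/* Choose, (a & b) | (~a & c): three or four logical ops become one. */
static uint8_t imm_choose(void)
{
    return (uint8_t)((TL_A & TL_B) | (~TL_A & TL_C));
}

/* Checks the emulation against the separate logical operations. */
static void ternlog_self_check(uint32_t a, uint32_t b, uint32_t c)
{
    assert(ternlog32(a, b, c, imm_xor3()) == (a ^ b ^ c));
    assert(ternlog32(a, b, c, imm_maj()) == ((a & b) | (a & c) | (b & c)));
    assert(ternlog32(a, b, c, imm_choose()) == ((a & b) | (~a & c)));
}
```

The immediates work out to 0x96, 0xE8 and 0xCA respectively. In the real code each of these would be a single VPTERNLOGD on a full vector register, which is where the saving over a chain of VPAND/VPOR/VPXOR instructions comes from.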