WIP adding Keccak via SLOTHY.
Right now this is a hybrid 4x Keccak (2 scalar, 2 Neon). I de-interleaved the previous manually interleaved code and optimized it with SLOTHY. There is still a lot of room for refactoring.
In the current state (https://github.com/slothy-optimizer/pqax/commit/c69030c65e205fc585026265bcc492da41f85024), the results look as follows:
For reference:
The 6624 is already noticeably faster than the 7288 reported in https://kannwischer.eu/papers/2022_armv8keccak.pdf. It is still slower than four runs of the 1x scalar implementation from the same paper, which was reported at 1418 (1418*4 = 5672).
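To make the comparison above easier to check, here is a small sketch of the arithmetic. It uses only the three numbers quoted in this description; the variable names are made up for illustration, and the units are whatever the benchmark and the paper report.

```python
# Figures quoted in this description (variable names are illustrative only).
slothy_4x = 6624        # this PR: hybrid 4x Keccak optimized via SLOTHY
paper_reported = 7288   # figure reported in the paper
paper_scalar_1x = 1418  # 1x scalar figure reported in the same paper

# Four sequential runs of the 1x scalar code, the baseline to beat.
scalar_4x_equiv = 4 * paper_scalar_1x

# Relative improvement over the paper's figure.
gain_vs_paper = (paper_reported - slothy_4x) / paper_reported

# Relative gap to four sequential scalar runs.
gap_vs_scalar = (slothy_4x - scalar_4x_equiv) / scalar_4x_equiv

print(f"4x scalar equivalent:     {scalar_4x_equiv}")
print(f"faster than paper figure: {gain_vs_paper:.1%}")
print(f"slower than 4x scalar:    {gap_vs_scalar:.1%}")
```

With these inputs the sketch gives 5672 for the scalar baseline, roughly a 9.1% improvement over the paper's figure, and a remaining gap of roughly 16.8% to four scalar runs.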
Related to https://github.com/slothy-optimizer/pqax/pull/6