camel-cdr opened 10 months ago
Hi @camel-cdr, Thank you for running the code (I do not have RVV 1.0 capable hardware to run it) and for your suggestions.
To be honest, I was not expecting the current vector code to be faster than the optimized scalar code, so the results you report match my expectations.
I am writing a post for my blog on these experiments. The idea was to try to come up with a baseline vector implementation I could play with. The use of either indexed loads (vluxei) or vrgather(ei16) made me very pessimistic about the performance that could be reached. And as you said, using LMUL=4 vector register groups to store 5 elements is far from efficient. There are many more ideas I would like to try (splitting the computation between scalar and vector, using LMUL=8 to pack 3 planes [15 elements], ...) but I am not sure any will pan out.
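To illustrate the inefficiency, here is a hypothetical sketch of that layout (my illustration, not the actual code from the repository):

```c
// Hypothetical layout sketch, not the repository's actual code:
// one 5-lane Keccak plane per LMUL=4 register group.
#include <riscv_vector.h>
#include <stdint.h>

static vuint64m4_t load_plane(const uint64_t state[25], int y) {
    // With VLEN=128, VLMAX for e64m4 is 8, but we only ever use vl = 5,
    // so 3/8 of every vector instruction's work is wasted.
    size_t vl = __riscv_vsetvl_e64m4(5);
    return __riscv_vle64_v_u64m4(&state[5 * y], vl);
}
```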
One suggestion I got from @mjosaarinen was to parallelize two Keccak runs and exploit VLEN=128 to speed this up, with an approach very similar to the unrolled scalar code. I need to check if this corresponds to the KeccakP-1600 times two code you linked above.
Looking at the documentation https://github.com/XKCP/XKCP/blob/master/doc/PlSnP-documentation.h I think your suggestion exactly matches Markku's: implement parallel processing of multiple Keccak states using vectors.
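To make the idea concrete, here is a rough, untested sketch of what the theta step could look like with two states interleaved element-wise (the layout and all names are mine, not XKCP's; it assumes VLEN >= 128 so vl = 2 at e64m1 holds lane (x, y) of both states):

```c
// Untested sketch of 2x-parallel theta: s[2*(x + 5*y) + k] holds lane (x,y)
// of state k (k = 0, 1). Every vector op advances both states at once.
#include <riscv_vector.h>
#include <stdint.h>

// Rotate-left on both states; with Zvbb this collapses to a single vrol.vx.
static inline vuint64m1_t rol64x2(vuint64m1_t v, int r, size_t vl) {
    return __riscv_vor_vv_u64m1(__riscv_vsll_vx_u64m1(v, r, vl),
                                __riscv_vsrl_vx_u64m1(v, 64 - r, vl), vl);
}

// C[x] = A[x,0] ^ ... ^ A[x,4], computed for both states at once.
static inline vuint64m1_t column_xor(const uint64_t *s, int x, size_t vl) {
    vuint64m1_t c = __riscv_vle64_v_u64m1(&s[2 * x], vl);
    for (int y = 1; y < 5; y++)
        c = __riscv_vxor_vv_u64m1(c,
                __riscv_vle64_v_u64m1(&s[2 * (x + 5 * y)], vl), vl);
    return c;
}

// A[x,y] ^= D[x] down one column, again for both states at once.
static inline void apply_d(uint64_t *s, int x, vuint64m1_t d, size_t vl) {
    for (int y = 0; y < 5; y++) {
        vuint64m1_t a = __riscv_vle64_v_u64m1(&s[2 * (x + 5 * y)], vl);
        __riscv_vse64_v_u64m1(&s[2 * (x + 5 * y)],
                              __riscv_vxor_vv_u64m1(a, d, vl), vl);
    }
}

static void theta_x2(uint64_t s[50]) {
    size_t vl = __riscv_vsetvl_e64m1(2); // two interleaved states
    vuint64m1_t c0 = column_xor(s, 0, vl), c1 = column_xor(s, 1, vl),
                c2 = column_xor(s, 2, vl), c3 = column_xor(s, 3, vl),
                c4 = column_xor(s, 4, vl);
    // D[x] = C[x-1] ^ ROL64(C[x+1], 1)
    apply_d(s, 0, __riscv_vxor_vv_u64m1(c4, rol64x2(c1, 1, vl), vl), vl);
    apply_d(s, 1, __riscv_vxor_vv_u64m1(c0, rol64x2(c2, 1, vl), vl), vl);
    apply_d(s, 2, __riscv_vxor_vv_u64m1(c1, rol64x2(c3, 1, vl), vl), vl);
    apply_d(s, 3, __riscv_vxor_vv_u64m1(c2, rol64x2(c4, 1, vl), vl), vl);
    apply_d(s, 4, __riscv_vxor_vv_u64m1(c3, rol64x2(c0, 1, vl), vl), vl);
}
```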
That's great, good luck with the project. You can message me if you want some more benchmarks for the blog post.
@camel-cdr, if you have some cycles available, could you benchmark the vector multi-round implementation: https://github.com/nibrunieAtSi5/rvv-keccak/blob/main/src/keccak-vector-multi-round.c (the default configuration, still relying on memory for the rho and pi steps)?
I'm getting the following:
$ ./vector-multi-round.out
total: 1671419730 byte/cycle: 0.0025094
Thank you
Part 1 of my blog post series on SHA-3 is out: https://open.substack.com/pub/fprox/p/rvv-implementation-of-keccak-sha?r=lsukk&utm_campaign=post&utm_medium=web
Thank you again for the hardware measurements @camel-cdr
Oh, I didn't know fprox is your blog, nice I've been following it for a bit.
My secret identity has been revealed :-) Thank you for following.
:-)
It might be worth citing something like "Software implementation of SHA-3 family using AVX2" to explain that not gaining a speedup over scalar in the non-parallel case is expected, and not a shortcoming of the ISA.
Hi, I'm not sure if you have real hardware at hand, so I wanted to share some benchmark runs on my Kendryte K230 with an rvv 1.0 capable C908 core.
Note that at best I'd expect the same performance as in the scalar case, because this processor takes roughly two cycles to process one 128-bit wide vector operation, while a 64-bit GPR operation takes about one cycle (see instruction timings). That works out to the same 64 bits per cycle of throughput as scalar. This is fine when working with bytes, where each vector instruction still covers 16 elements, but it makes it hard to get a speedup on 64-bit integers.
For the benchmark I measured the cycle count for
Keccak( 576, 1024, in, 1024*1024*4, 0x06, out, 512);
on random data. (I removed the instret measurements from the original code.)
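For reference, this is roughly how I take the measurement (a minimal sketch of an rdcycle-based harness; the repo's actual code may differ):

```c
// Minimal sketch of the measurement, assuming an rdcycle-based harness
// (illustrative; not copied from the repository).
#include <stdint.h>

static inline uint64_t read_cycle(void) {
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c)); // cycle CSR (Zicntr)
    return c;
}

// usage around the call under test:
//   uint64_t start = read_cycle();
//   Keccak(576, 1024, in, 1024*1024*4, 0x06, out, 512);
//   uint64_t cycles = read_cycle() - start;
//   printf("byte/cycle: %f\n", (double)(1024 * 1024 * 4) / cycles);
```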
C908:

./vector-in-reg performs very badly because the C908 has a very slow vrgather at higher LMULs; in this case an LMUL=8, SEW=64 vrgather takes about 261.1 cycles. From what I can tell, there is no processor where high-LMUL gather is particularly fast, although you should be able to expect one cycle per element processed (similar to the X280).
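For context, this is the shape of gather an in-register pi step leans on; a hypothetical illustration (the index table, layout, and VLEN >= 256 assumption are mine, not the repo's):

```c
// Hypothetical illustration, not the repo's code: a whole-state shuffle via
// one LMUL=8, SEW=64 vrgatherei16. Assumes VLEN >= 256 so all 25 lanes fit
// in a single e64m8 group (VLMAX = VLEN/64*8 >= 32); on VLEN=128 the state
// would have to be split. The index table contents are made up here.
#include <riscv_vector.h>
#include <stdint.h>

static void pi_gather(uint64_t dst[25], const uint64_t src[25],
                      const uint16_t idx[25]) {
    size_t vl = __riscv_vsetvl_e64m8(25);              // vl == 25 if VLEN >= 256
    vuint16m2_t vidx = __riscv_vle16_v_u16m2(idx, vl); // EEW=16 indices
    vuint64m8_t vs = __riscv_vle64_v_u64m8(src, vl);
    vuint64m8_t vd = __riscv_vrgatherei16_vv_u64m8(vs, vidx, vl);
    __riscv_vse64_v_u64m8(dst, vd, vl);
}
```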
./vector.out is about five times slower than scalar-opt-in-regs.out. I think this is mostly due to needing to use LMUL=4 for only a single element, effectively almost doubling the execution time. Furthermore, the C908 doesn't support Zvbb, only the scalar bitmanip extensions. vluxei might also be slow, but I haven't benchmarked that yet.

I think you'd get the fastest result from porting https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600-times2/SIMD128/KeccakP-1600-times2-SIMD128.c. It looks like you can just swap out the macros for the equivalent rvv ones, but I'm not sure what exactly is equivalent to your code here.
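By swapping out the macros, I mean something along these lines (untested sketch; the macro names mirror the SSE2 ones in the XKCP file as I recall them, and it assumes VLEN >= 128 so vl = 2 at e64m1 covers one lane of both states):

```c
// Untested sketch: hypothetical RVV counterparts for the kind of SSE2
// macros the XKCP times2 file uses. Names mirror the originals as I
// recall them; assumes VLEN >= 128 so vl = 2 fits in an e64m1 register.
#include <riscv_vector.h>
#include <stdint.h>

typedef vuint64m1_t V128;

#define LOAD128(p)       __riscv_vle64_v_u64m1((const uint64_t *)(p), 2)
#define STORE128(p, a)   __riscv_vse64_v_u64m1((uint64_t *)(p), (a), 2)
#define XOR128(a, b)     __riscv_vxor_vv_u64m1((a), (b), 2)
/* ~a & b, the chi-step primitive (_mm_andnot_si128 in the SSE2 original) */
#define ANDnu128(a, b)   __riscv_vand_vv_u64m1(__riscv_vnot_v_u64m1((a), 2), (b), 2)
/* rotate both states' lanes left by o; a single vrol.vx with Zvbb */
#define ROL64in128(a, o) __riscv_vor_vv_u64m1( \
            __riscv_vsll_vx_u64m1((a), (o), 2), \
            __riscv_vsrl_vx_u64m1((a), 64 - (o), 2), 2)
```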
Obviously a fully scalable implementation would be best. It might be possible to modify the above into something like that.
If I've got some more time I might also be able to run the code on a much faster C920, but that would require me to manually backport the assembly to rvv 0.7.1, so I'd rather do that after a bit more development.