mr-c opened this issue 1 year ago
I'm building a list of RVV benchmark results, which could be useful for this project; it currently has numbers for the C906 and C910/C920: https://camel-cdr.github.io/rvv-bench-results/
A few performance notes on other processors:
Tenstorrent's Bobcat is an open-source RVV 1.0 vector unit that integrates into BOOM. It's supposed to be Tenstorrent's proof-of-concept RVV implementation, and it doesn't support the divide and sqrt instructions. You can run smaller benchmarks by simulating the Verilog. I wasn't able to run my full benchmark yet, because simulation is quite slow and times out on long runs under Verilator.
Ara is an open-source RVV 1.0 vector unit for the CVA6 core. It doesn't support the complex permute instructions yet, although there are open PRs. It can also be simulated, but it seems to have a bunch of problems when simulated with Verilator.
X280: I don't have access to the hardware, but llvm-mca has a performance model for this specific processor. I don't know how accurate it is to the real thing, but you can check it out here. It has a VLEN of 512, and llvm-mca reports 2 cycles for most LMUL=1 operations; reductions and compress/gather are very slow, though.
C908: supports RVV 1.0 and is supposed to sit between the C906 and C910/C920 in performance. I'll hopefully get mine next week.
I think the biggest problem with directly porting fixed-size SIMD to RVV is how to choose the LMUL.
For SSE and NEON the answer is simple: choose LMUL=1, because the V extension, as required by the application profiles, mandates a VLEN of at least 128 bits.
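A minimal sketch of that direct mapping with the RVV C intrinsics (the function name and the fixed vl of 4 are mine, purely for illustration, not from SIMDe):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Sketch of porting a 128-bit integer add (_mm_add_epi32 / vaddq_s32) at
 * LMUL=1: four 32-bit lanes always fit in one vector register, because the
 * V extension guarantees VLEN >= 128. */
static inline void add_i32x4(int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
    size_t vl = __riscv_vsetvl_e32m1(4);            /* request exactly 4 elements */
    vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);   /* load both 128-bit inputs   */
    vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
    __riscv_vse32_v_i32m1(dst, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
}
```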
For AVX2 and AVX-512 it's less trivial. If you always choose LMUL=2 for AVX2, the code will run on all CPUs that implement V, but it may waste half of the available registers if VLEN >= 256.
It may also run at half the possible speed (see Bobcat), because currently most processors (the C9* series, Ocelot, probably anything with VLEN=128) dispatch based on LMUL, not on the set vl, so a vl=1 operation would be slower with LMUL=2 than with LMUL=1.
Ara, on the other hand, dispatches based on the set vl, so a vl=1 LMUL=1 operation is as fast as a vl=1 LMUL=2 operation. This is needed in Ara because it has a very large default VLEN of 4096, and I suspect most implementations with a large VLEN will eventually do something similar. In Ara this even works with fractional LMUL; here are measurements for unrolled 8-bit adds:
vl: cycles/instruction
8: 3
16: 3
32: 3
64: 3 // vlmax for e8, mf8
128: 4 // vlmax for e8, mf4
256: 8 // vlmax for e8, mf2
512: 16 // vlmax for e8, m1
The question is whether to calculate the best LMUL at startup, to add the fixed VLEN as a configurable parameter, or to add the minimum supported VLEN as a configurable parameter.
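For the first option, here is a rough sketch of querying VLEN at startup with the intrinsics and deriving an LMUL from it; the policy function and its threshold are placeholders, not a recommendation:

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Sketch: vlmax for e8/m1 equals VLEN/8 elements, so multiplying by 8
 * recovers VLEN in bits at runtime. */
static size_t rvv_vlen_bits(void)
{
    return __riscv_vsetvlmax_e8m1() * 8;
}

/* Placeholder policy for 256-bit (AVX2-sized) operations: use LMUL=2 only
 * when a single LMUL=1 register is too narrow to hold 256 bits. */
static size_t lmul_for_256bit(void)
{
    return rvv_vlen_bits() >= 256 ? 1 : 2;
}
```

The other two options would instead bake the same VLEN value in as a compile-time parameter.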
Thank you very much @camel-cdr for sharing your benchmarking project! I'm also getting the same C908 dev board you are waiting on, and likewise it's expected next week :-)
For others looking to contribute, there is an open application to receive a Kendryte K230 developer board with RVV 1.0 from Canaan:
https://docs.google.com/forms/d/e/1FAIpQLSeZ6GBvZynKFm4w7ZRdI_NRyzgVcr4NSxuPZNLZ8__K9Y2WbA/viewform (background information)
I'm happy to help you with your application, please contact me directly for that.
Just dropping a few more references here:
neon2rvv: https://github.com/howjmay/neon2rvv
If somebody plans to port vzip with RVV intrinsics: https://github.com/riscv-non-isa/rvv-intrinsic-doc/issues/289#issuecomment-1781385001
The RISC-V Summit talk about the RVV SIMDe paper is online: https://www.youtube.com/watch?v=puvnghbIAV4
(I don't plan on doing this myself, but I wanted to start the conversation to see who is interested in doing this)
What
Use RISC-V vector intrinsics to provide optimized implementations of the existing intrinsics (X86, ARM Neon, MIPS MSA, WASM, etc.) already in SIMD Everywhere.
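As one hedged example of the approach (SIMDe's actual types, naming, and dispatch macros differ; this only sketches the idea), NEON's fused multiply-add vfmaq_f32 maps naturally onto RVV's vfmacc.vv:

```c
#include <riscv_vector.h>

/* Illustrative stand-in for SIMDe's portable float32x4 type; SIMDe's real
 * representation and naming differ. */
typedef struct { float values[4]; } f32x4_sketch;

/* Hypothetical RVV backend for NEON's vfmaq_f32 (r = a + b*c, fused):
 * vfmacc.vv accumulates the product into its first operand. */
static inline f32x4_sketch fma_f32x4_sketch(f32x4_sketch a, f32x4_sketch b, f32x4_sketch c)
{
    f32x4_sketch r;
    size_t vl = __riscv_vsetvl_e32m1(4);   /* a 128-bit NEON vector holds 4 floats */
    vfloat32m1_t va = __riscv_vle32_v_f32m1(a.values, vl);
    vfloat32m1_t vb = __riscv_vle32_v_f32m1(b.values, vl);
    vfloat32m1_t vc = __riscv_vle32_v_f32m1(c.values, vl);
    __riscv_vse32_v_f32m1(r.values, __riscv_vfmacc_vv_f32m1(va, vb, vc, vl), vl);
    return r;
}
```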
Existing work
VLEN of 128 bits).
When to start
The vector extensions themselves were ratified in 2021. The intrinsics for using them from C/C++ are nearly ratified (see below), so we can start accepting contributions now.
Recent draft: https://github.com/riscv-non-isa/rvv-intrinsic-doc/releases/download/draft-20231014-c10de5388709b000ecc4becb0d9ee16baa0141a9/v-intrinsic-spec.pdf (latest drafts)
https://github.com/riscv-non-isa/rvv-intrinsic-doc
Which compilers to test?
Benchmarking
Maybe autovectorization is good enough. Hand-written implementations should be compared both by instruction count and by real-world performance.
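For the real-world side, a rough harness sketch; rvv_kernel is a placeholder for whichever implementation is under test, and instruction counts would be compared separately from the generated assembly:

```c
#include <stdio.h>
#include <time.h>

extern void rvv_kernel(void);   /* placeholder: implementation under test */

/* Sketch: average wall-clock time per call over many iterations. */
static double ns_per_call(long iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        rvv_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    printf("%.2f ns/call\n", ns_per_call(1000000));
    return 0;
}
```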
Please share any suggestions for publicly available RISC-V Vector 1.0 systems.
https://riscv.org/risc-v-developer-boards/details/
https://www.riscfive.com/risc-v-development-boards/ lists some boards with the V extension, but I can't find a public declaration that any of them follow the 1.0 version of the vector extension.
According to https://doi.org/10.48550/arXiv.2210.08882, the following cores implement v1.0 of the RISC-V Vector Extension: SiFive X280, Andes NX27V, Atrevido 220. Notably for the riscfive.com list of dev boards, the XuanTie 910 core is RVV version 0.7.1.