mr-c opened this issue 1 year ago
I'm building a list of RVV benchmark results, which could be useful for this project; it currently has numbers for the C906 and C910/C920: https://camel-cdr.github.io/rvv-bench-results/
A few performance notes on other processors:
Tenstorrent's Bobcat is an open-source RVV 1.0 vector unit that integrates into BOOM. It's supposed to be Tenstorrent's proof-of-concept RVV implementation, and it doesn't support the divide and sqrt instructions. You can run smaller benchmarks by simulating the Verilog. I wasn't able to run my full benchmark yet, because simulation is quite slow and times out on long runs under Verilator.
Ara is an open-source RVV 1.0 vector unit for the CVA6 core. It doesn't support the complex permute instructions yet, although there are open PRs. It can also be simulated, but it seems to have a bunch of problems when simulated with Verilator.
X280: I don't have access to the hardware, but llvm-mca has a performance model for this specific processor. I don't know how accurate it is to the real thing, but you can check it out here. It has a VLEN of 512, and llvm-mca reports 2 cycles for most LMUL=1 operations; reductions and compress/gather are very slow, though.
C908: supports RVV 1.0 and is supposed to sit between the C906 and C910/C920 in performance. I'll hopefully get mine next week.
I think the biggest problem with directly porting fixed-size SIMD to RVV is how to choose the LMUL.
For SSE and NEON the answer is simple: choose LMUL=1, because the V extension, as required by the application profiles, mandates a VLEN of at least 128 bits.
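A minimal sketch of that direct mapping with the RVV C intrinsics (the function name and the fixed vl of 4 are mine, purely for illustration, not from SIMDe):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Sketch of porting a 128-bit integer add (_mm_add_epi32 / vaddq_s32) at
 * LMUL=1: four 32-bit lanes always fit in one vector register, because the
 * V extension guarantees VLEN >= 128. */
static inline void add_i32x4(int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
    size_t vl = __riscv_vsetvl_e32m1(4);            /* request exactly 4 elements */
    vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);   /* load both 128-bit inputs   */
    vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
    __riscv_vse32_v_i32m1(dst, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
}
```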
For AVX2 and AVX-512 it's less trivial. If you always choose LMUL=2 for AVX2, the code will run on all CPUs that implement V, but it may waste half of the available registers if VLEN >= 256.
It may also run at half the possible speed (see Bobcat), because currently most processors (the C9* series, Ocelot, probably anything with VLEN=128) dispatch based on LMUL, not on the set vl, so a vl=1 operation would be slower with LMUL=2 than with LMUL=1.
Ara, on the other hand, dispatches based on the set vl, so a vl=1 LMUL=1 operation is as fast as a vl=1 LMUL=2 operation. This is needed in Ara because it has a very large default VLEN of 4096, and I suspect most implementations with a large VLEN will eventually do something similar. In Ara this even works with fractional LMUL; here are measurements for unrolled 8-bit adds:
vl: cycles/instruction
8: 3
16: 3
32: 3
64: 3 // vlmax for e8, mf8
128: 4 // vlmax for e8, mf4
256: 8 // vlmax for e8, mf2
512: 16 // vlmax for e8, m1
The question is whether to calculate the best LMUL at startup, to add the fixed VLEN as a configurable parameter, or to add the minimum supported VLEN as a configurable parameter.
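For the first option, here is a rough sketch of querying VLEN at startup with the intrinsics and deriving an LMUL from it; the policy function and its threshold are placeholders, not a recommendation:

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Sketch: vlmax for e8/m1 equals VLEN/8 elements, so multiplying by 8
 * recovers VLEN in bits at runtime. */
static size_t rvv_vlen_bits(void)
{
    return __riscv_vsetvlmax_e8m1() * 8;
}

/* Placeholder policy for 256-bit (AVX2-sized) operations: use LMUL=2 only
 * when a single LMUL=1 register is too narrow to hold 256 bits. */
static size_t lmul_for_256bit(void)
{
    return rvv_vlen_bits() >= 256 ? 1 : 2;
}
```

The other two options would instead bake the same VLEN value in as a compile-time parameter.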
Thank you very much @camel-cdr for sharing your benchmarking project! I'm also getting the same C908 dev board you are waiting on, and likewise it's expected next week :-)
For others looking to contribute, there is an open application to receive a Kendryte K230 developer board with RVV 1.0 from Canaan:
https://docs.google.com/forms/d/e/1FAIpQLSeZ6GBvZynKFm4w7ZRdI_NRyzgVcr4NSxuPZNLZ8__K9Y2WbA/viewform (background information)
I'm happy to help you with your application, please contact me directly for that.
Just dropping a few more references here:
neon2rvv: https://github.com/howjmay/neon2rvv
If somebody plans to port vzip with RVV intrinsics: https://github.com/riscv-non-isa/rvv-intrinsic-doc/issues/289#issuecomment-1781385001
The RISC-V Summit talk about the RVV SIMDe paper is online: https://www.youtube.com/watch?v=puvnghbIAV4
(I don't plan on doing this myself, but I wanted to start the conversation to see who is interested in doing this)
What
Use RISC-V vector intrinsics to provide optimized implementations of the existing intrinsics (X86, ARM Neon, MIPS MSA, WASM, etc.) already in SIMD Everywhere.
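As one hedged example of the approach (SIMDe's actual types, naming, and dispatch macros differ; this only sketches the idea), NEON's fused multiply-add vfmaq_f32 maps naturally onto RVV's vfmacc.vv:

```c
#include <riscv_vector.h>

/* Illustrative stand-in for SIMDe's portable float32x4 type; SIMDe's real
 * representation and naming differ. */
typedef struct { float values[4]; } f32x4_sketch;

/* Hypothetical RVV backend for NEON's vfmaq_f32 (r = a + b*c, fused):
 * vfmacc.vv accumulates the product into its first operand. */
static inline f32x4_sketch fma_f32x4_sketch(f32x4_sketch a, f32x4_sketch b, f32x4_sketch c)
{
    f32x4_sketch r;
    size_t vl = __riscv_vsetvl_e32m1(4);   /* a 128-bit NEON vector holds 4 floats */
    vfloat32m1_t va = __riscv_vle32_v_f32m1(a.values, vl);
    vfloat32m1_t vb = __riscv_vle32_v_f32m1(b.values, vl);
    vfloat32m1_t vc = __riscv_vle32_v_f32m1(c.values, vl);
    __riscv_vse32_v_f32m1(r.values, __riscv_vfmacc_vv_f32m1(va, vb, vc, vl), vl);
    return r;
}
```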
Existing work
VLEN of 128 bits).
When to start
The vector extensions themselves were ratified in 2021. The intrinsics for using them from C/C++ are nearly ratified (see below), so we can start accepting contributions now.
Recent draft: https://github.com/riscv-non-isa/rvv-intrinsic-doc/releases/download/draft-20231014-c10de5388709b000ecc4becb0d9ee16baa0141a9/v-intrinsic-spec.pdf (latest drafts)
https://github.com/riscv-non-isa/rvv-intrinsic-doc
Which compilers to test?
Benchmarking
Maybe autovectorization is good enough. Hand-written implementations should be compared both by instruction count and by real-world performance.
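For the real-world side, a rough harness sketch; rvv_kernel is a placeholder for whichever implementation is under test, and instruction counts would be compared separately from the generated assembly:

```c
#include <stdio.h>
#include <time.h>

extern void rvv_kernel(void);   /* placeholder: implementation under test */

/* Sketch: average wall-clock time per call over many iterations. */
static double ns_per_call(long iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        rvv_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    printf("%.2f ns/call\n", ns_per_call(1000000));
    return 0;
}
```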
Please share any suggestions for publicly available RISC-V Vector 1.0 systems.
https://riscv.org/risc-v-developer-boards/details/
https://www.riscfive.com/risc-v-development-boards/ lists some boards with the V extension, but I can't find a public declaration that any of them follow the 1.0 version of the vector extension.
According to https://doi.org/10.48550/arXiv.2210.08882, the following cores implement v1.0 of the RISC-V Vector Extension: SiFive X280, Andes NX27V, Atrevido 220. Notably for the riscfive.com list of dev boards, the XuanTie 910 core is RVV version 0.7.1.