Closed eric900115 closed 5 months ago
Very exciting! Has any testing been done on real RISCV64 RVV 1.0 hardware?
For testing, we have only tested the code using QEMU and the Spike simulator without real RISC-V RVV 1.0 hardware.
Amazing work! I ran a quick benchmark on the kendryte k230 (thead C908) with this neon mandelbrot code and my handwritten rvv mandelbrot code (slightly adjusted to fit the neon, godbolt link):
rvv LMUL=2: 287907470 cycles
rvv LMUL=1: 419831245 cycles
neon: 536969360 cycles
scalar: 1695304921 cycles
(this was run with 256 iterations and generated a 1440x1080 image)
This is a 3.1x speedup, and close to the hand-optimized rvv LMUL=1
implementation!
rvv LMUL=2
is faster, but we can't really expect SIMDe the vector length. A future avx2 implementation might be able to generate such code. See C910 and C908 for a comparison of rvv implementations.
Edit: give me a minute, the numbers should be roughly correct, but I'm revising the neon code slightly done
Thanks @camel-cdr ! Can you run all the SIMDe tests from this PR on the k230?
Thanks @camel-cdr ! Can you run all the SIMDe tests from this PR on the k230?
I'm currently working on that, however I run into problems with the glibc version on the k230. I used a freestanding build for the benchmark.
I couldn't figure out how to get the glibc versions to align.
To compile SIMDe with support for the conversion from NEON to the RISC-V Vector Extension, please use Clang-17 and include the flag
-mrvv-vector-bits=<vector_length_of_vector_machine>
during compilation.
What's the plan for GCC support?
What about portable binaries, when will we not have to specify -mrvv-vector-bits
?
when will we not have to specify
-mrvv-vector-bits
?
The reason for specifying -mrvv-vector-bits
is due to the limitation that RVV types cannot be included in Neon global structs. For instance, in simde_int64x1_private, trying to declare RVV type vint64m1_t sv64 could result in an error related to the sizeless type issue when compiling. Consider the following example:
typedef union {
SIMDE_ARM_NEON_DECLARE_VECTOR(int64_t, values, 8);
#if defined(SIMDE_X86_MMX_NATIVE)
__m64 m64;
#endif
#if defined(SIMDE_RISCV_V_NATIVE)
vint64m1_t sv64; // This causes a sizeless type issue.
#endif
} simde_int64x1_private;
However, this issue is solved by this llvm patch (https://reviews.llvm.org/D145088) (-mrvv-vector-bits
), which enables the use of fixed-length RVV types.
The scenario is the same as the issue described at: https://github.com/simd-everywhere/simde/issues/914.
In conclusion, the -mrvv-vector-bits
option is necessary for specifying fixed-length RVV types. It must be specified at all times when compiling Neon to RVV.
In conclusion, the
-mrvv-vector-bits
option is necessary for specifying fixed-length RVV types. It must be specified at all times when compiling Neon to RVV.
Can you add some text to the README.md
about using SIMDe on RISC-V?
What's the plan for GCC support?
For supporting GCC, we plan to find a flag that enables the use of fixed-length RVV types, similar to -mrvv-vector-bits
in LLVM.
Can you add some text to the
README.md
about using SIMDe on RISC-V?
Sure !
What about portable binaries, when will we not have to specify
-mrvv-vector-bits
?
Creating portable binaries for RVV (RISC-V Vector Extension) is not feasible, as explained in the discussion at https://news.ycombinator.com/item?id=37706070. To summarize, the vector size in RVV is determined at compile time, making it impossible to create binaries that can be ported seamlessly between RVV machines with different vector lengths.
Would a binary for a smaller vector size work on a CPU with a larger vector size?
Maybe a future RISC-V profile will mandate a larger vector size.
I guess for Debian and others that want to maximize the performance of SIMDe using apps, we will have to compile multiple times based on the vector widths that are commercially available. Which is what we already do to support the various x86-64 SIMD intrinsics (https://wiki.debian.org/SIMDEverywhere and https://packages.debian.org/source/testing/subarch-select)
@eric900115 For neon RVV codegen can be 100% portable. If we require the standard V extension (VLEN>=128 and ELEN=64), then we can use LMUL=1 on all implementations, because even for e.g. LMUL=512 a single vector registed does alteast contain 128 bits. We just need to vsetivli
to the fixed element count properly.
I'm not sure if -mrvv-vector-bits=128
guarantees to work with VLEN>=128, but it would certainly be possible.
I played arround with using fixed element count load/stores to implement something like this, when the fixed width support didn't exist, but at the time compilers couldn't do the load/store elimination, so it was kindof useless.
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>
typedef struct { uint8_t arr[16]; } V128;
static
V128 vadd8(V128 a, V128 b)
{
vuint8m1_t A = __riscv_vle8_v_u8m1((void*)&a,16);
vuint8m1_t B = __riscv_vle8_v_u8m1((void*)&b,16);
vuint8m1_t C = __riscv_vadd_vv_u8m1(A, B, 16);
V128 c;
__riscv_vse8_v_u8m1((void*)&c, C, 16);
return c;
}
V128 test1(V128 a, V128 b, V128 c)
{
return vadd8(vadd8(a, b), vadd8(c, c));
}
V128 test2(V128 a, V128 b, V128 c)
{
vuint8m1_t A = __riscv_vle8_v_u8m1((void*)&a,16);
vuint8m1_t B = __riscv_vle8_v_u8m1((void*)&b,16);
vuint8m1_t C = __riscv_vle8_v_u8m1((void*)&c,16);
V128 r;
__riscv_vse8_v_u8m1((void*)&r, __riscv_vadd_vv_u8m1(__riscv_vadd_vv_u8m1(A, B, 16), __riscv_vadd_vv_u8m1(C, C, 16), 16), 16);
return r;
}
I couldn't figure out how to get the glibc versions to align.
Here's my meson setup --cross ...
config for using https://packages.debian.org/unstable/clang-18 and running on the official Debian image:
[binaries]
c = 'clang-18'
cpp = 'clang++-18'
ar = 'llvm-ar-18'
strip = 'llvm-strip-18'
objcopy = 'llvm-objcopy-18'
ld = 'riscv64-linux-gnu-ld'
[properties]
c_args = ['--target=riscv64-linux-gnu', '-isystem=/usr/riscv64-linux-gnu/include', '-Wextra', '-Werror', '-march=rv64imafdcv_zihintpause_zfh_zba_zbb_zbc_zbs_zicsr_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b', '-O3', '-mrvv-vector-bits=128']
cpp_args = ['--target=riscv64-linux-gnu', '-isystem=/usr/riscv64-linux-gnu/include', '-Wextra', '-Werror', '-march=rv64imafdcv_zihintpause_zfh_zba_zbb_zbc_zbs_zicsr_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b', '-O3', '-mrvv-vector-bits=128']
c_link_args = ['--target=riscv64-linux-gnu', '-static', '-static-libgcc']
cpp_link_args = ['--target=riscv64-linux-gnu', '-static', '-static-libgcc', '-static-libstdc++']
[host_machine]
system = 'linux'
cpu_family = 'riscv64'
cpu = 'thead-c906'
endian = 'little'
Here's my
meson setup --cross ...
config for using https://packages.debian.org/unstable/clang-18 and running on the official Debian image:
Thanks it worked. I didn't know about the debian image, and was using the k230_sdk thingy.
Running for i in *native*; do ./$i; done
in the arm/neon
directory results in the following errors:
../test/arm/neon/fma_lane.c:1163: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-382857.250000 ~= 503169.843750)
test/arm/neon/fma_lane.cpp:1163: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-382857.250000 ~= 503169.843750)
../test/arm/neon/fms_lane.c:873: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-506717.906250 ~= 554470.187500)
test/arm/neon/fms_lane.cpp:873: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-506717.906250 ~= 554470.187500)
../test/arm/neon/mul_lane.c:865: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
test/arm/neon/mul_lane.cpp:865: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
../test/arm/neon/mulx_lane.c:315: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
test/arm/neon/mulx_lane.cpp:315: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
../test/arm/neon/qrdmlah.c:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
test/arm/neon/qrdmlah.cpp:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlsh.c:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
test/arm/neon/qrdmlsh.cpp:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmulh_lane.c:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
test/arm/neon/qrdmulh_lane.cpp:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
../test/arm/neon/uqadd.c:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
test/arm/neon/uqadd.cpp:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
@camel-cdr Thanks!
Yeah, I'm now also seeing those errors.
I wonder why the qemu setup in this PR isn't reproducing them?
I hope it isn't a hardware error! :-)
Also, hello @eric900115 and @camel-cdr from the Debian Med Sprint in Berlin. Maybe you can join us in person next year? https://wiki.debian.org/Sprints/2023/DebianMed2024
Sounds interesting, Berlin is only 2-3 hours away from me. But I'm not really involved with Debian (except for running it).
Btw, do you know how Debian deals with compiler bugs? I just ran into an gcc-13.2.0 codegen bug, that causes a valid program to not work vsetvli a5,a1,e8,m8,ta,ma
should be vsetvli a5,a1,e8,m8,tu,ma
. It's been fixed on trunk, but is this a thing that would be back-ported?
Btw, do you know how Debian deals with compiler bugs? I just ran into an gcc-13.2.0 codegen bug, that causes a valid program to not work
vsetvli a5,a1,e8,m8,ta,ma
should bevsetvli a5,a1,e8,m8,tu,ma
. It's been fixed on trunk, but is this a thing that would be back-ported?
I would personally respond positively to a reportbug gcc-13
with a link to the upstream fix, but I don't know that team so I can't make promises.
Sounds interesting, Berlin is only 2-3 hours away from me. But I'm not really involved with Debian (except for running it).
Anyone is welcome! We appreciate the user perspective!
@camel-cdr Thanks!
Yeah, I'm now also seeing those errors.
I wonder why the qemu setup in this PR isn't reproducing them?
I hope it isn't a hardware error! :-)
I am also wondering. I'll try to use qemu with same configuration for testing (testing with thread-c906 CPU).
@camel-cdr Do you also get failures on the k230 with the current master
branch of SIMDe?
I'm seeing failures in
arm/neon/{q,}abs/{emul,native}/{c,cpp}
arm/neon/qrdml{a,s}h_lane/{emul,native}/{c,cpp}
arm/neon/st3/{emul,native}/{c,cpp}
So I guess there are some clang and/or CPU errors .. ?
@mr-c Yes, I get similar errors when testing master
:
../test/arm/neon/abs.c:711: assertion failed: r[0] == simde_vld1q_s32(test_vec[i].r)[0] (0 == -2147483648)
test/arm/neon/abs.cpp:711: assertion failed: r[0] == simde_vld1q_s32(test_vec[i].r)[0] (0 == -2147483648)
timeout qabs-native-c
timeout qabs-native-cpp
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlah_lane.c:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
../test/arm/neon/qrdmlah_lane.c:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
../test/arm/neon/qrdmlah_lane.c:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
test/arm/neon/qrdmlah_lane.cpp:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
test/arm/neon/qrdmlah_lane.cpp:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmlsh_lane.c:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
I'm somewhat inclined to believe it's a clang miss-compilation, because I had a gcc-13.2 miss-compilation yesterday, I suppose we need to investigate this somehow.
Hey @camel-cdr ; in #1141 I fixed some of the NEON abs functions. Maybe you have time to re-run the tests?
@mr-c Here we go, looks like the abs errors are gone, great work.
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlah_lane.c:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
../test/arm/neon/qrdmlah_lane.c:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
../test/arm/neon/qrdmlah_lane.c:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
test/arm/neon/qrdmlah_lane.cpp:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
test/arm/neon/qrdmlah_lane.cpp:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmlsh_lane.c:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
I have modified mul_lane and mulx_lane. Hope the error in fms_lane, fma_lane, mul_lane, and mulx_lane will be eliminated.
Hello @eric900115, this is really good stuff. I have a question, you mentioned you converted all Neon intrinsics to RVV, does that exclude bf16 and cryptography instructions which may not be easily replicated with base V? I think trunk LLVM contains experimental intrinsics for Zvfbfwma and Vector crypto.
Is there anything you might need help with?
Thanks @camel-cdr ; can you retest the latest?
@mr-c the errors are still there, but the values are different now:
../test/arm/neon/qrdmlah.c:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
test/arm/neon/qrdmlah.cpp:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlsh.c:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
test/arm/neon/qrdmlsh.cpp:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmulh_lane.c:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
test/arm/neon/qrdmulh_lane.cpp:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
../test/arm/neon/uqadd.c:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
test/arm/neon/uqadd.cpp:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
@camel-cdr are those errors from the emul
or native
tests?
@mr-c it was the native tests, I ran it via: for i in *native*; do ./$i; done > /dev/null
@OMaghiarIMG
Hi! Yes, we excluded BF16 and cryptography for conversion.
For the conversion from NEON to RVV, if the performance (instruction counts) of using single or multiple RVV intrinsics is better than automatic vectorization, then we use RVV intrinsics for implementation. Otherwise, we use loop automatic vectorization from SIMDe.
Thank you @eric900115 ! Now that SIMDe 0.8.0 is released we can focus the next development cycle on RVV 1.0 implementations.
Hi everyone,
This is Eric from National Tsing Hua University (NTHU) pllab. This PR includes the initialization of the conversion of Neon to RISC-V Vector Extension (RVV) for SIMDe.
NTHU pllab and Andes Technology have collaborated to convert NEON intrinsics to the RISC-V Vector Extension, and we have converted all NEON intrinsics to RVV intrinsics. This PR marks the beginning of our work. We will soon upstream all of our work.
We made a few changes in the SIMDe repo to suit our needs:
type.h
to enable RVV types.memcpy
due to memory pollution issues in our implementation.add.h
andmul.h
.We have included clang-qemu-rvv testing for the following RISC-V V Extension architectures, both with and without ZVFH enabled:
vlen = 128 & elen = 64
vlen = 256 & elen = 64
vlen = 512 & elen = 64
To compile SIMDe with support for the conversion from NEON to the RISC-V Vector Extension, please use Clang-17 and include the flag with the actual vector length of RISC-V vector machine.
-mrvv-vector-bits=<vector_length_of_vector_machine>
during compilation. Replace