Initial Support for the RISC-V Vector Extension in ARM NEON

eric900115 commented 6 months ago

Hi everyone,

This is Eric from National Tsing Hua University (NTHU) pllab. This PR includes the initialization of the conversion of Neon to RISC-V Vector Extension (RVV) for SIMDe.

NTHU pllab and Andes Technology have collaborated to convert NEON intrinsics to the RISC-V Vector Extension, and we have converted all NEON intrinsics to RVV intrinsics. This PR marks the beginning of our work. We will soon upstream all of our work.

We made a few changes in the SIMDe repo to suit our needs:

We modified type.h to enable RVV types.
We modified files related to memcpy due to memory pollution issues in our implementation.
We modified files related to memory load and store for RVV implementation.
The initial conversion from Neon to RVV can be viewed in add.h and mul.h.
For CI, we've added clang-qemu-rvv in the CI.yml file for RVV testing. The testing runs on a Docker container with Ubuntu 23.10 because we need QEMU 8.0 (which is not available in Ubuntu 22.04). Later, when Ubuntu 24.04 is released on GitHub Actions, testing can be hosted on the Ubuntu 24.04 GitHub Actions machine.

We have included clang-qemu-rvv testing for the following RISC-V V Extension architectures, both with and without ZVFH enabled:

vlen = 128 & elen = 64
vlen = 256 & elen = 64
vlen = 512 & elen = 64

To compile SIMDe with support for the conversion from NEON to the RISC-V Vector Extension, please use Clang-17 and include the flag -mrvv-vector-bits=<vector_length_of_vector_machine> during compilation. Replace with the actual vector length of RISC-V vector machine.

eric900115 commented 6 months ago

Very exciting! Has any testing been done on real RISCV64 RVV 1.0 hardware?

For testing, we have only tested the code using QEMU and the Spike simulator without real RISC-V RVV 1.0 hardware.

camel-cdr commented 6 months ago

Amazing work! I ran a quick benchmark on the kendryte k230 (thead C908) with this neon mandelbrot code and my handwritten rvv mandelbrot code (slightly adjusted to fit the neon, godbolt link):

rvv LMUL=2:  287907470 cycles
rvv LMUL=1:  419831245 cycles
neon:        536969360 cycles
scalar:     1695304921 cycles

(this was run with 256 iterations and generated a 1440x1080 image)

This is a 3.1x speedup, and close to the hand-optimized rvv LMUL=1 implementation!

rvv LMUL=2 is faster, but we can't really expect SIMDe the vector length. A future avx2 implementation might be able to generate such code. See C910 and C908 for a comparison of rvv implementations.

Edit: ~~give me a minute, the numbers should be roughly correct, but I'm revising the neon code slightly~~ done

mr-c commented 6 months ago

Thanks @camel-cdr ! Can you run all the SIMDe tests from this PR on the k230?

camel-cdr commented 6 months ago

Thanks @camel-cdr ! Can you run all the SIMDe tests from this PR on the k230?

I'm currently working on that, however I run into problems with the glibc version on the k230. I used a freestanding build for the benchmark.

camel-cdr commented 6 months ago

I couldn't figure out how to get the glibc versions to align.

mr-c commented 6 months ago

To compile SIMDe with support for the conversion from NEON to the RISC-V Vector Extension, please use Clang-17 and include the flag -mrvv-vector-bits=<vector_length_of_vector_machine> during compilation.

What's the plan for GCC support?

What about portable binaries, when will we not have to specify -mrvv-vector-bits?

eric900115 commented 6 months ago

when will we not have to specify -mrvv-vector-bits?

The reason for specifying -mrvv-vector-bits is due to the limitation that RVV types cannot be included in Neon global structs. For instance, in simde_int64x1_private, trying to declare RVV type vint64m1_t sv64 could result in an error related to the sizeless type issue when compiling. Consider the following example:

typedef union {
  SIMDE_ARM_NEON_DECLARE_VECTOR(int64_t, values, 8);

  #if defined(SIMDE_X86_MMX_NATIVE)
    __m64 m64;
  #endif

  #if defined(SIMDE_RISCV_V_NATIVE)
    vint64m1_t sv64; // This causes a sizeless type issue.
  #endif

} simde_int64x1_private;

However, this issue is solved by this llvm patch (https://reviews.llvm.org/D145088) (-mrvv-vector-bits), which enables the use of fixed-length RVV types.

The scenario is the same as the issue described at: https://github.com/simd-everywhere/simde/issues/914.

In conclusion, the -mrvv-vector-bits option is necessary for specifying fixed-length RVV types. It must be specified at all times when compiling Neon to RVV.

mr-c commented 6 months ago

In conclusion, the -mrvv-vector-bits option is necessary for specifying fixed-length RVV types. It must be specified at all times when compiling Neon to RVV.

Can you add some text to the README.md about using SIMDe on RISC-V?

eric900115 commented 6 months ago

What's the plan for GCC support?

For supporting GCC, we plan to find a flag that enables the use of fixed-length RVV types, similar to -mrvv-vector-bits in LLVM.

eric900115 commented 6 months ago

Can you add some text to the README.md about using SIMDe on RISC-V?

Sure !

eric900115 commented 6 months ago

What about portable binaries, when will we not have to specify -mrvv-vector-bits?

Creating portable binaries for RVV (RISC-V Vector Extension) is not feasible, as explained in the discussion at https://news.ycombinator.com/item?id=37706070. To summarize, the vector size in RVV is determined at compile time, making it impossible to create binaries that can be ported seamlessly between RVV machines with different vector lengths.

mr-c commented 6 months ago

Would a binary for a smaller vector size work on a CPU with a larger vector size?

Maybe a future RISC-V profile will mandate a larger vector size.

I guess for Debian and others that want to maximize the performance of SIMDe using apps, we will have to compile multiple times based on the vector widths that are commercially available. Which is what we already do to support the various x86-64 SIMD intrinsics (https://wiki.debian.org/SIMDEverywhere and https://packages.debian.org/source/testing/subarch-select)

camel-cdr commented 6 months ago

@eric900115 For neon RVV codegen can be 100% portable. If we require the standard V extension (VLEN>=128 and ELEN=64), then we can use LMUL=1 on all implementations, because even for e.g. LMUL=512 a single vector registed does alteast contain 128 bits. We just need to vsetivli to the fixed element count properly. I'm not sure if -mrvv-vector-bits=128 guarantees to work with VLEN>=128, but it would certainly be possible. I played arround with using fixed element count load/stores to implement something like this, when the fixed width support didn't exist, but at the time compilers couldn't do the load/store elimination, so it was kindof useless.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t arr[16]; } V128;

static
V128 vadd8(V128 a, V128 b)
{
    vuint8m1_t A = __riscv_vle8_v_u8m1((void*)&a,16);
    vuint8m1_t B = __riscv_vle8_v_u8m1((void*)&b,16);
    vuint8m1_t C = __riscv_vadd_vv_u8m1(A, B, 16);
    V128 c;
    __riscv_vse8_v_u8m1((void*)&c, C, 16);
   return c;
}

V128 test1(V128 a, V128 b, V128 c)
{
    return vadd8(vadd8(a, b), vadd8(c, c));
}

V128 test2(V128 a, V128 b, V128 c)
{
    vuint8m1_t A = __riscv_vle8_v_u8m1((void*)&a,16);
    vuint8m1_t B = __riscv_vle8_v_u8m1((void*)&b,16);
    vuint8m1_t C = __riscv_vle8_v_u8m1((void*)&c,16);
    V128 r;
    __riscv_vse8_v_u8m1((void*)&r, __riscv_vadd_vv_u8m1(__riscv_vadd_vv_u8m1(A, B, 16), __riscv_vadd_vv_u8m1(C, C, 16), 16), 16);
    return r;
}

mr-c commented 6 months ago

I couldn't figure out how to get the glibc versions to align.

Here's my meson setup --cross ... config for using https://packages.debian.org/unstable/clang-18 and running on the official Debian image:

[binaries]
c = 'clang-18'
cpp = 'clang++-18'
ar = 'llvm-ar-18'
strip = 'llvm-strip-18'
objcopy = 'llvm-objcopy-18'
ld = 'riscv64-linux-gnu-ld'

[properties]
c_args   = ['--target=riscv64-linux-gnu', '-isystem=/usr/riscv64-linux-gnu/include', '-Wextra', '-Werror', '-march=rv64imafdcv_zihintpause_zfh_zba_zbb_zbc_zbs_zicsr_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b', '-O3', '-mrvv-vector-bits=128']
cpp_args = ['--target=riscv64-linux-gnu', '-isystem=/usr/riscv64-linux-gnu/include', '-Wextra', '-Werror', '-march=rv64imafdcv_zihintpause_zfh_zba_zbb_zbc_zbs_zicsr_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b', '-O3', '-mrvv-vector-bits=128']
c_link_args = ['--target=riscv64-linux-gnu', '-static', '-static-libgcc']
cpp_link_args = ['--target=riscv64-linux-gnu', '-static', '-static-libgcc', '-static-libstdc++']

[host_machine]
system = 'linux'
cpu_family = 'riscv64'
cpu = 'thead-c906'
endian = 'little'

camel-cdr commented 6 months ago

Here's my meson setup --cross ... config for using https://packages.debian.org/unstable/clang-18 and running on the official Debian image:

Thanks it worked. I didn't know about the debian image, and was using the k230_sdk thingy.

Running for i in *native*; do ./$i; done in the arm/neon directory results in the following errors:

../test/arm/neon/fma_lane.c:1163: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-382857.250000 ~= 503169.843750)
test/arm/neon/fma_lane.cpp:1163: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-382857.250000 ~= 503169.843750)
../test/arm/neon/fms_lane.c:873: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-506717.906250 ~= 554470.187500)
test/arm/neon/fms_lane.cpp:873: assertion failed: r1[0] ~= simde_vld1q_f32(test_vec[i].r1)[0] (-506717.906250 ~= 554470.187500)
../test/arm/neon/mul_lane.c:865: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
test/arm/neon/mul_lane.cpp:865: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
../test/arm/neon/mulx_lane.c:315: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
test/arm/neon/mulx_lane.cpp:315: assertion failed: r[0] ~= simde_vld1q_f32(test_vec[i].r)[0] (132874.984375 ~= 347499.312500)
../test/arm/neon/qrdmlah.c:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
test/arm/neon/qrdmlah.cpp:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlsh.c:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
test/arm/neon/qrdmlsh.cpp:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmulh_lane.c:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
test/arm/neon/qrdmulh_lane.cpp:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
../test/arm/neon/uqadd.c:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
test/arm/neon/uqadd.cpp:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)

mr-c commented 6 months ago

@camel-cdr Thanks!

Yeah, I'm now also seeing those errors.

I wonder why the qemu setup in this PR isn't reproducing them?

I hope it isn't a hardware error! :-)

mr-c commented 6 months ago

Also, hello @eric900115 and @camel-cdr from the Debian Med Sprint in Berlin. Maybe you can join us in person next year? https://wiki.debian.org/Sprints/2023/DebianMed2024

camel-cdr commented 6 months ago

Sounds interesting, Berlin is only 2-3 hours away from me. But I'm not really involved with Debian (except for running it).

Btw, do you know how Debian deals with compiler bugs? I just ran into an gcc-13.2.0 codegen bug, that causes a valid program to not work vsetvli a5,a1,e8,m8,ta,ma should be vsetvli a5,a1,e8,m8,tu,ma. It's been fixed on trunk, but is this a thing that would be back-ported?

mr-c commented 6 months ago

Btw, do you know how Debian deals with compiler bugs? I just ran into an gcc-13.2.0 codegen bug, that causes a valid program to not work vsetvli a5,a1,e8,m8,ta,ma should be vsetvli a5,a1,e8,m8,tu,ma. It's been fixed on trunk, but is this a thing that would be back-ported?

I would personally respond positively to a reportbug gcc-13 with a link to the upstream fix, but I don't know that team so I can't make promises.

Sounds interesting, Berlin is only 2-3 hours away from me. But I'm not really involved with Debian (except for running it).

Anyone is welcome! We appreciate the user perspective!

eric900115 commented 6 months ago

@camel-cdr Thanks!

Yeah, I'm now also seeing those errors.

I wonder why the qemu setup in this PR isn't reproducing them?

I hope it isn't a hardware error! :-)

I am also wondering. I'll try to use qemu with same configuration for testing (testing with thread-c906 CPU).

mr-c commented 6 months ago

@camel-cdr Do you also get failures on the k230 with the current master branch of SIMDe?

I'm seeing failures in

arm/neon/{q,}abs/{emul,native}/{c,cpp}
arm/neon/qrdml{a,s}h_lane/{emul,native}/{c,cpp}
arm/neon/st3/{emul,native}/{c,cpp}

So I guess there are some clang and/or CPU errors .. ?

camel-cdr commented 6 months ago

@mr-c Yes, I get similar errors when testing master:

../test/arm/neon/abs.c:711: assertion failed: r[0] == simde_vld1q_s32(test_vec[i].r)[0] (0 == -2147483648)
test/arm/neon/abs.cpp:711: assertion failed: r[0] == simde_vld1q_s32(test_vec[i].r)[0] (0 == -2147483648)
timeout qabs-native-c
timeout qabs-native-cpp
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlah_lane.c:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
../test/arm/neon/qrdmlah_lane.c:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
../test/arm/neon/qrdmlah_lane.c:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
test/arm/neon/qrdmlah_lane.cpp:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
test/arm/neon/qrdmlah_lane.cpp:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmlsh_lane.c:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)

I'm somewhat inclined to believe it's a clang miss-compilation, because I had a gcc-13.2 miss-compilation yesterday, I suppose we need to investigate this somehow.

mr-c commented 6 months ago

Hey @camel-cdr ; in #1141 I fixed some of the NEON abs functions. Maybe you have time to re-run the tests?

camel-cdr commented 6 months ago

@mr-c Here we go, looks like the abs errors are gone, great work.

../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlah_lane.c:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
../test/arm/neon/qrdmlah_lane.c:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
../test/arm/neon/qrdmlah_lane.c:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:678: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (26528 == 18592)
test/arm/neon/qrdmlah_lane.cpp:901: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-1308 == 32767)
test/arm/neon/qrdmlah_lane.cpp:1128: assertion failed: r[0] == simde_vld1q_s16(test_vec[i].r)[0] (-13600 == 25250)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmlsh_lane.c:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:1128: assertion failed: r[3] == simde_vld1q_s16(test_vec[i].r)[3] (26847 == -32768)

eric900115 commented 6 months ago

I have modified mul_lane and mulx_lane. Hope the error in fms_lane, fma_lane, mul_lane, and mulx_lane will be eliminated.

OMaghiarIMG commented 6 months ago

Hello @eric900115, this is really good stuff. I have a question, you mentioned you converted all Neon intrinsics to RVV, does that exclude bf16 and cryptography instructions which may not be easily replicated with base V? I think trunk LLVM contains experimental intrinsics for Zvfbfwma and Vector crypto.

Is there anything you might need help with?

mr-c commented 6 months ago

Thanks @camel-cdr ; can you retest the latest?

camel-cdr commented 6 months ago

@mr-c the errors are still there, but the values are different now:

../test/arm/neon/qrdmlah.c:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
test/arm/neon/qrdmlah.cpp:193: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (17480 == 3378)
../test/arm/neon/qrdmlah_lane.c:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
test/arm/neon/qrdmlah_lane.cpp:475: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (-18972 == -13752)
../test/arm/neon/qrdmlsh.c:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
test/arm/neon/qrdmlsh.cpp:197: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (9556 == -32768)
../test/arm/neon/qrdmlsh_lane.c:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
test/arm/neon/qrdmlsh_lane.cpp:475: assertion failed: r[2] == simde_vld1_s16(test_vec[i].r)[2] (30372 == -32768)
../test/arm/neon/qrdmulh_lane.c:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
test/arm/neon/qrdmulh_lane.cpp:264: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r0)[0] (0 == 14610)
../test/arm/neon/uqadd.c:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)
test/arm/neon/uqadd.cpp:339: assertion failed: r[0] == simde_vld1_s16(test_vec[i].r)[0] (181 == 32767)

mr-c commented 6 months ago

@camel-cdr are those errors from the emul or native tests?

camel-cdr commented 6 months ago

@mr-c it was the native tests, I ran it via: for i in *native*; do ./$i; done > /dev/null

eric900115 commented 5 months ago

@OMaghiarIMG

Hi! Yes, we excluded BF16 and cryptography for conversion.

For the conversion from NEON to RVV, if the performance (instruction counts) of using single or multiple RVV intrinsics is better than automatic vectorization, then we use RVV intrinsics for implementation. Otherwise, we use loop automatic vectorization from SIMDe.

mr-c commented 5 months ago

Thank you @eric900115 ! Now that SIMDe 0.8.0 is released we can focus the next development cycle on RVV 1.0 implementations.

simd-everywhere / simde

Initial Support for the RISC-V Vector Extension in ARM NEON #1130