Implement RISC-V Vector 1.0 kernels

camel-cdr commented 1 month ago

This PR adds full RVV vectorization to all kernels, excluding deprecated ones and volk_32f_s32f_power_32f and volk_32fc_s32f_power_32fc which don't have any vectorized kernels in for other architectures, presumably due to precision requirements.

All the tests pass in qemu with VLEN 128 up to 1024, and were additionally tested on all input sizes from 0 to 1000. They have been integrated into CI, with both clang and gcc.

I've attached the output of volk_profile running on the SpacemiT X60 cores of the Banana Pi BPI-F3 SOC: X60-gcc-14.txt. X60-clang-18.txt The average speedup across all kernels compared to the previously fastest one was 3.8x.

The code is written using RVV 1.0 intrinsics, following the frozen v0.12 spec, which is supported by gcc 14 and clang 18 and above. The build system was adjusted to detect support for RVV intrinsics, to make sure we don't break builds on older compilers.

I tried to maximize LMUL without causing register spills, while avoiding lane crossing permutations.

Segmented load/stores varies in performance a lot on current systems, compare the vsseg graph between the C910 and X60 . To get the best performance I didn't use it in the rvv target, but created an additional pseudo target rvvseg, with alternative implementations using segmented load/stores.

I didn't modify the existing kernels, except for the following cases:

volk_32u_popcntpuppet_32u.h and volk_64u_popcntpuppet_64u.h had a bug where they didn't compute anything.
volk_8u_conv_k7_r2puppet_8u.h created a lookup table with 256 entries once, but only used for a total of 64 lookups, so I replaced it with the direct calculation.

This should resolve https://github.com/gnuradio/volk/issues/772

drmpeg commented 1 month ago

Wow, I gotta get my BPI-F3 set up. I'll be testing with GNU Radio.

camel-cdr commented 1 month ago

Thanks to cloud-v.co I got access to a machine with C910 cores, which doesn't support RVV 1.0, but rather the XTheadVector custom extension based on the RVV 0.7.1 draft specification, with VLEN=128. gcc allows you to compile (some) RVV 1.0 intrinsics directly to XTheadVector, which allowed me to benchmark on the CPU as well. These numbers are interesting because the C910 is an dual issue out-of-order CPU compared to the in-order X60.

The XTheadVector target for gcc isn't perfect, so I couldn't get every kernel to compile properly, but here are the results from the once that worked:

                                  before   RVV speedup                                           before  RVV speedup                                before  RVV speedup
32f_s32f_convert_16i                4457   128 34.770   32f_reciprocal_32f                         1570  393 3.99455   64f_x2_min_64f                 1760 1070 1.64526
32f_sqrt_32f                       12923   381 33.843   32f_x2_min_32f                             1702  427 3.97885   32fc_magnitude_32f             1574 1000 1.57403
32f_log2_32f                       14068   469 29.972   16ic_s32f_magnitude_32f                    1651  509 3.2431    64f_x2_max_64f                 1768 1162 1.52126
32f_s32f_convertpuppet_8u           4057   152 26.646   32i_s32f_convert_32f                        412  130 3.1648    32f_index_max_16u               138   90 1.51793
32f_expfast_32f                     3141   127 24.621   32f_sin_32f                                6903 2183 3.16171   32f_s32f_multiply_32f           265  206 1.28825
32f_s32f_convert_8i                 3235   138 23.365   32fc_s32f_magnitude_16i                    3016  954 3.16168   16ic_deinterleave_16i_x2        588  515 1.14099
16i_s32f_convert_32f                1509    69 21.566   32f_x2_powpuppet_32f                       9276 2957 3.13613   32fc_x2_divide_32fc            3177 2832 1.1216
32fc_convert_16ic                   8910   471 18.909   32f_cos_32f                                6726 2191 3.06938   32f_x2_add_32f                  407  395 1.03269
16ic_magnitude_16i                 10076   574 17.531   16ic_s32f_deinterleave_32f_x2              3048 1006 3.02943   32f_x2_multiply_32f             416  403 1.03164
32f_s32f_convert_32i                2096   130 16.024   8ic_deinterleave_real_16i                   266   88 2.99781   32f_x2_subtract_32f             422  451 0.935161
32f_s32f_clamppuppet_32f            1845   123 14.889   32f_s32f_add_32f                            273   99 2.75712   32fc_deinterleave_64f_x2       1786 2032 0.878856
8ic_x2_multiply_conjugate_16ic      2224   234 9.4952   32f_x2_divide_32f                          1575  586 2.68736   32fc_conjugate_32fc             412  483 0.853893
32f_exp_32f                         7848   857 9.1584   16ic_deinterleave_real_16i                  269  103 2.59342   32fc_index_max_32u              787  949 0.829749
32f_asin_32f                       17823  2289 7.7839   32f_index_min_32u                           290  115 2.52049   64f_x2_add_64f                  694 1017 0.682631
32f_acos_32f                       18806  2476 7.5945   32f_s32f_normalize                          311  127 2.44517   32fc_deinterleave_real_32f      279  430 0.649423
32f_8u_polarbutterflypuppet_32f   107451 14240 7.5455   32f_index_max_32u                           281  115 2.44065   64f_x2_multiply_64f             705 1107 0.637096
32f_tan_32f                        15314  2428 6.3061   32fc_x2_s32fc_multiply_conjugate_add2_32fc 1172  520 2.25319   32i_x2_and_32i                  265  426 0.623177
8ic_deinterleave_real_8i             271    52 5.1638   32f_x2_fm_detectpuppet_32f                  959  426 2.24964   32fc_x2_add_32fc                707 1175 0.602147
16i_convert_8i                       262    52 4.9657   32f_convert_64f                             278  126 2.19516   32i_x2_or_32i                   261  442 0.590342
32f_s32f_stddev_32f                  393    84 4.6301   16ic_s32f_deinterleave_real_32f             397  185 2.14181   64f_convert_32f                 266  454 0.586313
32f_x2_s32f_interleave_16ic         3825   862 4.4377   32fc_index_max_16u                          458  218 2.10224   32fc_deinterleave_imag_32f      267  484 0.55243
32f_stddev_and_mean_32f_x2           839   199 4.2042   32fc_s32f_deinterleave_real_16i            1968  938 2.09685   32fc_deinterleave_32f_x2        920 1724 0.533851
8ic_deinterleave_16i_x2              527   125 4.1961   16ic_deinterleave_real_8i                   262  125 2.08729   32f_64f_multiply_64f            445  847 0.526016
16ic_convert_32fc                    524   126 4.1332   32f_binary_slicer_32i                       266  131 2.02717   32fc_deinterleave_real_64f      266  538 0.495517
32fc_s32f_atan2_32f                 8978  2224 4.0358   32f_atan_32f                               1557  829 1.87756   32fc_magnitude_squared_32f      399  868 0.459629
32f_x2_max_32f                      1703   422 4.0325   32f_index_min_16u                           171   91 1.87702   32fc_x2_multiply_conjugate_32fc 751 2453 0.306232
8ic_x2_s32f_multiply_conjugate_32fc 1820   454 4.0008   32f_x2_interleave_32fc                      432  231 1.86693   32fc_32f_multiply_32fc          533 1795 0.297167
32f_tanh_32f                        1573   393 3.9980   8u_x3_encodepolarpuppet_8u                 3318 1803 1.83999   32fc_x2_multiply_32fc           737 2761 0.266857

Keep in mind that the codegen for XTheadVector is often worse than for RVV 1.0, because additional instructions need to be inserted. Still these speedups looks quite good, some things are a lot faster, and a few are a lot slower than scalar, but that's what volk_profile is fore. Interesting to note is that the rvvseg variant was, as expected, always slower than the rvv variants, because the segmented load/store implementation is quite bad on the processor.

Edit: The latest force push fixed register spills in two kernels

camel-cdr commented 4 weeks ago

I ran clang-format on all changed files, so the formatting issues should hopeful be fixed now.

jdemel commented 4 weeks ago

I ran clang-format on all changed files, so the formatting issues should hopeful be fixed now.

Thanks! Also, I was doing a review and got an "outdated file" (or smth similar) error at some point. Was already worried that comments got lost. Luckily they aren't. Thanks for taking care of the formatting stuff.

jdemel commented 4 weeks ago

You probably need to rebase your PR on the latest main. The CI seems to have received a few changes that need fixing to make it work again.

drmpeg commented 3 weeks ago

I got my BPI-F3 running today. I'm using Bianbu 2.0.1, but I've ripped out the desktop and just using it headless. Here's my volk_profile results.

BPI-F3-gcc-14.txt

Should have GNU Radio running tomorrow.

drmpeg commented 3 weeks ago

Tested with GNU Radio. Works perfectly for at least these kernels:

volk_32fc_32f_multiply_32fc
volk_32fc_x2_add_32fc
volk_32f_s32f_multiply_32f
volk_32fc_magnitude_32f
volk_32fc_s32fc_multiply2_32fc
volk_32fc_s32fc_multiply_32fc
volk_32f_x2_subtract_32f
volk_32fc_x2_multiply_conjugate_32fc
volk_32f_x2_add_32f
volk_32fc_x2_multiply_32fc

jdemel commented 3 weeks ago

Thanks for this huge contribution! I'll do a release soon. I hope you don't mind that I mention your name (as it appears in the sign-off) in the release changelog.

camel-cdr commented 3 weeks ago

Thanks for this huge contribution! I'll do a release soon. I hope you don't mind that I mention your name (as it appears in the sign-off) in the release changelog.

Yeah, that be great.

gnuradio / volk

Implement RISC-V Vector 1.0 kernels #774