Closed camel-cdr closed 3 weeks ago
Wow, I gotta get my BPI-F3 set up. I'll be testing with GNU Radio.
Thanks to cloud-v.co I got access to a machine with C910 cores, which doesn't support RVV 1.0, but rather the XTheadVector custom extension based on the RVV 0.7.1 draft specification, with VLEN=128. gcc allows you to compile (some) RVV 1.0 intrinsics directly to XTheadVector, which allowed me to benchmark on the CPU as well. These numbers are interesting because the C910 is an dual issue out-of-order CPU compared to the in-order X60.
The XTheadVector target for gcc isn't perfect, so I couldn't get every kernel to compile properly, but here are the results from the once that worked:
before RVV speedup before RVV speedup before RVV speedup
32f_s32f_convert_16i 4457 128 34.770 32f_reciprocal_32f 1570 393 3.99455 64f_x2_min_64f 1760 1070 1.64526
32f_sqrt_32f 12923 381 33.843 32f_x2_min_32f 1702 427 3.97885 32fc_magnitude_32f 1574 1000 1.57403
32f_log2_32f 14068 469 29.972 16ic_s32f_magnitude_32f 1651 509 3.2431 64f_x2_max_64f 1768 1162 1.52126
32f_s32f_convertpuppet_8u 4057 152 26.646 32i_s32f_convert_32f 412 130 3.1648 32f_index_max_16u 138 90 1.51793
32f_expfast_32f 3141 127 24.621 32f_sin_32f 6903 2183 3.16171 32f_s32f_multiply_32f 265 206 1.28825
32f_s32f_convert_8i 3235 138 23.365 32fc_s32f_magnitude_16i 3016 954 3.16168 16ic_deinterleave_16i_x2 588 515 1.14099
16i_s32f_convert_32f 1509 69 21.566 32f_x2_powpuppet_32f 9276 2957 3.13613 32fc_x2_divide_32fc 3177 2832 1.1216
32fc_convert_16ic 8910 471 18.909 32f_cos_32f 6726 2191 3.06938 32f_x2_add_32f 407 395 1.03269
16ic_magnitude_16i 10076 574 17.531 16ic_s32f_deinterleave_32f_x2 3048 1006 3.02943 32f_x2_multiply_32f 416 403 1.03164
32f_s32f_convert_32i 2096 130 16.024 8ic_deinterleave_real_16i 266 88 2.99781 32f_x2_subtract_32f 422 451 0.935161
32f_s32f_clamppuppet_32f 1845 123 14.889 32f_s32f_add_32f 273 99 2.75712 32fc_deinterleave_64f_x2 1786 2032 0.878856
8ic_x2_multiply_conjugate_16ic 2224 234 9.4952 32f_x2_divide_32f 1575 586 2.68736 32fc_conjugate_32fc 412 483 0.853893
32f_exp_32f 7848 857 9.1584 16ic_deinterleave_real_16i 269 103 2.59342 32fc_index_max_32u 787 949 0.829749
32f_asin_32f 17823 2289 7.7839 32f_index_min_32u 290 115 2.52049 64f_x2_add_64f 694 1017 0.682631
32f_acos_32f 18806 2476 7.5945 32f_s32f_normalize 311 127 2.44517 32fc_deinterleave_real_32f 279 430 0.649423
32f_8u_polarbutterflypuppet_32f 107451 14240 7.5455 32f_index_max_32u 281 115 2.44065 64f_x2_multiply_64f 705 1107 0.637096
32f_tan_32f 15314 2428 6.3061 32fc_x2_s32fc_multiply_conjugate_add2_32fc 1172 520 2.25319 32i_x2_and_32i 265 426 0.623177
8ic_deinterleave_real_8i 271 52 5.1638 32f_x2_fm_detectpuppet_32f 959 426 2.24964 32fc_x2_add_32fc 707 1175 0.602147
16i_convert_8i 262 52 4.9657 32f_convert_64f 278 126 2.19516 32i_x2_or_32i 261 442 0.590342
32f_s32f_stddev_32f 393 84 4.6301 16ic_s32f_deinterleave_real_32f 397 185 2.14181 64f_convert_32f 266 454 0.586313
32f_x2_s32f_interleave_16ic 3825 862 4.4377 32fc_index_max_16u 458 218 2.10224 32fc_deinterleave_imag_32f 267 484 0.55243
32f_stddev_and_mean_32f_x2 839 199 4.2042 32fc_s32f_deinterleave_real_16i 1968 938 2.09685 32fc_deinterleave_32f_x2 920 1724 0.533851
8ic_deinterleave_16i_x2 527 125 4.1961 16ic_deinterleave_real_8i 262 125 2.08729 32f_64f_multiply_64f 445 847 0.526016
16ic_convert_32fc 524 126 4.1332 32f_binary_slicer_32i 266 131 2.02717 32fc_deinterleave_real_64f 266 538 0.495517
32fc_s32f_atan2_32f 8978 2224 4.0358 32f_atan_32f 1557 829 1.87756 32fc_magnitude_squared_32f 399 868 0.459629
32f_x2_max_32f 1703 422 4.0325 32f_index_min_16u 171 91 1.87702 32fc_x2_multiply_conjugate_32fc 751 2453 0.306232
8ic_x2_s32f_multiply_conjugate_32fc 1820 454 4.0008 32f_x2_interleave_32fc 432 231 1.86693 32fc_32f_multiply_32fc 533 1795 0.297167
32f_tanh_32f 1573 393 3.9980 8u_x3_encodepolarpuppet_8u 3318 1803 1.83999 32fc_x2_multiply_32fc 737 2761 0.266857
Keep in mind that the codegen for XTheadVector is often worse than for RVV 1.0, because additional instructions need to be inserted.
Still these speedups looks quite good, some things are a lot faster, and a few are a lot slower than scalar, but that's what volk_profile
is fore.
Interesting to note is that the rvvseg
variant was, as expected, always slower than the rvv
variants, because the segmented load/store implementation is quite bad on the processor.
Edit: The latest force push fixed register spills in two kernels
I ran clang-format on all changed files, so the formatting issues should hopeful be fixed now.
I ran clang-format on all changed files, so the formatting issues should hopeful be fixed now.
Thanks! Also, I was doing a review and got an "outdated file" (or smth similar) error at some point. Was already worried that comments got lost. Luckily they aren't. Thanks for taking care of the formatting stuff.
You probably need to rebase your PR on the latest main
. The CI seems to have received a few changes that need fixing to make it work again.
I got my BPI-F3 running today. I'm using Bianbu 2.0.1, but I've ripped out the desktop and just using it headless. Here's my volk_profile results.
Should have GNU Radio running tomorrow.
Tested with GNU Radio. Works perfectly for at least these kernels:
volk_32fc_32f_multiply_32fc
volk_32fc_x2_add_32fc
volk_32f_s32f_multiply_32f
volk_32fc_magnitude_32f
volk_32fc_s32fc_multiply2_32fc
volk_32fc_s32fc_multiply_32fc
volk_32f_x2_subtract_32f
volk_32fc_x2_multiply_conjugate_32fc
volk_32f_x2_add_32f
volk_32fc_x2_multiply_32fc
Thanks for this huge contribution! I'll do a release soon. I hope you don't mind that I mention your name (as it appears in the sign-off) in the release changelog.
Thanks for this huge contribution! I'll do a release soon. I hope you don't mind that I mention your name (as it appears in the sign-off) in the release changelog.
Yeah, that be great.
This PR adds full RVV vectorization to all kernels, excluding deprecated ones and
volk_32f_s32f_power_32f
andvolk_32fc_s32f_power_32fc
which don't have any vectorized kernels in for other architectures, presumably due to precision requirements.All the tests pass in qemu with VLEN 128 up to 1024, and were additionally tested on all input sizes from 0 to 1000. They have been integrated into CI, with both clang and gcc.
I've attached the output of volk_profile running on the SpacemiT X60 cores of the Banana Pi BPI-F3 SOC: X60-gcc-14.txt. X60-clang-18.txt The average speedup across all kernels compared to the previously fastest one was 3.8x.
The code is written using RVV 1.0 intrinsics, following the frozen v0.12 spec, which is supported by gcc 14 and clang 18 and above. The build system was adjusted to detect support for RVV intrinsics, to make sure we don't break builds on older compilers.
I tried to maximize LMUL without causing register spills, while avoiding lane crossing permutations.
Segmented load/stores varies in performance a lot on current systems, compare the vsseg graph between the C910 and X60 . To get the best performance I didn't use it in the rvv target, but created an additional pseudo target rvvseg, with alternative implementations using segmented load/stores.
I didn't modify the existing kernels, except for the following cases:
volk_32u_popcntpuppet_32u.h
andvolk_64u_popcntpuppet_64u.h
had a bug where they didn't compute anything.volk_8u_conv_k7_r2puppet_8u.h
created a lookup table with 256 entries once, but only used for a total of 64 lookups, so I replaced it with the direct calculation.This should resolve https://github.com/gnuradio/volk/issues/772