[RISC-V] Worse performance with interleaved input buffer (RVV 0.7.1)

I found this issue while experimenting with Halide on the Allwinner D1 RISC-V chip, which supports only RVV 0.7.1. See https://github.com/halide/Halide/discussions/7252 for background; the full steps to reproduce are below:

* LLVM: https://github.com/dkurt/llvm-rvv-071/tree/rvv-071 (based on the release/16.x branch)
* Halide: https://github.com/halide/Halide/commit/7963cd4e3c23856b82567c99e0a3d16035ffe895 with the following patch, which disables vle64.v and vse64.v (and skips building the autoschedulers):

patch

```patch
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 4f4b8e532..8f401c442 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -540,7 +540,7 @@ endif ()
 
 if (BUILD_SHARED_LIBS)
     message(STATUS "Building autoschedulers enabled")
-    add_subdirectory(autoschedulers)
+    # add_subdirectory(autoschedulers)
 else ()
     message(STATUS "Building autoschedulers disabled (static Halide)")
 endif ()
diff --git a/src/CodeGen_RISCV.cpp b/src/CodeGen_RISCV.cpp
index ba9abe04d..454558d11 100644
--- a/src/CodeGen_RISCV.cpp
+++ b/src/CodeGen_RISCV.cpp
@@ -151,6 +151,7 @@ string CodeGen_RISCV::mattrs() const {
             arch_flags += ",+zvl" + std::to_string(target.vector_bits) + "b";
         }
 #endif
+        arch_flags += ",-zve64x";
     }
     return arch_flags;
 }
```

Compile and run the following application, which generates the AOT-compiled algorithm:

main.cpp

```cpp
#include <Halide.h>

#include <iostream>
#include <vector>

using namespace Halide;

const int width = 1920;
const int height = 1080;

int main(int argc, char** argv) {
    Func f("bgr2gray");
    Var x("x"), y("y"), c("c");

    uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;

    bool interleaved = true;
    bool rvv = true;

    // Element type assumed to be uint16_t (it matches the uint16_t scale
    // constants and the vector_bits computation below).
    Buffer<uint16_t> input = interleaved ? Buffer<uint16_t>::make_interleaved(width, height, 3)
                                         : Buffer<uint16_t>(width, height, 3);

    if (interleaved && rvv) {
        Buffer<uint16_t> scales(3);
        scales(0) = R2GRAY;
        scales(1) = G2GRAY;
        scales(2) = B2GRAY;

        // RDom helps prevent adding vl4r.v instructions
        RDom r(0, 3);
        Expr res = sum(input(x, y, r) * scales(r)) >> 8;
        f(x, y) = res;
    } else {
        Expr r = input(x, y, 0);
        Expr g = input(x, y, 1);
        Expr b = input(x, y, 2);
        Expr res = (R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8;
        f(x, y) = res;
    }

    f.bound(x, 0, width).bound(y, 0, height);
    if (rvv)
        f.vectorize(x, 8);

    // Compile
    Target target;
    target.os = Target::OS::Linux;
    target.arch = Target::Arch::RISCV;
    target.bits = 64;
    if (rvv)
        target.vector_bits = 8 * sizeof(uint16_t) * 8;  // 8 lanes of uint16_t = 128 bits

    std::vector<Target::Feature> features;
    if (rvv)
        features.push_back(Target::RVV);
    features.push_back(Target::NoAsserts);
    target.set_features(features);

    std::cout << target << std::endl;
    f.print_loop_nest();

    try {
        f.compile_to_header("bgr2gray.h", {input}, "bgr2gray", target);
        f.compile_to_assembly("bgr2gray.s", {input}, "bgr2gray", target);
    } catch (Halide::InternalError& ex) {
        std::cout << ex.what() << std::endl;
    } catch (Halide::CompileError& ex) {
        std::cout << ex.what() << std::endl;
    }
    return 0;
}
```

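RVV code for the planar input buffer performs well, but the interleaved input has a performance problem: vectorizing it makes it about 2.5x slower than the scalar version (92 ms vs. 36 ms). I am not sure whether the RDom is the cause, but I use it to avoid the whole-register load/store instructions (such as vl4r.v) that exist only in RVV 1.0. I am attaching the generated assembly in case it is useful; I do not fully understand it, which is why I am asking for help here.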

| 1920x1080 input | Median time | Assembly |
|---|---|---|
| Planar input, no RVV | 41 ms | planar.s |
| Interleaved input, no RVV | 36 ms | interleaved.s |
| Planar input, `vectorize(x, 8)` | 13 ms | planar_rvv.s |
| Interleaved input, `vectorize(x, 8)` | 92 ms | interleaved_rvv.s |
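For readers unfamiliar with the two layouts, here is a minimal plain C++ sketch (no Halide required) of how each one is indexed and of the scalar computation the pipeline performs. The helper names `planar_index`, `interleaved_index`, and `to_gray` are hypothetical, and the `uint16_t` element type is an assumption. The point it illustrates is that with an interleaved buffer, neighbouring `x` values are 3 elements apart, so 8 lanes vectorized over `x` cannot be filled from one contiguous load. That may be related to the slowdown, but this sketch does not prove it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only; they are not part of Halide.
// The uint16_t element type is an assumption.

// Planar: each channel is a full width*height plane, so x has stride 1.
inline std::size_t planar_index(int x, int y, int c, int width, int height) {
    return (static_cast<std::size_t>(c) * height + y) * width + x;
}

// Interleaved: the 3 channels of a pixel are adjacent, so x has stride 3.
inline std::size_t interleaved_index(int x, int y, int c, int width) {
    return (static_cast<std::size_t>(y) * width + x) * 3 + c;
}

// Scalar reference for the weighted channel sum. The sum is truncated to
// 16 bits before the shift to mimic uint16_t arithmetic.
std::vector<uint16_t> to_gray(const std::vector<uint16_t>& in, int width,
                              int height, bool interleaved) {
    static const uint16_t scales[3] = {77, 150, 29};
    std::vector<uint16_t> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint32_t acc = 0;
            for (int c = 0; c < 3; ++c) {
                std::size_t i = interleaved
                                    ? interleaved_index(x, y, c, width)
                                    : planar_index(x, y, c, width, height);
                acc += static_cast<uint32_t>(in[i]) * scales[c];
            }
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<uint16_t>(static_cast<uint16_t>(acc) >> 8);
        }
    }
    return out;
}
```

Filling a planar and an interleaved buffer with the same pixel values and comparing the two `to_gray` results gives a simple correctness check for the generated `bgr2gray` code, independent of the schedule.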

References: RVV 0.7.1 spec, RVV 1.0 spec

halide/Halide issue #7360