halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

[RISC-V] Worse performance with interleaved input buffer (RVV 0.7.1) #7360

Closed dkurt closed 1 year ago

dkurt commented 1 year ago

This issue was found while experimenting with Halide on the AllWinner D1 RISC-V chip, which supports only RVV 0.7.1. See https://github.com/halide/Halide/discussions/7252 for details; the full steps to reproduce are:

  1. LLVM https://github.com/dkurt/llvm-rvv-071/tree/rvv-071 (based on releases/16.x branch)

  2. Halide https://github.com/halide/Halide/commit/7963cd4e3c23856b82567c99e0a3d16035ffe895 with a patch to disable vle64.v and vse64.v:

patch

```patch
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 4f4b8e532..8f401c442 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -540,7 +540,7 @@ endif ()
 if (BUILD_SHARED_LIBS)
     message(STATUS "Building autoschedulers enabled")
-    add_subdirectory(autoschedulers)
+    # add_subdirectory(autoschedulers)
 else ()
     message(STATUS "Building autoschedulers disabled (static Halide)")
 endif ()
diff --git a/src/CodeGen_RISCV.cpp b/src/CodeGen_RISCV.cpp
index ba9abe04d..454558d11 100644
--- a/src/CodeGen_RISCV.cpp
+++ b/src/CodeGen_RISCV.cpp
@@ -151,6 +151,7 @@ string CodeGen_RISCV::mattrs() const {
             arch_flags += ",+zvl" + std::to_string(target.vector_bits) + "b";
         }
 #endif
+        arch_flags += ",-zve64x";
     }
     return arch_flags;
 }
```

  3. Compile and run the application that AOT-compiles the algorithm:

main.cpp

```cpp
#include <Halide.h>

#include <cstdint>
#include <iostream>
#include <vector>

using namespace Halide;

const int width = 1920;
const int height = 1080;

int main(int argc, char** argv) {
    Func f("bgr2gray");
    Var x("x"), y("y"), c("c");

    uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;

    bool interleaved = true;
    bool rvv = true;

    // Element type uint16_t is an assumption; the original template
    // arguments of Buffer did not survive in this text.
    Buffer<uint16_t> input = interleaved ? Buffer<uint16_t>::make_interleaved(width, height, 3)
                                         : Buffer<uint16_t>(width, height, 3);

    if (interleaved && rvv) {
        Buffer<uint16_t> scales(3);
        scales(0) = R2GRAY;
        scales(1) = G2GRAY;
        scales(2) = B2GRAY;

        // RDom helps prevent adding vl4r.v instructions
        RDom r(0, 3);
        Expr res = sum(input(x, y, r) * scales(r)) >> 8;
        f(x, y) = res;
    } else {
        Expr r = input(x, y, 0);
        Expr g = input(x, y, 1);
        Expr b = input(x, y, 2);
        Expr res = (R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8;
        f(x, y) = res;
    }

    f.bound(x, 0, width).bound(y, 0, height);
    if (rvv)
        f.vectorize(x, 8);

    // Compile
    Target target;
    target.os = Target::OS::Linux;
    target.arch = Target::Arch::RISCV;
    target.bits = 64;
    if (rvv)
        target.vector_bits = 8 * sizeof(uint16_t) * 8;  // 8 lanes of 16 bits = 128 bits

    std::vector<Target::Feature> features;
    if (rvv)
        features.push_back(Target::RVV);
    features.push_back(Target::NoAsserts);
    target.set_features(features);

    std::cout << target << std::endl;
    f.print_loop_nest();

    try {
        f.compile_to_header("bgr2gray.h", {input}, "bgr2gray", target);
        f.compile_to_assembly("bgr2gray.s", {input}, "bgr2gray", target);
    } catch (Halide::InternalError& ex) {
        std::cout << ex.what() << std::endl;
    } catch (Halide::CompileError& ex) {
        std::cout << ex.what() << std::endl;
    }
    return 0;
}
```

While RVV performance with the planar input buffer is fine, there is a performance issue with the interleaved input. I am not sure whether the issue is related to RDom, but I use it to avoid the whole-register load/store instructions from RVV 1.0. I am attaching the generated assembly in case it is useful (I don't understand it, which is why I am asking for help here).

| 1920x1080 input | Median time | Assembly |
|---|---|---|
| Planar input, no RVV | 41ms | planar.s |
| Interleaved input, no RVV | 36ms | interleaved.s |
| Planar input, vectorize(x, 8) | 13ms | planar_rvv.s |
| Interleaved input, vectorize(x, 8) | 92ms | interleaved_rvv.s |
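
To make the two layouts in the table concrete, here is a minimal plain C++ sketch (no Halide) of how the same pixel is addressed in a planar and in an interleaved buffer, plus a scalar reference for the grayscale formula that could be used to check the output of the AOT pipeline. The helper names (`planar_index`, `interleaved_index`, `gray_from_planar`, `gray_from_interleaved`) are hypothetical and not part of Halide. It assumes sample values in 0..255 stored as `uint16_t` and the channel order used in main.cpp (channel 0 is multiplied by R2GRAY).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Weights from main.cpp; they sum to 256, which is why the result is shifted by 8.
constexpr std::uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;

// Planar: three full width*height planes stored one after another.
inline std::size_t planar_index(int x, int y, int c, int width, int height) {
    return (static_cast<std::size_t>(c) * height + y) * width + x;
}

// Interleaved: the channels of one pixel are adjacent in memory.
inline std::size_t interleaved_index(int x, int y, int c, int width, int channels) {
    return (static_cast<std::size_t>(y) * width + x) * channels + c;
}

// Scalar reference for the planar layout (hypothetical helper).
std::vector<std::uint16_t> gray_from_planar(const std::vector<std::uint16_t>& in,
                                            int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height * 3);
    std::vector<std::uint16_t> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // 32-bit accumulation so the sketch never depends on 16-bit wraparound.
            std::uint32_t r = in[planar_index(x, y, 0, width, height)];
            std::uint32_t g = in[planar_index(x, y, 1, width, height)];
            std::uint32_t b = in[planar_index(x, y, 2, width, height)];
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint16_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
        }
    }
    return out;
}

// Scalar reference for the interleaved layout (hypothetical helper).
std::vector<std::uint16_t> gray_from_interleaved(const std::vector<std::uint16_t>& in,
                                                 int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height * 3);
    std::vector<std::uint16_t> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint32_t r = in[interleaved_index(x, y, 0, width, 3)];
            std::uint32_t g = in[interleaved_index(x, y, 1, width, 3)];
            std::uint32_t b = in[interleaved_index(x, y, 2, width, 3)];
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint16_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
        }
    }
    return out;
}
```

For a fixed channel, neighbouring `x` positions are 1 element apart in the planar layout and 3 elements apart in the interleaved one, so the 8 lanes of `vectorize(x, 8)` are contiguous only in the planar case. Whether these strided accesses are what causes the slowdown above is not established here; the generated assembly would have to confirm it.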

RVV 0.7.1 spec RVV 1.0 spec

dkurt commented 1 year ago

See https://github.com/YADRO-KNS/halide_riscv/pull/1. We were able to reduce the compute time from 93ms to 30ms. It probably needs one more step for better optimization.