opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

OE xx: Modify universal intrinsics for size-less architectures #21829

Open hanliutong opened 2 years ago

hanliutong commented 2 years ago

This is a draft evolution proposal. We are going to work on RISC-V Vector acceleration by modifying the universal intrinsics, and we would like to make sure that the proposed design is compatible with other size-less architectures.

OE xx: Modify universal intrinsics for size-less architectures

Introduction and Rationale

Currently, OpenCV includes universal intrinsics that cover different SIMD extensions on different platforms, such as SSE, AVX2 and AVX512 on IA, NEON on ARM and VSX on PPC64. All of these architectures are designed with a fixed register size. Nowadays, there are also vector ISA extensions designed with size-less instructions, such as RISC-V Vector (a.k.a. RVV) and ARM SVE/SVE2. Taking RVV as an example, different devices that support the RVV extension may have different vector register sizes: any power of 2 between 128 bits and 65536 bits is legal for an RVV application processor. By contrast, every AVX2 device has a set of vector registers with a fixed size of 256 bits.

There is already a RISC-V Vector backend in the universal intrinsics, but its performance is very poor and it only supports a fixed vector size of 128 bits. The performance issue is suspected to come from redundant instructions introduced by the wrapper class. The fundamental problem is that the current design of the universal intrinsics may not be suitable for size-less architectures.

So it would be nice to modify the API and implementation of the universal intrinsics for size-less architectures. The modifications may affect existing code written with the universal intrinsics, but we hope the changes will be minor.

Proposed solution

The main modification is to directly use the vector built-in type (e.g. vfloat32m1_t in the RVV intrinsics) as the universal intrinsic type (e.g. v_float32) in the backends for size-less architectures, and to introduce a VectorTraits class that provides what are currently class members (nlanes and lane_type) of the wrapper classes.
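
A minimal sketch of this idea, under stated assumptions: std::array stands in for the native vector type so that the code compiles on any host (on an RVV target the alias would name vfloat32m1_t instead), and the names native_f32_vec, VTraits, nlanes() and v_reduce_sum_sketch are illustrative, not the API of the example project. The lane count is a function rather than a constant because on a scalable target it is only known at run time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// ASSUMPTION: stand-in for a native vector type. On a real RVV target this
// would be vfloat32m1_t; std::array is used only to keep the sketch portable.
using native_f32_vec = std::array<float, 4>;

// The universal intrinsic type is a plain alias, not a wrapper struct.
using v_float32 = native_f32_vec;

// Traits carry what used to be members of the wrapper class.
template <typename V> struct VTraits;

template <> struct VTraits<v_float32> {
    using lane_type = float;
    // Hypothetical: a function, since a scalable target only knows the
    // lane count at run time. The stand-in type always has 4 lanes.
    static std::size_t nlanes() { return 4; }
};

// Generic code queries the traits instead of V::nlanes / V::lane_type.
// Indexing with v[i] works only because of the std::array stand-in.
template <typename V>
typename VTraits<V>::lane_type v_reduce_sum_sketch(const V& v) {
    typename VTraits<V>::lane_type sum = 0;
    for (std::size_t i = 0; i < VTraits<V>::nlanes(); ++i) sum += v[i];
    return sum;
}
```

Because v_float32 is only an alias, the compiler sees the native type directly, which is the property the proposal relies on to avoid the wrapper overhead.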

An example project is in https://github.com/hanliutong/rvv-ui, particularly in intrin_rvv_new.hpp.

Impact on existing code, compatibility

Possible alternatives

Write the fixed-size universal intrinsics for each possible vector size using templates, and treat the minimum vector size as the baseline. In this case, the same instruction set (e.g. RVV) will have different binaries (RVV128, RVV256, etc.).
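
A rough sketch of what this alternative could look like, assuming a single template parameterised by register width. The names v_reg_sketch and SKETCH_VLEN_BITS are hypothetical, and the scalar loop only stands in for the real vector instructions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size vector, parameterised by register width in bits.
template <typename T, std::size_t Bits>
struct v_reg_sketch {
    static constexpr std::size_t nlanes = Bits / (8 * sizeof(T));
    T lanes[nlanes];
};

// One template implementation serves every width.
template <typename T, std::size_t Bits>
v_reg_sketch<T, Bits> v_add(const v_reg_sketch<T, Bits>& a,
                            const v_reg_sketch<T, Bits>& b) {
    v_reg_sketch<T, Bits> r{};
    for (std::size_t i = 0; i < v_reg_sketch<T, Bits>::nlanes; ++i)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// ASSUMPTION: the width is chosen at build time, with 128 bits as the
// baseline. Each chosen width yields a different binary.
#ifndef SKETCH_VLEN_BITS
#define SKETCH_VLEN_BITS 128
#endif
using v_float32 = v_reg_sketch<float, SKETCH_VLEN_BITS>;
```

The trade-off is visible in the macro: the lane count becomes a compile-time constant again, at the cost of one build per supported width.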

References

  1. universal intrinsics
  2. OE-27. Wide-Universal-Intrinsics
  3. Issue: About the performance of opencv of the sizeless instruction

vpisarev commented 2 years ago

@hanliutong, thanks, it looks pretty good!

some questions/comments:

  1. is the term size-less proper? shall we use variable-size instead?
  2. the idea of using std::initializer_list<> is very good! hopefully, it can be implemented efficiently
  3. it's still important to be able to store/load a vector register to/from memory. nlanes cannot be used for this, but there can be a compile-time max_nlanes constant that users can specify. See https://github.com/opencv/opencv/pull/20562. In other words, when OpenCV is compiled for vector architectures, we do not fix the vector size, but we set a maximum vector size that the binary will support. E.g. it can be 1024 bits (or 32 floats); then VTraits::max_nlanes will be set to 32 and users can use float foo[VTraits::max_nlanes] to allocate buffers on the stack to store registers.
  4. maybe it would also be convenient to have VTraits, e.g. VTraits::nlanes == VTraits::nlanes.
  5. in the "possible alternatives" section I think it should be clarified that the proposed implementation for fixed-size vectors should be a template implementation. In any case there should not be separate implementations (header files) for 128-bit, 256-bit etc. vectors; it should be a single header file.

vpisarev commented 2 years ago

@fpetrogalli, you would probably be interested in looking at this. Yes, progress is slow, but at least for RISC-V we seem to have found a proper compiler-friendly solution, which is to define aliases for our intrinsic types, typedef <native_intrinsic_vec_type> v_float32;, instead of wrapping them inside structures, struct v_float32 { <native_intrinsic_vec_type> val; };, which does not work properly. We will try to write a complete experimental backend for RVV 1.0 this summer; an SVE2 variant could be the next step.

alalek commented 2 years ago

v_float32x4

We still have a lot of code which assumes a particular SIMD width (SIMD128 in general). So we should emulate those types, skip the existing optimizations where they are not supported, or rewrite them.

Please note that external contributions bring SIMD128 code at best; learning and tuning variable-size SIMD is rocket science.

hanliutong commented 2 years ago

@vpisarev , thanks for your comments!

  1. is the term size-less proper? shall we use variable-size instead?

I think "size-less" is a term used in RVV, especially in the compiler field: when we try to wrap an RVV native type in a struct, clang reports error: field has sizeless type 'vfloat32m1_t' (aka '__rvv_float32m1_t'). But yes, I agree that "variable-size" or "scalable-size" describes the state of the register better and is easier to understand. So which one would you prefer?

  2. std::initializer_list<> and 3. max_nlanes

I'm going to try it on my experimental project.

  4. add VTraits

Did you mean VTraits<float>::nlanes == VTraits<v_float32>::nlanes? I ask because VTraits is a template class, so VTraits::nlanes may not work.

  5. "possible alternatives" section

Yes, it should be a template implementation, I'm going to update it.

hanliutong commented 2 years ago

I have already added vload(<scalar_0>, <scalar_1>, ..., <scalar_N>) and VTraits<type>::max_nlanes to my example project.
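
A portable sketch of how these two pieces could fit together, not the code from the example project. The stand-in type v_float32_sketch, the helper names v_load_sketch and v_store_sketch, the run-time lane count of 8 and the zero-fill behaviour for missing scalars are all assumptions made for illustration; the maximum of 32 lanes follows the 1024-bit example given earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: the binary supports at most 1024-bit registers (32 floats).
constexpr std::size_t kMaxLanesF32 = 32;

// Stand-in for a scalable vector register; a real backend would use the
// native type instead of an array.
struct v_float32_sketch {
    float lanes[kMaxLanesF32];
};

template <typename V> struct VTraits;
template <> struct VTraits<v_float32_sketch> {
    using lane_type = float;
    static constexpr std::size_t max_nlanes = kMaxLanesF32;  // compile time
    // Hypothetical: pretend the hardware has 256-bit registers (8 floats).
    static std::size_t nlanes() { return 8; }                // run time
};

// Build a vector from a list of scalars. ASSUMPTION: lanes without a
// matching scalar are zero, and extra scalars are ignored.
template <typename... Ts>
v_float32_sketch v_load_sketch(float first, Ts... rest) {
    const float vals[] = { first, static_cast<float>(rest)... };
    const std::size_t count = sizeof(vals) / sizeof(vals[0]);
    const std::size_t n = VTraits<v_float32_sketch>::nlanes();
    v_float32_sketch v{};  // zero-initialise all lanes
    for (std::size_t i = 0; i < count && i < n; ++i) v.lanes[i] = vals[i];
    return v;
}

// Store only the active lanes; dst must hold at least nlanes() elements,
// which a buffer of max_nlanes elements always does.
inline void v_store_sketch(float* dst, const v_float32_sketch& v) {
    const std::size_t n = VTraits<v_float32_sketch>::nlanes();
    for (std::size_t i = 0; i < n; ++i) dst[i] = v.lanes[i];
}
```

A caller can then declare float buf[VTraits<v_float32_sketch>::max_nlanes] on the stack and pass it to v_store_sketch without knowing the actual register width.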

I also found that for some associative binary operators, the code becomes verbose if we only support two arguments. For example, for v = v0 + v1 + v2:

// with v_add(<vec0>, <vec1>) only
v = v_add(v_add(v0, v1), v2);

vs.

// with v_add(<vec0>, <vec1>, ... , <vecN>) 
v = v_add(v0, v1, v2);

Therefore, I propose to add a new UI API for some operators (+ * & | etc.) to support multiple parameters.
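
One possible way to provide such an overload is a variadic template that folds onto the two-argument form. This is a sketch under assumptions, not the implementation in the linked pull request: std::array stands in for the native vector type, and the left-to-right folding order is my choice.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// ASSUMPTION: stand-in for the native vector type, for portability only.
using v_float32 = std::array<float, 4>;

// Existing two-argument form.
inline v_float32 v_add(const v_float32& a, const v_float32& b) {
    v_float32 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

// Hypothetical multi-argument form: fold left onto the two-argument form,
// so v_add(v0, v1, v2) expands to v_add(v_add(v0, v1), v2).
template <typename... Vs>
v_float32 v_add(const v_float32& a, const v_float32& b,
                const v_float32& c, const Vs&... rest) {
    return v_add(v_add(a, b), c, rest...);
}
```

Since the expansion happens at compile time, the multi-argument call should reduce to the same sequence of two-argument additions as the nested form.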

See https://github.com/hanliutong/rvv-ui/pull/1 for details.