OE xx: Modify universal intrinsics for size-less architectures

hanliutong commented 2 years ago

This is a draft evolution proposal. We are going to work on RISC-V Vector acceleration by modifying the universal intrinsics, and we'd like to make sure that the proposed design is compatible with other size-less architectures.

OE xx: Modify universal intrinsics for size-less architectures

Author: Liutong HAN, Vadim Pisarevsky
Link: TBD
Status: Draft
Platforms: All vector architectures (RISC-V Vector, ARM SVE2, etc.)
Complexity: TBD. Maybe a few man-months

Introduction and Rationale

Currently, OpenCV includes universal intrinsics that cover different SIMD extensions on different platforms, such as SSE, AVX2 and AVX512 on IA, NEON on ARM and VSX on PPC64. All of these architectures are designed with a fixed register size. There are now also vector ISA extensions designed around size-less instructions, such as RISC-V Vector (aka RVV) and ARM SVE/SVE2. Taking RVV as an example, different devices that support the RVV extension may have different vector register sizes: any power of 2 between 128 bits and 65536 bits is legal for an RVV application processor. By contrast, every AVX2 device has a set of vector registers with a fixed size of 256 bits.

There is already a RISC-V Vector backend in the universal intrinsics, but its performance is very poor and it only supports a fixed vector size of 128 bits. The performance issue is suspected to come from redundant instructions introduced by the wrapper class. The fundamental problem is that the current design of the universal intrinsics may not be suitable for size-less architectures.

So it would be nice to modify the API and implementation of the universal intrinsics for size-less architectures. The modifications may affect existing code written with the universal intrinsics, but we hope the changes will be minor.

Proposed solution

The main modification is to directly use the vector built-in type (e.g. vfloat32m1_t in the RVV intrinsics) as the universal intrinsic type (e.g. v_float32) in the backend for size-less architectures, and to introduce the VectorTraits class (written VTraits below) to provide what are class members (nlanes and lane_type) in the current wrapper classes.
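A minimal sketch of this idea, written so that it compiles on any host. The type native_f32_vec is a hypothetical stand-in for a real built-in such as vfloat32m1_t, the lane count of 4 is hard-coded for illustration, and bytes_per_vector is an invented helper; none of this is the layout used in intrin_rvv_new.hpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side stand-in for a compiler built-in vector type
// such as vfloat32m1_t. It exists only so the sketch compiles without RVV.
struct native_f32_vec { float lanes[4]; };

// Proposed style: the universal intrinsic type is a plain alias,
// not a wrapper struct around the native type.
typedef native_f32_vec v_float32;

// Traits carry what used to be members of the wrapper class.
template <typename VecT> struct VTraits;

template <> struct VTraits<v_float32> {
    typedef float lane_type;
    static int nlanes;  // a static variable, not a compile-time constant
};
// On real hardware this would be derived from the vector length at run time;
// here it is fixed to match the stand-in type (assumption).
int VTraits<v_float32>::nlanes = 4;

// Generic code reads the lane count through the traits, never from the type.
inline std::size_t bytes_per_vector() {
    assert(VTraits<v_float32>::nlanes > 0);
    return static_cast<std::size_t>(VTraits<v_float32>::nlanes) *
           sizeof(VTraits<v_float32>::lane_type);
}
```

Keeping nlanes outside the type is what allows v_float32 to remain a bare alias of the native type.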

An example project is in https://github.com/hanliutong/rvv-ui, particularly in intrin_rvv_new.hpp.

Impact on existing code, compatibility

Vector type: For the size-less intrinsics, we will use v_float32 etc. as the vector type. Current vector types such as v_float32x4 or v_float32x8 can be renamed to v_float32.
Intrinsic functions: Since there is no wrapper class for the vector type, we cannot overload operators anymore. So we may need to introduce new universal intrinsic functions in place of the overloaded operators; v0 + v1 should be rewritten as v_add(v0, v1).
Scalars-to-vector constructors: The current Universal Intrinsic class uses scalars-to-vector constructors to build a SIMD vector, like v_float32x4(float v0, float v1, float v2, float v3). But for a scalable vector, we do not know the number of scalars in the vector, and without a wrapper class there are no constructors. So we may also need to introduce a universal intrinsic function overload, such as v_load(std::initializer_list<float> nScalars).
Member functions: Again, there is no class, so a new universal intrinsic function like v_get0(v1) can be introduced instead of v1.get0().
lane_type: VTraits<v_float32>::lane_type instead of v_float32::lane_type
nlanes: VTraits<v_float32>::nlanes instead of v_float32::nlanes. A related change is that the new nlanes will be a static variable rather than a constant, so using nlanes in an array declaration or initialization, like float foo[v_float32::nlanes], is no longer possible with the size-less universal intrinsics.
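The changes listed above can be sketched together as follows. This is an illustration under stated assumptions, not the proposed implementation: native_f32_vec is a hypothetical 4-lane stand-in for a native scalable type, zero-filling the lanes not covered by the initializer list is a choice made only for this sketch, and demo is an invented usage function.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical host-side stand-in for a native scalable vector type.
struct native_f32_vec { float lanes[4]; };
typedef native_f32_vec v_float32;

template <typename VecT> struct VTraits;
template <> struct VTraits<v_float32> {
    typedef float lane_type;
    static int nlanes;  // run-time value, not a compile-time constant
};
int VTraits<v_float32>::nlanes = 4;

// Replaces the scalars-to-vector constructor v_float32x4(v0, v1, v2, v3).
// Lanes that the list does not cover are zero-filled (assumption).
inline v_float32 v_load(std::initializer_list<float> scalars) {
    assert(scalars.size() <= static_cast<std::size_t>(VTraits<v_float32>::nlanes));
    v_float32 r = {};
    std::size_t i = 0;
    for (float s : scalars) r.lanes[i++] = s;
    return r;
}

// Replaces operator+: v0 + v1 becomes v_add(v0, v1).
inline v_float32 v_add(const v_float32& a, const v_float32& b) {
    v_float32 r = {};
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// Replaces the member function v.get0().
inline VTraits<v_float32>::lane_type v_get0(const v_float32& v) {
    return v.lanes[0];
}

// Usage: what used to be (a + b).get0() in the wrapper-class style.
inline float demo() {
    v_float32 a = v_load({1.f, 2.f, 3.f, 4.f});
    v_float32 b = v_load({10.f, 20.f, 30.f, 40.f});
    return v_get0(v_add(a, b));
}
```

Existing code would change only at the call sites: the operator, constructor and member calls become free functions, and type members are read through VTraits.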

Possible alternatives

Write the fixed-size universal intrinsics for each possible vector size using templates, and treat the minimum vector size as the baseline. In this case, the same instruction set (e.g. RVV) will have different binaries (RVV128, RVV256, etc.).

References

vpisarev commented 2 years ago

@hanliutong, thanks, it looks pretty good!

some questions/comments:

1. is the term size-less proper? shall we use variable-size instead?
2. the idea of using std::initializer_list<> is very good! hopefully, it can be implemented efficiently
3. it's still important to have the possibility to store/load a vector register to/from memory. nlanes cannot be used, but there can be a compile-time max_nlanes constant that users can specify. See https://github.com/opencv/opencv/pull/20562. In other words, when OpenCV is compiled for vector architectures, we do not fix the vector size, but we set some maximum vector size that the binary will support. E.g. it can be 1024 bits (or 32 floats), then VTraits::max_nlanes will be set to 32 and users can use float foo[VTraits::max_nlanes] to allocate buffers on the stack to store registers.
4. maybe it would also be convenient to have VTraits, e.g. VTraits::nlanes == VTraits::nlanes.
5. in the "possible alternatives" section I think it should be clarified that the proposed implementation for fixed-size vectors should be a template implementation; there should not be separate implementations (header files) for 128, 256 etc. vectors; in any case, it should be a single header file.
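The max_nlanes idea from the store/load point above can be sketched as follows. This is a host-side illustration only: native_f32_vec is a hypothetical stand-in type, the run-time lane count of 8 pretends that the hardware has 256-bit registers, and v_setall_f32, v_store and sum_lanes are simplified scalar emulations written for this sketch.

```cpp
#include <cassert>

// Hypothetical stand-in: storage for the largest supported register (1024 bits).
struct native_f32_vec { float lanes[32]; };
typedef native_f32_vec v_float32;

template <typename VecT> struct VTraits;
template <> struct VTraits<v_float32> {
    typedef float lane_type;
    enum { max_nlanes = 32 };  // compile-time upper bound: 1024 bits of 32-bit floats
    static int nlanes;         // actual lane count, known only at run time
};
int VTraits<v_float32>::nlanes = 8;  // pretend the hardware has 256-bit registers

// Broadcast one scalar to all active lanes.
inline v_float32 v_setall_f32(float x) {
    v_float32 r = {};
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i) r.lanes[i] = x;
    return r;
}

// Store the active lanes of a register to memory.
inline void v_store(float* dst, const v_float32& v) {
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i) dst[i] = v.lanes[i];
}

// A stack buffer sized by max_nlanes is legal because the size is a constant;
// only the first nlanes elements are actually written and read.
inline float sum_lanes(const v_float32& v) {
    float buf[VTraits<v_float32>::max_nlanes];
    assert(VTraits<v_float32>::nlanes <= VTraits<v_float32>::max_nlanes);
    v_store(buf, v);
    float s = 0.f;
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i) s += buf[i];
    return s;
}
```

The cost of this approach is that the buffer is sized for the largest supported register, even when the hardware register is smaller.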

vpisarev commented 2 years ago

@fpetrogalli, you'd probably be interested in looking at this. Yes, progress is slow, but at least for RISC-V we seem to have found a proper compiler-friendly solution, which is to define aliases for our intrinsic types, typedef <native_intrinsic_vec_type> v_float32;, instead of wrapping them inside structures, struct v_float32 { <native_intrinsic_vec_type> val; };, which does not work properly. We will try to build a complete experimental backend for RVV 1.0 this summer; an SVE2 variant could be the next step.

alalek commented 2 years ago

v_float32x4

We still have a lot of code which assumes some SIMD width (SIMD128 in general). So we should emulate them, skip these existing optimizations if not supported, or rewrite them.

Please note that external contributions bring SIMD128 code at best; learning and tuning variable-size SIMD is rocket science.

hanliutong commented 2 years ago

@vpisarev , thanks for your comments!

is the term size-less proper? shall we use variable-size instead?

I think "size-less" is a term used in RVV, especially in the compiler field: when we try to wrap an RVV native type in a struct, clang reports error: field has sizeless type 'vfloat32m1_t' (aka '__rvv_float32m1_t'). But yes, I agree that "variable-size" or "scalable-size" better describes the state of the register and is easier to understand. So which one would you prefer?

2. std::initializer_list<> and 3. max_nlanes

I'm going to try it on my experimental project.

add VTraits

Did you mean VTraits<float>::nlanes == VTraits<v_float32>::nlanes? Because VTraits is a template class, VTraits::nlanes may not work.

"possible alternatives" section

Yes, it should be a template implementation, I'm going to update it.

hanliutong commented 2 years ago

I have already added vload(<scalar_0>, <scalar_1>, ..., <scalar_N>) and VTraits<type>::max_nlanes to my example project.

I also found that for some associative binary operators, the code becomes verbose if we only support two arguments. For example, for v = v0 + v1 + v2:

// with v_add(<vec0>, <vec1>) only
v = v_add(v_add(v0, v1), v2);

vs.

// with v_add(<vec0>, <vec1>, ... , <vecN>) 
v = v_add(v0, v1, v2);

Therefore, I propose adding a new UI API for some operators (+ * & | etc.) that supports multiple parameters.

See https://github.com/hanliutong/rvv-ui/pull/1 for detail
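One possible way to provide such a multi-argument form is a variadic template that folds onto the two-argument function. The sketch below is an assumption about how this could look, not necessarily what the linked pull request does; native_f32_vec is a hypothetical 4-lane stand-in type and sum3 is an invented usage function.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for a native scalable vector type.
struct native_f32_vec { float lanes[4]; };
typedef native_f32_vec v_float32;

// Two-argument base case: the only form that maps to a real instruction.
inline v_float32 v_add(const v_float32& a, const v_float32& b) {
    v_float32 r = {};
    for (int i = 0; i < 4; ++i) r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// Variadic overload for three or more operands. It folds from the left,
// so v_add(v0, v1, v2) evaluates as v_add(v_add(v0, v1), v2).
template <typename... Rest>
inline v_float32 v_add(const v_float32& a, const v_float32& b,
                       const v_float32& c, const Rest&... rest) {
    return v_add(v_add(a, b), c, rest...);
}

// Usage: the flat call must match the nested one lane for lane.
inline v_float32 sum3(const v_float32& v0, const v_float32& v1, const v_float32& v2) {
    v_float32 flat = v_add(v0, v1, v2);
    v_float32 nested = v_add(v_add(v0, v1), v2);
    for (int i = 0; i < 4; ++i) assert(flat.lanes[i] == nested.lanes[i]);
    return flat;
}
```

Folding from the left keeps the evaluation order of v0 + v1 + v2, which matters for floating-point lanes where addition is not exactly associative.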

opencv / opencv

OE xx: Modify universal intrinsics for size-less architectures #21829