Open hanliutong opened 2 years ago
@hanliutong, thanks, it looks pretty good!
some questions/comments:
@fpetrogalli, you'd probably be interested in looking at this. Yes, progress is slow, but at least for RISC-V we seem to have found a proper compiler-friendly solution, which is to define aliases for our intrinsic types:
typedef <native_intrinsic_vec_type> v_float32;
instead of wrapping them inside structures:
struct v_float32 { <native_intrinsic_vec_type> val; };
which does not work properly. We will try to build a complete experimental backend for RVV 1.0 this summer; an SVE2 variant could be the next step.
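To make the contrast between the two declarations concrete, here is a minimal sketch. It assumes a plain struct `native_vec_f32` as a stand-in for `<native_intrinsic_vec_type>` (a real sizeless RVV type could not be declared this way), and the names `v_float32_wrapped` and `usage_example` are hypothetical, introduced only for illustration. With the alias, lane metadata and helpers have nowhere to live inside the type, so they move to a traits class and free functions.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for <native_intrinsic_vec_type>. A real RVV type such
// as vfloat32m1_t is sizeless and could not be declared as a struct like this.
struct native_vec_f32 { float lane[4]; };

// Wrapper style: metadata and helpers live inside the struct.
struct v_float32_wrapped {
    typedef float lane_type;
    native_vec_f32 val;
    float get0() const { return val.lane[0]; }
};

// Alias style: the universal intrinsic type is the native type itself.
typedef native_vec_f32 v_float32;

// With no struct to hold them, metadata moves to a traits class ...
template <typename T> struct VTraits;
template <> struct VTraits<v_float32> { typedef float lane_type; };

// ... and member functions become free functions.
inline VTraits<v_float32>::lane_type v_get0(const v_float32& v) {
    return v.lane[0];
}

static_assert(std::is_same<v_float32, native_vec_f32>::value,
              "the alias adds no wrapper layer");

// Usage: both styles read the same first lane.
inline void usage_example() {
    native_vec_f32 raw = {{7.f, 1.f, 2.f, 3.f}};
    v_float32_wrapped w;
    w.val = raw;
    v_float32 v = raw;
    assert(w.get0() == v_get0(v));
}
```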
v_float32x4
We still have a lot of code that assumes a specific SIMD width (SIMD128 in general). So we should emulate these types, skip the existing optimizations where they are not supported, or rewrite them.
Please note that external contributions bring SIMD128 code at best; learning and tuning variable-size SIMD is rocket science.
@vpisarev, thanks for your comments!
- is the term size-less proper? shall we use variable-size instead?
I think "size-less" is an established term for RVV, especially in the compiler field: when we try to wrap an RVV native type in a struct, clang reports the error: field has sizeless type 'vfloat32m1_t' (aka '__rvv_float32m1_t'). But yes, I agree that "variable-size" or "scalable-size" describes the state of the register better and is easier to understand. So which one would you prefer?
- std::initializer_list<> and 3. max_nlanes
I'm going to try it on my experimental project.
- add VTraits
Did you mean VTraits<float>::nlanes == VTraits<v_float32>::nlanes? Because VTraits is a template class, VTraits::nlanes may not work.
- "possible alternatives" section
Yes, it should be a template implementation, I'm going to update it.
I have already added vload(<scalar_0>, <scalar_1>, ..., <scalar_N>) and VTraits<type>::max_nlanes to my example project.
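A rough sketch of how a scalar-list load and `max_nlanes` could fit together is shown here. It is not taken from the example project: the `v_float32` struct is a hypothetical stand-in for a scalable vector, the lane count of 4 is an arbitrary value chosen for the demo, and the handling of extra or missing scalars is an assumption that the proposal does not specify. The `std::initializer_list` form of the load is used rather than a variadic one.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-in for a scalable vector type: storage for the largest
// supported register, while the active lane count is only known at run time.
struct v_float32 { float lane[16]; };

template <typename T> struct VTraits;
template <> struct VTraits<v_float32> {
    typedef float lane_type;
    enum { max_nlanes = 16 };  // compile-time upper bound, usable as an array size
    static int nlanes;         // run-time lane count (a variable, not a constant)
};
int VTraits<v_float32>::nlanes = 4;  // assumption: pretend the hardware has 4 lanes

// Load from a scalar list. Assumption: extra scalars are ignored and missing
// lanes are zero-filled; the proposal does not specify this behavior.
inline v_float32 v_load(std::initializer_list<float> scalars) {
    v_float32 v = {};
    int i = 0;
    for (float s : scalars) {
        if (i == VTraits<v_float32>::nlanes) break;
        v.lane[i++] = s;
    }
    return v;
}

inline void v_store(float* dst, const v_float32& v) {
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i) dst[i] = v.lane[i];
}

// Sum the active lanes through a stack buffer sized by max_nlanes, because a
// run-time nlanes is not a valid array size in standard C++.
inline float sum_lanes(const v_float32& v) {
    float buf[VTraits<v_float32>::max_nlanes];
    v_store(buf, v);
    float s = 0.f;
    for (int i = 0; i < VTraits<v_float32>::nlanes; ++i) s += buf[i];
    return s;
}

// Usage: four scalars fill the four active lanes.
inline void usage_example() {
    assert(sum_lanes(v_load({1.f, 2.f, 3.f, 4.f})) == 10.f);
}
```

The buffer is sized for the worst case, and only the first `nlanes` entries are read back.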
I also found that for some associative binary operators, the code becomes verbose if we only support two arguments. For example, for v = v0 + v1 + v2:
// with v_add(<vec0>, <vec1>) only
v = v_add(v_add(v0, v1), v2);
vs.
// with v_add(<vec0>, <vec1>, ... , <vecN>)
v = v_add(v0, v1, v2);
Therefore, I propose adding a new UI (universal intrinsics) API for some operators (+ * & | etc.) that supports multiple parameters.
See https://github.com/hanliutong/rvv-ui/pull/1 for details.
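One possible way to get the multi-argument form is a variadic template that folds onto the two-argument intrinsic. This is only a sketch and not necessarily how the linked pull request implements it; the 4-lane `v_float32` struct is a hypothetical stand-in so that the example compiles without any vector extension.

```cpp
#include <cassert>

// Hypothetical stand-in for a vector type with 4 float lanes.
struct v_float32 { float lane[4]; };

// The existing two-argument form.
inline v_float32 v_add(const v_float32& a, const v_float32& b) {
    v_float32 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Three or more arguments: add the first two, then recurse on the rest.
// The evaluation order is left to right, i.e. ((v0 + v1) + v2) + ...
template <typename... Rest>
inline v_float32 v_add(const v_float32& a, const v_float32& b,
                       const v_float32& c, const Rest&... rest) {
    return v_add(v_add(a, b), c, rest...);
}

// Usage: the nested and the variadic spelling give the same lanes.
inline void usage_example() {
    v_float32 v0 = {{1.f, 2.f, 3.f, 4.f}};
    v_float32 v1 = {{10.f, 20.f, 30.f, 40.f}};
    v_float32 v2 = {{100.f, 200.f, 300.f, 400.f}};
    v_float32 nested = v_add(v_add(v0, v1), v2);
    v_float32 variadic = v_add(v0, v1, v2);
    for (int i = 0; i < 4; ++i) assert(nested.lane[i] == variadic.lane[i]);
}
```

Because the recursion bottoms out in the two-argument overload, each backend would only need to provide the binary intrinsic.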
This is a draft evolution proposal. We are going to work on RISC-V Vector acceleration by modifying the universal intrinsics, and we'd like to make sure that the proposed design is compatible with other size-less architectures.
OE xx: Modify universal intrinsics for size-less architectures
Introduction and Rationale
Currently, OpenCV includes universal intrinsics that cover different SIMD extensions on different platforms, such as SSE, AVX2, AVX512 on IA, NEON on ARM and VSX on PPC64. All of these architectures are designed with a fixed register size. Nowadays there are also vector ISA extensions designed with size-less instructions, such as RISC-V Vector (RVV) and ARM SVE/SVE2. Taking RVV as an example, different devices that support the RVV extension may have different vector register sizes: any power-of-two size between 128 bits and 65536 bits is legal for an RVV application processor. Conversely, every AVX2 device has a set of vector registers with a fixed size of 256 bits.
There is already a RISC-V Vector backend in the universal intrinsics, but its performance is very poor and it only supports a fixed vector size of 128 bits. The performance issue is suspected to come from redundant instructions introduced by the wrapper class. The fundamental problem is that the current design of the universal intrinsics may not be suitable for size-less architectures.
So it would be nice to modify the API and implementation of the universal intrinsics for size-less architectures. The modifications may affect existing code written with the universal intrinsics, but we hope the changes will be minor.
Proposed solution
The main modification is to directly use the vector built-in type (e.g. vfloat32m1_t in the RVV intrinsics) as a universal intrinsic type (e.g. v_float32) in the backend of the size-less architecture, and to introduce the VectorTraits class to handle the class members (nlanes and lane_type) of the current wrapper classes.
An example project is at https://github.com/hanliutong/rvv-ui, particularly in intrin_rvv_new.hpp.
Impact on existing code, compatibility
- Use v_float32 etc. as the vector type. Current vector types such as v_float32x4 or v_float32x8 can be renamed to v_float32.
- v0 + v1 should be rewritten as v_add(v0, v1).
- Currently a vector can be initialized with a constructor such as v_float32x4(float v0, float v1, float v2, float v3). But for a scalable vector, we do not know the number of scalars in the vector, and without a wrapper class there are no constructors. So we may also need to introduce a universal intrinsic function overload, such as v_load(std::initializer_list<float> nScalars).
- v_get0(v1) can be introduced instead of v1.get0().
- VTraits<v_float32>::lane_type instead of v_float32::lane_type.
- VTraits<v_float32>::nlanes instead of v_float32::nlanes. Another modification related to nlanes: since the new nlanes will be a static variable rather than a constant, the current use of nlanes in array declarations or initialization, such as float foo[v_float32::nlanes], can no longer be used in the size-less universal intrinsics.
Possible alternatives
Write the fixed-size universal intrinsics for each possible vector size using templates, and treat the minimum vector size as the baseline. In this case, the same instruction set (e.g. RVV) will have different binaries (RVV128, RVV256, etc.).
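As a rough illustration of this alternative, the sketch below parameterizes a fixed-size vector type by register width and selects one instantiation at build time. Everything here is an assumption made for illustration: the `v_float32_fixed` template, the `UI_VLEN_BITS` macro and the scalar loop standing in for real intrinsics are hypothetical names and do not come from OpenCV or the example project.

```cpp
#include <cassert>

// Hypothetical fixed-size vector, parameterized by register width in bits.
template <int VLEN_BITS>
struct v_float32_fixed {
    typedef float lane_type;
    enum { nlanes = VLEN_BITS / 32 };
    float lane[nlanes];
};

// One template covers every width; a real backend would call intrinsics here.
template <int VLEN_BITS>
inline v_float32_fixed<VLEN_BITS> v_add(const v_float32_fixed<VLEN_BITS>& a,
                                        const v_float32_fixed<VLEN_BITS>& b) {
    v_float32_fixed<VLEN_BITS> r;
    for (int i = 0; i < v_float32_fixed<VLEN_BITS>::nlanes; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Hypothetical build-time switch; 128 bits is treated as the baseline.
#ifndef UI_VLEN_BITS
#define UI_VLEN_BITS 128
#endif
typedef v_float32_fixed<UI_VLEN_BITS> v_float32;

static_assert(v_float32_fixed<256>::nlanes == 8, "256-bit build has 8 float lanes");

// nlanes stays a compile-time constant, so array declarations keep working.
inline float first_lane_sum() {
    float foo[v_float32::nlanes] = {1.f};  // remaining elements are zero
    v_float32 a, b;
    for (int i = 0; i < v_float32::nlanes; ++i) {
        a.lane[i] = foo[i];
        b.lane[i] = 2.f;
    }
    return v_add(a, b).lane[0];
}

// Usage: 1.f + 2.f in the first lane.
inline void usage_example() {
    assert(first_lane_sum() == 3.f);
}
```

The trade-off is the one stated above: each width is a separate build of the same source, so a binary compiled for one width only matches devices with that register size.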
References