Introduction
The concept of a HAL (Hardware Abstraction Layer) is well-known. Basically, in some big software package (a library of algorithms, an operating system, a photo editing application etc.) a certain much smaller 'performance-critical' or 'hardware-dependent' subset of the functionality is identified and shaped as a separate 'HAL' layer in the software architecture. Such separation does not just make the architecture cleaner, it also helps to port the software to other hardware, or even to offload this work to hardware vendors, as long as the HAL API is well-specified and stable.
In the early days of OpenCV, which was started at Intel, we used Intel IPP as such an informal HAL. Later on we added Carotene, an IPP-like acceleration library for the ARM platform. At some point we figured out that we needed more and more kernels, or new variations of existing kernels (since computer vision algorithms evolve quite fast), so it became inconvenient to create new HAL entries for each such kernel; it is more convenient to optimize hot loops right inside the algorithm. At that time (~2012) we already actively supported two platforms, x86 and ARM, and so we had to optimize every single 'hot' loop twice, using two quite different-looking but semantically very similar sets of SIMD instructions: SSE2 and NEON.
Universal intrinsics
To solve the problem of having to write and maintain multiple copies of the same optimized loop, we decided to create lightweight wrappers on top of native intrinsics, which we called Universal intrinsics. We currently support various SIMD and vector architectures via this unified API, including SSE2-SSE4.2, AVX2, AVX512, NEON, VSX (PowerPC), RVV (RISC-V) etc.: https://github.com/opencv/opencv/tree/4.x/modules/core/include/opencv2/core/hal.
With time, universal intrinsics have been gradually extended to 'wide universal intrinsics' (to support vector registers wider than 128 bits, as in AVX2 or AVX512) and then to 'scalable universal intrinsics' to cover SVE2 in ARMv9 and RVV in RISC-V, i.e. architectures where the vector register size is unknown at compile time and may vary from one manufacturer or CPU model to another.
Universal intrinsics evolution in OpenCV 5.0
As such, the upcoming OpenCV 5.0, and even OpenCV 4.x, are quite future-proof in terms of SIMD support. Still, the following evolutionary changes to the Universal Intrinsics API could be made in OpenCV 5.0:
CV_SIMD_16F and CV_SIMD_SCALABLE_16F flags to support FP16 and BF16 arithmetic where it is available (currently, ARMv8.2+ only), plus the corresponding implementations of v_add(), v_fma(), v_expand() etc. We already have the similar flags CV_SIMD_64F and CV_SIMD_SCALABLE_64F to handle architectures where SIMD registers and operations support 64-bit floats (doubles). Why implement FP16/BF16 universal intrinsics when just one architecture supports them?
First of all, it will make our FP16-optimized loops more future-proof. x86, RVV and other instruction sets are likely to get FP16/BF16 support in the future, as these are very important data types nowadays.
Secondly, it will let people who are familiar with OpenCV's universal intrinsics program new FP16 kernels and extend existing kernels to support FP16 without having to learn NEON and SVE2.
Then, since we already have template implementations of some basic operations that use universal intrinsics, we can quite easily extend those template functions to support FP16/BF16 without writing completely separate branches.
To further simplify the implementation of optimized template functions that use universal intrinsics, we are going to introduce template alternatives to vx_setzero_...() and vx_setall_...(), e.g. vx_setzero<uint8_t>() instead of vx_setzero_u8(), and vx_setall(1.f) instead of vx_setall_f32(1.f).
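The mechanism could look roughly as follows; this is a minimal sketch where v_uint8/v_float32 are scalar stand-ins for the real OpenCV vector types, and the VTraits/overload layout is illustrative, not the final API:

```cpp
#include <cstdint>

// Scalar stand-ins for the real OpenCV vector types (illustration only).
struct v_uint8   { uint8_t val; };
struct v_float32 { float   val; };

// existing type-suffixed initializers (simplified)
inline v_uint8   vx_setall_u8(uint8_t x) { return { x }; }
inline v_float32 vx_setall_f32(float x)  { return { x }; }

// proposed template front-ends: map a lane type to its vector type...
template<typename T> struct VTraits;
template<> struct VTraits<uint8_t> { typedef v_uint8   vtype; };
template<> struct VTraits<float>   { typedef v_float32 vtype; };

// ...so that vx_setall(1.f) replaces vx_setall_f32(1.f), and
// vx_setzero<uint8_t>() replaces vx_setzero_u8()
inline v_uint8   vx_setall(uint8_t x) { return vx_setall_u8(x); }
inline v_float32 vx_setall(float x)   { return vx_setall_f32(x); }

template<typename T>
inline typename VTraits<T>::vtype vx_setzero() { return vx_setall(T(0)); }
```

With such front-ends, a template kernel parameterized by the lane type T no longer needs per-type branches just to create constants.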
New non-trivial intrinsics:
Math functions: v_exp() (#24941), v_log(), v_sin(), v_cos(), v_tanh(), v_atan2(), v_pow(), v_erf(), v_sigmoid() etc. Such functions are very useful when coding efficient SIMD loops for image processing, deep learning and other domains. Note that we can start with generic implementations that use only basic universal intrinsics; this way we cover all platforms without specialized implementations. Later on we could provide faster specialized implementations for particular platforms.
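To illustrate how such generic implementations compose from basic intrinsics, here is a sketch of v_sigmoid() built only from elementary operations plus v_exp(). The v_f32 type is a scalar stand-in for the real float vector type, and v_exp() is reduced to std::exp for brevity (#24941 provides the actual vectorized one); only the composition pattern matters:

```cpp
#include <cmath>

// Scalar stand-in for a vector of floats; in real code this would be
// v_float32 and the helpers below would be universal intrinsics.
struct v_f32 { float val; };

inline v_f32 v_setall(float x)       { return { x }; }
inline v_f32 v_add(v_f32 a, v_f32 b) { return { a.val + b.val }; }
inline v_f32 v_sub(v_f32 a, v_f32 b) { return { a.val - b.val }; }
inline v_f32 v_div(v_f32 a, v_f32 b) { return { a.val / b.val }; }
inline v_f32 v_exp(v_f32 x)          { return { std::exp(x.val) }; }

// generic sigmoid composed from the primitives above:
// sigmoid(x) = 1 / (1 + exp(-x))
inline v_f32 v_sigmoid(v_f32 x)
{
    v_f32 one = v_setall(1.f);
    return v_div(one, v_add(one, v_exp(v_sub(v_setall(0.f), x))));
}
```

A specialized platform implementation would later replace this generic body without changing any caller.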
1D/2D interpolation: linear/bilinear and cubic/bicubic interpolation. Such intrinsics help to implement table interpolation of some functions (e.g. in colorspace conversion kernels), as well as image resize, image warping, remap etc. The API still needs to be specified, but preliminary experiments show that on ARM with NEON such interpolation can be done quite efficiently using specialized data-permutation instructions (approximately 3-4x faster than the current warpAffine etc.). For other platforms, generic (but still vectorized) implementations can be provided.
CPU HAL beyond universal intrinsics: the current state
Sometimes kernels are extremely performance-critical, and they can be implemented most efficiently using hardware-specific instructions that don't exist on other platforms. For example, ARMv8.6 introduced an instruction that computes the matrix product of 2x4 and 4x2 BF16 matrices, which may roughly double the peak performance of MatMul or Conv2D deep learning kernels. But in order to use this instruction at close-to-peak performance, we need to rearrange the input matrices (which can be done on the fly, block by block). In some other cases the fastest implementations of certain seemingly 'basic' algorithms, like GEMM or DFT, are very complex and require special external software packages.
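As a reference point, this is the block product that such an instruction computes, written out as a scalar micro-kernel (shown in float for clarity; the hardware variant takes BF16 inputs and accumulates in FP32):

```cpp
// C (2x2) += A (2x4, row-major) * B (4x2, row-major)
inline void mmla_2x4x2(const float A[8], const float B[8], float C[4])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 4; k++)
                C[i*2 + j] += A[i*4 + k] * B[k*2 + j];
}
```

Feeding such an instruction at peak rate is exactly what forces the block-by-block repacking of A and B mentioned above: the 2x4 and 4x2 operand tiles must be contiguous in registers.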
In such cases the 'Universal intrinsics' approach does not work. Instead, a special low-level API, similar to IPP or Carotene, should be introduced for such performance-critical kernels. These kernels should have a default implementation inside OpenCV, but it should also be possible to compile the library with a custom vendor-provided HAL that overrides such low-level kernels (as is done now when OpenCV is compiled with IPP).
In OpenCV 3.x such an IPP-like CPU HAL was introduced for this purpose. So far it has quite limited functionality:
a part of Core: primitive arithmetic operations, math functions, a few linear algebra functions and DFT/DCT.
Besides the official HAL API, OpenCV 4.x also uses IPP directly via conditional compilation. The functionality coverage is roughly the same as with the official HAL API.
CPU HAL beyond universal intrinsics: 5.0 proposal
Since the HAL (beyond universal intrinsics) should have a very stable and well-justified API, it is suggested to start by removing the current obsolete HAL API (which covers very little of OpenCV's functionality, as described above) and then, after #25011 and #25012 are implemented and profiled, introduce the new API. We should probably postpone this part of the feature request until subsequent 5.x releases.
In OpenCV 5 we plan to clean up the code and, as a part of that, move IPP into a dedicated HAL library, just like Carotene now.
At the same time, we want to establish a completely new, more or less stable HAL API (which we will further extend) that will cover a significantly bigger part of the Core, Imgproc and DNN modules than now.
Here are the main features of the new HAL:
We are going to cover a big part of the element-wise, matrix and linear algebra operations from #25011. Note that we will have dedicated entries for AopS and AopA operations to support broadcasting efficiently.
The current API is quite bloated for the amount of functionality it offers (and we are going to cover 50x or 100x more functions), and there is noticeable overhead, especially for simple low-level kernels like add() or exp(). We are going to solve both problems at once. For primitive functions that need to handle multiple data types we introduce cv_hal_get_..._func functions that return a pointer to the particular optimized function (or nullptr if a certain data type is not supported).
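Roughly, the change could look like this (a hedged sketch; the exact API specification is still in progress). Instead of a separate HAL entry per element type (cv_hal_add8u, cv_hal_add16s and so on), a single getter hands out the kernel for the requested depth; cv_hal_binary_AopA_t matches the type used later in this document, while the kernel body and the depth encoding are placeholders:

```cpp
#include <cstddef>
#include <cstdint>

// one function-pointer type covers all element types
typedef int (*cv_hal_binary_AopA_t)(const void* a, size_t astep,
                                    const void* b, size_t bstep,
                                    void* c, size_t cstep,
                                    int rows, int cols);

// reference kernel for 8-bit unsigned saturating add
static int add8u_impl(const void* a, size_t astep, const void* b, size_t bstep,
                      void* c, size_t cstep, int rows, int cols)
{
    for (int y = 0; y < rows; y++) {
        const uint8_t* pa = (const uint8_t*)a + y*astep;
        const uint8_t* pb = (const uint8_t*)b + y*bstep;
        uint8_t* pc = (uint8_t*)c + y*cstep;
        for (int x = 0; x < cols; x++) {
            int s = pa[x] + pb[x];
            pc[x] = (uint8_t)(s < 255 ? s : 255);  // saturate, as cv::add does
        }
    }
    return 0;
}

// the getter: returns 0 and a non-null pointer if the type is supported
int cv_hal_get_add_func(int depth, cv_hal_binary_AopA_t* func)
{
    *func = (depth == 0 /* CV_8U */) ? add8u_impl : nullptr;
    return *func ? 0 : -1;
}
```

One getter per operation, returning typed kernels, scales much better than one exported symbol per (operation, type) pair.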
Such a transformation will let us reduce the HAL API significantly (for example, we would otherwise need over 200 functions to cover just the type conversion part) and will also reduce overhead when a primitive function must be called multiple times inside a loop. For example, the current add8u() function works like this:
first we try to call the external HAL function. If the operation is not supported or there is no external HAL, we get a 'not implemented' return code and proceed with the other options. When there is no external HAL, an inline stub is used here, so an optimizing compiler will likely eliminate this step.
next we try to call IPP
if we still did not succeed, we use a dispatcher that, depending on the actual instruction set (SSE2, AVX2, AVX512 etc.), calls the proper implementation of the add function.
This is quite noticeable overhead, given that modern CPUs can compute the sum of 16 or 32 pairs of uint8_t integers in 0.5 CPU clocks. By changing the HAL API to cv_hal_get_..._func() we can reduce the overhead quite significantly. On the OpenCV side the dispatcher function may look like:
namespace cv { namespace hal {

cv_hal_binary_AopA_t get_add_func(int depth)
{
    CV_INSTRUMENT_REGION();
    cv_hal_binary_AopA_t func = nullptr;
    // try to retrieve the function pointer from the external HAL, if any
    CALL_HAL(cv_hal_get_add_func, (depth, &func))
    if (!func) {
        // retrieve a pointer to the fastest function for the current host CPU
        CV_CPU_DISPATCH(get_add_func_, (depth, &func), CV_CPU_DISPATCH_MODES_ALL);
    }
    CV_Assert(func != nullptr);
    return func;
}

}}
That is, first we try to get the function pointer from the external HAL. If that fails, we retrieve the always-available low-level function from OpenCV itself, returning a pointer to the optimal implementation for the current hardware. The function itself no longer needs to contain any dispatching code.
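On the caller side, the benefit is that dispatch happens once per call site rather than once per invocation. A simplified, self-contained sketch (get_add_func and the kernel here are stand-ins for the real cv::hal entries, and saturating 8-bit add is used as the example operation):

```cpp
typedef int (*binary_func_t)(const unsigned char* a, const unsigned char* b,
                             unsigned char* c, int len);

// stand-in for an optimized kernel chosen by the dispatcher
static int add_u8(const unsigned char* a, const unsigned char* b,
                  unsigned char* c, int len)
{
    for (int i = 0; i < len; i++) {
        int s = a[i] + b[i];
        c[i] = (unsigned char)(s < 255 ? s : 255);
    }
    return 0;
}

// stand-in for cv::hal::get_add_func(depth)
static binary_func_t get_add_func(int /*depth*/) { return add_u8; }

void add_many_rows(const unsigned char* a, const unsigned char* b,
                   unsigned char* c, int rows, int cols)
{
    binary_func_t f = get_add_func(0 /* CV_8U */);  // dispatch once
    for (int y = 0; y < rows; y++)                  // hot loop: direct calls
        f(a + y*cols, b + y*cols, c + y*cols, cols);
}
```

All the HAL/IPP/CPU-dispatch probing from the add8u() description above is paid once, before the loop, instead of on each row.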
For most functions, except primitive element-wise operations, we are going to add a protocol to calculate the required scratch buffer size, similar to the one used in the LAPACK library, e.g.:
typedef int (*cv_hal_svd_func_t)(void* A, size_t astep, int arows, int acols,
                                 void* W, void* V, size_t vstep,
                                 void* scratchbuf, size_t* scratchbufsize);
void cv_hal_get_svd_func(int depth, cv_hal_svd_func_t*);
When scratchbuf=nullptr is passed, the SVD function does not compute the singular value decomposition; instead, it calculates and stores the required scratch buffer size in scratchbufsize:
auto svd_64f = cv::hal::get_svd_func(CV_64F);
// compute scratch buf size that is needed for SVD of 1000x1000 FP64 matrix
size_t scratchbufsize = 0;
svd_64f(nullptr, 0, 1000, 1000, nullptr, nullptr, 0, nullptr, &scratchbufsize);
The mechanism of linking and using an external HAL will basically remain the same: the external HAL will have to define cv_hal_... macros to override the standard stub functions. Potentially, several external CPU HAL libraries can be used at once:
// the names of macros here are approximate and used just for illustration
#include "opencv2/core/hal/interface.hpp"
#if CV_USE_EXTERNAL_HAL
INCLUDE_CV_EXTERNAL_HAL_HEADERS
#endif
// _ni stands for 'not implemented'
static inline int hal_get_add_func_ni(int depth, cv_hal_binary_AopA_t* func)
{ *func = nullptr; return CV_HAL_NOT_IMPLEMENTED_ERR; }
#ifndef cv_hal_get_add_func
#define cv_hal_get_add_func hal_get_add_func_ni
#endif
Extra rules for CPU HAL 5.0
The detailed OpenCV 5.0 CPU HAL API will be submitted in a dedicated pull request. Here are some rules that we are going to set for external HAL implementations:
The CPU HAL, unlike the non-CPU HAL (TBD link), is an immediate-mode, mostly single-threaded API. That is, element-wise, filtering and other data-local functions should not use any threading or asynchronous execution mechanisms; it is assumed that OpenCV organizes threading/pipelining on top of such low-level kernels. There are some exceptions, most notably linear algebra functions, and maybe GEMM as well. Of course, CPU HAL functions must be reentrant (threading-friendly). For HAL functions that employ internal parallelism, there should be a way to control the number of tasks and to assign the parallel_for implementation, e.g.:
typedef void (*cv_hal_parallel_for_body_t)(int start, int end, int nsubtasks, void* userdata);
typedef void (*cv_hal_parallel_for_t)(int nsubtasks, cv_hal_parallel_for_body_t body,
                                      void* userdata, double ntasks);
// set the new 'parallel for' engine; return the previously set one (if any)
cv_hal_parallel_for_t cv_hal_set_parallel_for(cv_hal_parallel_for_t custom_parallel_for);
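A self-contained sketch of how this hook could be wired up, using the typedefs above; the default engine here is serial, and a vendor HAL (or OpenCV itself) would install a threaded one via the setter:

```cpp
typedef void (*cv_hal_parallel_for_body_t)(int start, int end, int nsubtasks, void* userdata);
typedef void (*cv_hal_parallel_for_t)(int nsubtasks, cv_hal_parallel_for_body_t body,
                                      void* userdata, double ntasks);

// trivial default engine: run the whole range in one task
static void serial_parallel_for(int nsubtasks, cv_hal_parallel_for_body_t body,
                                void* userdata, double /*ntasks*/)
{
    body(0, nsubtasks, nsubtasks, userdata);
}

static cv_hal_parallel_for_t g_parallel_for = serial_parallel_for;

// set the new 'parallel for' engine; return the previously set one
cv_hal_parallel_for_t cv_hal_set_parallel_for(cv_hal_parallel_for_t custom_parallel_for)
{
    cv_hal_parallel_for_t prev = g_parallel_for;
    g_parallel_for = custom_parallel_for ? custom_parallel_for : serial_parallel_for;
    return prev;
}
```

HAL kernels with internal parallelism would call g_parallel_for instead of spawning threads themselves, so the host application keeps control over threading.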
All functions must operate on the supplied memory addresses; there should be no special data alignment requirements. We may introduce special '_aligned' flavors in the HAL API later, but generally low-level primitives should run on any provided data pointers, including unaligned ones. On the other hand, it is safe to assume that arrays of 64-bit elements are 8-byte aligned, arrays of 32-bit elements are 4-byte aligned and arrays of 16-bit elements are at least 2-byte aligned.
The CPU HAL should not demand that data buffers be allocated using a special vendor-provided API. If there is such a requirement, consider creating a non-CPU HAL instead.
Once a non-null function pointer is returned by cv_hal_get_..._func(), the provided function must process any supplied data. There is no way for that low-level function to return 'not implemented', and there is no fallback in OpenCV to handle such a case. That is, a custom HAL may be incomplete in terms of supported types or supported functions, but each provided function must be complete: it should handle all the corner cases (e.g. arrays of just 1 element) properly.
Update: this rule can be relaxed: OpenCV's get(...) may return 2 pointers, accelerated and default. If the accelerated function returns 'not implemented', the default function is called:
cv_hal_resize_t f_hal, f0;
cv_hal_get_resize_bilinear(CV_8UC3, &f_hal, &f0);
// CV_CALL_HAL(...) does the following:
// ({ int retcode = f_hal(args...);
//    if (retcode == CV_HAL_NOT_IMPLEMENTED) retcode = f0(args...);
//    retcode; })
CV_CALL_HAL(f_hal, f0, (src.data, src.step, src.rows, src.cols, dst.data, dst.step, dst.rows, dst.cols));
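The relaxed scheme can be shown end-to-end with a compilable sketch (all names, the toy kernels and the return codes are illustrative, not the final specification):

```cpp
enum { CV_HAL_OK = 0, CV_HAL_NOT_IMPLEMENTED = 1 };

typedef int (*unary_func_t)(const float* src, float* dst, int len);

// always-available default implementation
static int scale2_default(const float* src, float* dst, int len)
{
    for (int i = 0; i < len; i++) dst[i] = 2.f*src[i];
    return CV_HAL_OK;
}

// 'accelerated' version that only handles even lengths
static int scale2_fast(const float* src, float* dst, int len)
{
    if (len % 2 != 0) return CV_HAL_NOT_IMPLEMENTED;
    for (int i = 0; i < len; i += 2) {
        dst[i]   = 2.f*src[i];
        dst[i+1] = 2.f*src[i+1];
    }
    return CV_HAL_OK;
}

// what a CV_CALL_HAL-like helper would expand to
static int call_hal(unary_func_t f_hal, unary_func_t f0,
                    const float* src, float* dst, int len)
{
    int ret = f_hal ? f_hal(src, dst, len) : CV_HAL_NOT_IMPLEMENTED;
    return ret == CV_HAL_NOT_IMPLEMENTED ? f0(src, dst, len) : ret;
}
```

This keeps the 'each provided function must be complete' guarantee for callers while letting an accelerated kernel decline inputs it does not cover.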
Each CPU HAL library implementation may require an initialization function (once per process, and maybe another one for once-per-thread initialization). OpenCV should take care of calling it.
Once a HAL function is introduced, its API is fixed forever. If we need extra functionality, we create a cv_hal_..._v2 function. It is up to OpenCV to keep special code branches that use older versions of a certain HAL entry. Therefore, introducing or extending the HAL specification is a very responsible task.
With item 6 in mind, we probably need scripts to check HAL API immutability, and also a clean HAL API specification somewhere in the OpenCV docs.
Testing 3rd-party HAL implementations for accuracy is a separate big topic and is out of scope of this document. The general rule of thumb is that OpenCV unit tests must still pass regardless of the HAL used.
[Update: see item 4; this item is addressed there.] Probably, for some accuracy-critical algorithms those cv::hal::get...func() functions should have a flag to always return OpenCV's version of the HAL function, even in the presence of an external HAL, e.g. auto trustworthy_resize_8u = cv::hal::get_resize_linear_func(CV_8U, CV_HAL_USE_OPENCV).
Some HAL implementations may provide a built-in JIT compiler for more or less simple expressions on arrays, images etc. For example:
sigmoid(A*x + b) - matrix multiplication with bias and activation
x + alpha*min(max(x - gaussian(x, sigma), -t), t) - unsharp mask
[canvas, w] = (1 - alpha)*[canvas, w] + alpha*warpPerspective_with_mask(image_i, transform_i) - image stitching
It would be nice to have some extensible 'language' for such expressions so that the HAL can generate code for them on the fly. We could start with element-wise expressions and then extend to filter + element-wise, image warping + element-wise, and matrix multiplication + element-wise expressions. The obvious (non-CPU) examples of such HAL implementations are OpenCL and GLSL, where a shader language lets us form mini-programs on the fly. Less obvious, but still popular, is NVIDIA CUDA with its PTX. For the CPU we could use Loops: https://github.com/4ekmah/loops.