opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

New CPU HAL for OpenCV 5.0 #25019

Open vpisarev opened 3 months ago

vpisarev commented 3 months ago

Introduction

The concept of a HAL (Hardware Abstraction Layer) is well known. Basically, it means that within some big software package (a library of algorithms, an operating system, a photo editing application, etc.) a much smaller 'performance-critical' or 'hardware-dependent' subset of functionality is identified and shaped into a separate 'HAL' layer of the software architecture. Such separation does not just make the architecture cleaner; it also helps to port the software to other hardware, or even offload that work to hardware vendors, as long as the HAL API is well specified and stable.

In the early days of OpenCV, which was started at Intel, we used Intel IPP as such an informal HAL. Later on we added Carotene, an IPP-like acceleration library for the ARM platform. At some point we realized that we need more and more kernels, or new variations of existing kernels (since computer vision algorithms evolve quite fast), so it became inconvenient to create a new HAL entry for each such kernel; it is more convenient to optimize hot loops right inside the algorithm. At that time (~2012) we already actively supported two platforms, x86 and ARM, so we had to optimize each single 'hot' loop twice, using two quite different-looking but semantically very similar sets of SIMD instructions: SSE2 and NEON.

Universal intrinsics

To avoid having to write and maintain multiple copies of the same optimized loop, we decided to create lightweight wrappers on top of native intrinsics, which we called Universal intrinsics. We currently support various SIMD and vector architectures via this unified API, including SSE2-SSE4.2, AVX2, AVX512, NEON, VSX (PowerPC), RVV (RISC-V) etc.: https://github.com/opencv/opencv/tree/4.x/modules/core/include/opencv2/core/hal.

With time, universal intrinsics have been gradually extended to 'wide universal intrinsics' (to support vector registers wider than 128 bits, as in AVX2 or AVX512) and then to 'scalable universal intrinsics' to cover SVE2 in ARMv9 and RVV in RISC-V, i.e. architectures where the vector register size is unknown at compile time and may vary from one manufacturer or CPU model to another.
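To give an idea of how this looks in practice, here is a minimal sketch of a hot loop written once with universal intrinsics; the helper is hypothetical (not taken from the library) and assumes the 4.x API from opencv2/core/hal/intrin.hpp. The same source maps to SSE, AVX2/AVX-512, NEON, RVV etc. depending on the build:

    // Hypothetical helper: dst[i] = a[i]*alpha + b[i]*beta, vectorized with
    // OpenCV universal intrinsics (wide or scalable, whichever is available).
    #include "opencv2/core/hal/intrin.hpp"

    static void add_weighted_f32(const float* a, const float* b, float* dst,
                                 int n, float alpha, float beta)
    {
    #if (CV_SIMD || CV_SIMD_SCALABLE)
        const int vlen = cv::VTraits<cv::v_float32>::vlanes();  // lanes per vector register
        cv::v_float32 valpha = cv::vx_setall_f32(alpha), vbeta = cv::vx_setall_f32(beta);
        int i = 0;
        for (; i + vlen <= n; i += vlen)
        {
            cv::v_float32 x = cv::vx_load(a + i), y = cv::vx_load(b + i);
            cv::v_store(dst + i, cv::v_add(cv::v_mul(x, valpha), cv::v_mul(y, vbeta)));
        }
        for (; i < n; i++)                                       // scalar tail
            dst[i] = a[i]*alpha + b[i]*beta;
    #else
        for (int i = 0; i < n; i++)
            dst[i] = a[i]*alpha + b[i]*beta;
    #endif
    }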

Universal intrinsics evolution in OpenCV 5.0

As such, the upcoming OpenCV 5.0, and even OpenCV 4.x, are quite future-proof in terms of SIMD support. Still, the following evolutionary changes to the Universal intrinsics API could be introduced in OpenCV 5.0:

CPU HAL beyond universal intrinsics: the current state

Sometimes kernels are so performance-critical that they can be implemented most efficiently only with hardware-specific instructions that don't exist on other platforms. For example, ARMv8.6 introduced an instruction that computes the matrix product of a 2x4 and a 4x2 BF16 matrix, which may roughly double the peak performance of MatMul or Conv2D deep learning kernels. But in order to use this instruction at close-to-peak performance, we need to rearrange the input matrices (which can be done on the fly, block by block). In other cases the fastest implementations of certain seemingly 'basic' algorithms, like GEMM or DFT, are very complex and require special external software packages.
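For illustration only (this is not part of the proposal), the instruction in question is exposed by ARM compilers as the vbfmmlaq_f32 intrinsic; a minimal sketch, assuming a toolchain targeting armv8.6-a+bf16, with hypothetical tile names:

    // BFMMLA: accumulate the product of a 2x4 BF16 tile and a 4x2 BF16 tile
    // into a 2x2 FP32 tile held in a single NEON register. The tiles are
    // assumed to be pre-rearranged block by block, as described above.
    #include <arm_neon.h>

    static inline float32x4_t bf16_tile_mma(float32x4_t acc_2x2,
                                            bfloat16x8_t a_2x4,
                                            bfloat16x8_t b_4x2)
    {
        return vbfmmlaq_f32(acc_2x2, a_2x4, b_4x2);
    }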

In such cases the 'Universal intrinsics' approach will not work. Instead, a special low-level API, similar to IPP or Carotene, should be introduced for such performance-critical kernels. These kernels should have a default implementation inside OpenCV, but it should also be possible to compile the library with a custom vendor-provided HAL that overrides such low-level kernels (as is done now when OpenCV is compiled with IPP).

In OpenCV 3.x such an IPP-like CPU HAL was introduced for this purpose. So far it has quite limited functionality:

Besides the official HAL API, OpenCV 4.x also uses IPP directly via conditional compilation. The functionality coverage is roughly the same as with the official HAL API.
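As a reminder of how the current mechanism works, a custom HAL ships a replacement header that overrides selected cv_hal_* macros with its own functions, while everything left untouched falls back to OpenCV's built-in code. A rough sketch, assuming the cv_hal_add8u entry of core's hal_replacement.hpp (the vendor function name is hypothetical):

    // Vendor header, included into the OpenCV build via the custom HAL
    // configuration; it redirects cv_hal_add8u to the vendor implementation.
    #include "opencv2/core/hal/interface.h"  // error codes and basic typedefs

    inline int my_add8u(const uchar* src1, size_t step1, const uchar* src2, size_t step2,
                        uchar* dst, size_t step, int width, int height)
    {
        if (width * height < 1024)
            return CV_HAL_ERROR_NOT_IMPLEMENTED;  // let OpenCV handle tiny arrays
        /* ... vendor-optimized saturating addition ... */
        return CV_HAL_ERROR_OK;
    }

    #undef  cv_hal_add8u
    #define cv_hal_add8u my_add8u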

CPU HAL beyond universal intrinsics: 5.0 proposal

Since the HAL (beyond universal intrinsics) should have a very stable and well-justified API, it's suggested to start by removing the current obsolete HAL API (which covers very little of OpenCV's functionality, as described above); then, after #25011 and #25012 are implemented and profiled, we can introduce the new API. Probably, this part of the feature request should be postponed till subsequent 5.x releases.

In OpenCV 5 we plan to clean up the code and, as a part of that, move IPP into a dedicated HAL library, just as Carotene is now.

At the same time, we want to establish a completely new, but more or less stable, HAL API (which we will extend further) that will cover a significantly bigger part of the Core, Imgproc and DNN modules than is covered now.

Here are the main features of the new HAL:

Extra rules for CPU HAL 5.0

The detailed OpenCV 5.0 CPU HAL API will be submitted in a dedicated pull request. Here are some rules that we are going to set for external HAL implementations:

  1. CPU HAL, unlike non-CPU HAL (TBD link), is an immediate-mode, mostly single-threaded API. That is, element-wise, filtering and other data-local functions should not use any threading or asynchronous execution mechanisms; it's assumed that OpenCV organizes such threading/pipelining on top of these low-level kernels. There are some exceptions, most notably linear algebra functions, and maybe GEMM as well. Of course, CPU HAL functions must be reentrant (threading-friendly). For HAL functions that employ internal parallelism, there should be a way to control the number of tasks and to supply a custom parallel_for implementation (a hypothetical sketch of such a hook is given at the end of this post), e.g.:

    typedef void (*cv_hal_parallel_for_body_t)(int start, int end, int nsubtasks, void* userdata);
    typedef void (*cv_hal_parallel_for_t)(int nsubtasks, cv_hal_parallel_for_body_t body,
                                          void* userdata, double ntasks);
    // set the new 'parallel for' engine; return the previously set one (if any)
    cv_hal_parallel_for_t cv_hal_set_parallel_for(cv_hal_parallel_for_t custom_parallel_for);
  2. All functions must operate on supplied memory addresses. There should be no special data alignment requirements. We may introduce special '_aligned' flavors in the HAL API later, but generally low-level primitives should run on any provided data pointers, including unaligned ones. On the other hand, it's safe to assume that arrays of 64-bit elements are 8-byte aligned, arrays of 32-bit elements are 4-byte aligned and arrays of 16-bit elements are at least 2-byte aligned.

  3. A CPU HAL should not demand that data buffers be allocated using a special vendor-provided API. If there is such a requirement, consider creating a non-CPU HAL instead.

  4. Once a non-null function pointer is returned by cv_hal_get_..._func(), the provided function must process any supplied data. There is no way for that low-level function to return 'not implemented', and there should be no fallback in OpenCV to handle such a case. That is, a custom HAL may be incomplete in terms of supported types or supported functions, but each provided function must be complete: it should handle all the corner cases (e.g. arrays of just 1 element) properly.

    Update: this rule can be relaxed: OpenCV's get(...) may return 2 pointers, accelerated and default. If the accelerated function returns 'not implemented', then the default function is called:

     cv_hal_resize_t f_hal, f0;
     cv_hal_get_resize_bilinear(CV_8UC3, &f_hal, &f0);
     // CV_CALL_HAL(...) does the following:
     // ({ int retcode = f_hal(args ...);
     //    if (retcode == CV_HAL_NOT_IMPLEMENTED) retcode = f0(args ...);
     //    retcode; })
     CV_CALL_HAL(f_hal, f0, (src.data, src.step, src.rows, src.cols, dst.data, dst.step, dst.rows, dst.cols));
  5. Each CPU HAL library implementation may require an initialization function (once per process, and maybe another one for per-thread initialization). OpenCV should take care of calling it.

  6. Once a HAL function is introduced, its API is fixed forever. If we need extra functionality, we create a cv_hal_..._v2 function. It's up to OpenCV to keep special code branches that use older versions of certain HAL entries. Therefore, introducing or extending the HAL specification is a very big responsibility.

  7. With item 6 in mind, we probably need scripts to check HAL API immutability, and also a clean HAL API specification somewhere in the OpenCV docs.

  8. Testing 3rd-party HAL implementations for accuracy is a separate big topic and is out of scope of this document. The general rule of thumb is that OpenCV unit tests must still pass regardless of the HAL used.

  9. [Update: see item 4; this item is addressed there.] Probably, for some accuracy-critical algorithms those cv::hal::get...func() functions should have a flag to always return OpenCV's own version of the HAL function, even when an external HAL is present, e.g. auto trustworthy_resize_8u = cv::hal::get_resize_linear_func(CV_8U, CV_HAL_USE_OPENCV).

  10. Some HAL implementations may provide a built-in JIT compiler for more or less simple expressions on arrays, images etc. For example:

    • sigmoid(A*x + b) - matrix multiplication with bias and activation
    • x + alpha*min(max(x - gaussian(x, sigma), -t), t) - unsharp mask
    • [canvas, w] = (1 - alpha)*[canvas, w] + alpha*warpPerspective_with_mask(image_i, transform_i) - image stitching

    It would be nice to have some extensible 'language' for such expressions so that a HAL may generate code for them on the fly. We could at least start with element-wise expressions and then extend the language to filter + element-wise expressions, image warping + element-wise expressions, and matrix multiplication + element-wise expressions. The obvious (non-CPU) examples of such HAL implementations are OpenCL and GLSL, where we have a shader language that we can use to form mini-programs on the fly. Less obvious, but still popular, is NVIDIA CUDA with its PTX. For CPUs we could use Loops: https://github.com/4ekmah/loops.
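To make item 1 above more concrete, here is a hypothetical sketch of how an external HAL might plug in its own 'parallel for' engine (an OpenMP-based one in this example) through the proposed hook. Apart from the typedefs quoted from item 1, the names and the interpretation of the body arguments are assumptions for illustration only:

    #include <omp.h>

    // Proposed hook from item 1 (repeated here for self-containedness).
    typedef void (*cv_hal_parallel_for_body_t)(int start, int end, int nsubtasks, void* userdata);
    typedef void (*cv_hal_parallel_for_t)(int nsubtasks, cv_hal_parallel_for_body_t body,
                                          void* userdata, double ntasks);
    cv_hal_parallel_for_t cv_hal_set_parallel_for(cv_hal_parallel_for_t custom_parallel_for);

    // Hypothetical engine: run the subtasks with OpenMP. Here 'ntasks' is
    // interpreted as a hint for the desired number of worker threads
    // (ntasks <= 0 meaning "let the engine decide").
    static void omp_parallel_for(int nsubtasks, cv_hal_parallel_for_body_t body,
                                 void* userdata, double ntasks)
    {
        int nthreads = ntasks > 0 ? (int)ntasks : omp_get_max_threads();
        #pragma omp parallel for num_threads(nthreads) schedule(dynamic)
        for (int i = 0; i < nsubtasks; i++)
            body(i, i + 1, nsubtasks, userdata);   // each subtask is the range [i, i+1)
    }

    // During HAL initialization (see item 5):
    //     cv_hal_parallel_for_t prev = cv_hal_set_parallel_for(omp_parallel_for);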