Closed Kumataro closed 1 month ago
This patch improves performance.
I compared the instruction reference (Ir) counts reported by callgrind before and after the change.
Environment | Ir of write_mat_to_xrgb8888() |
---|---|
OpenCV 4.9(Before) | 5,911,406 |
Without SIMD | 2,453,868 |
SSE3 + SIMD128 | 1,150,789 |
SSE4.1 + SIMD128 | 325,695 |
AVX2 + SIMD256 | 260,544 |
The difference between SSE3 and SSE4.1 comes from the intrinsic implementation.
The test program is below.
// g++ main.cpp -o a.out -I /usr/local/include/opencv4 -lopencv_core -lopencv_highgui -lopencv_imgcodecs
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <string>

int main(int argc, char *argv[])
{
    std::cout << "cv::currentUIFramework() returns " << cv::currentUIFramework() << std::endl;

    cv::Mat src;
    src = cv::imread("opencv-logo.png");
    cv::namedWindow("src");
    cv::imshow("src", src);
    (void)cv::waitKey(1000);
    return 0;
}
The commands are below.
valgrind --tool=callgrind ./a.out
callgrind_annotate callgrind.out.[PID] | grep 8888 | head -1
Proposed todo: Add support for RISC-V RVV and other scalable vector intrinsics. This requires the CV_SIMD_SCALABLE macro and a loop step whose value is determined at run time.
Thank you for your proposal! I'll try it this weekend. I have no ARM SVE or RISC-V Vector Extension environment, but it looks like it can be tested in an AVX environment.
I think the current implementation can be refactored along the lines of the split function: https://github.com/opencv/opencv/blob/4.x/modules/core/src/split.simd.hpp
For example (this is only a sketch of what I imagine):
template<typename T, typename VecT> static void
vecwrite_T_to_xrgb8888( const T* src, T* dst, int len, int scn )
{
    const int VECSZ = VTraits<VecT>::vlanes();
    const int dcn = 4; // XRGB
    :
    :
    else if( scn == 3 )
    {
        for( i = 0; i < len; i += VECSZ )
        {
            if( i > len - VECSZ )
            {
                i = len - VECSZ;
                mode = hal::STORE_UNALIGNED;
            }
            VecT b,g,r;
            v_load_deinterleave(src + i*scn, b, g, r);
            v_store_interleave (dst + i*dcn, b, g, r, r, mode);
            if( i < i0 )
            {
                i = i0 - VECSZ;
                mode = hal::STORE_ALIGNED_NOCACHE;
            }
        }
    }
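The tail handling in the sketch above can be illustrated without any SIMD types. The following is a minimal scalar stand-in, not the code from this patch: `convert_block()` and `bgr_to_bgrx_blocks()` are hypothetical names, `convert_block()` takes the place of the `v_load_deinterleave` / `v_store_interleave` pair, and `lanes` is passed as a run-time value in the way `VTraits<VecT>::vlanes()` would be on a scalable target. When the last block would run past the end of the row, the index is moved back to `len - lanes`, so the final block overlaps pixels that were already written instead of falling back to a scalar loop. The aligned / no-cache store mode switching is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one vector iteration: converts `lanes` BGR pixels
// starting at pixel index `i` into BGRX. X is filled with R, as in the sketch.
static void convert_block(const std::uint8_t* src, std::uint8_t* dst,
                          std::size_t i, std::size_t lanes)
{
    for (std::size_t k = i; k < i + lanes; ++k)
    {
        dst[k * 4 + 0] = src[k * 3 + 0]; // B
        dst[k * 4 + 1] = src[k * 3 + 1]; // G
        dst[k * 4 + 2] = src[k * 3 + 2]; // R
        dst[k * 4 + 3] = src[k * 3 + 2]; // X (ignored by the consumer)
    }
}

// Converts `len` pixels in blocks of `lanes` pixels, where `lanes` is only
// known at run time. Returns the number of blocks processed.
// src and dst must not alias, because the last block may be converted twice.
static std::size_t bgr_to_bgrx_blocks(const std::uint8_t* src, std::uint8_t* dst,
                                      std::size_t len, std::size_t lanes)
{
    assert(lanes > 0 && len >= lanes); // shorter rows would need a scalar path
    std::size_t blocks = 0;
    for (std::size_t i = 0; i < len; i += lanes)
    {
        if (i > len - lanes)
            i = len - lanes; // back up so the last block ends exactly at len
        convert_block(src, dst, i, lanes);
        ++blocks;
    }
    return blocks;
}
```

With `len = 10` and `lanes = 4`, the blocks start at pixels 0, 4, and 6, so pixels 6 and 7 are converted twice with the same result.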
I updated the code to support CV_SIMD_SCALABLE and tested it on VMware (AVX2) and a Raspberry Pi 4 (NEON) with Ubuntu 24.04. opencv_test_highgui passes, and it calls the vector implementation.
The logic is simple. I added AVX512_SKX, LASX, and RVV to the dispatch list because I expect them to be effective.
ocv_add_dispatched_file(write_mat_to_xrgb8888 SSE4_1 AVX2 AVX512_SKX NEON LASX RVV)
I looked at the cvtColor implementation and have a second idea: use cvtColor() with cv::COLOR_BGR2BGRA or cv::COLOR_GRAY2BGRA instead of this SIMD implementation. I'll try it.
Wayland expects [B8:G8:R8:X8], not [B8:G8:R8:A8]. I thought it would be hard to extend cvtColor() to support RGBX for this backend only.
But I noticed that the X channel is not used, so there is no problem even if it holds an opaque alpha value. So we can use the COLOR_BGR2BGRA option for this purpose.
Going through cvtColor() gives us the performance improvements provided by OpenCL, IPP, and multithreading. Furthermore, the maintainability of the code is also improved.
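To make the "X is ignored" argument concrete, here is a small scalar model of the idea, assuming 8-bit data, where a BGR to BGRA conversion fills the added channel with 255. It does not call OpenCV; `bgr_to_bgra()` and `xrgb8888_at()` are hypothetical helpers written only for this illustration. The second helper reads a pixel the way an XRGB8888 consumer would, as a little-endian 32-bit word with X in the top byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of an 8-bit BGR -> BGRA conversion: the fourth byte of every
// pixel is set to 255 (fully opaque).
static std::vector<std::uint8_t> bgr_to_bgra(const std::vector<std::uint8_t>& bgr)
{
    assert(bgr.size() % 3 == 0);
    const std::size_t pixels = bgr.size() / 3;
    std::vector<std::uint8_t> bgra(pixels * 4);
    for (std::size_t p = 0; p < pixels; ++p)
    {
        bgra[p * 4 + 0] = bgr[p * 3 + 0]; // B
        bgra[p * 4 + 1] = bgr[p * 3 + 1]; // G
        bgra[p * 4 + 2] = bgr[p * 3 + 2]; // R
        bgra[p * 4 + 3] = 255;            // A, read as X by the consumer
    }
    return bgra;
}

// Reads pixel `p` as an XRGB8888 word: bits 31..24 = X, 23..16 = R,
// 15..8 = G, 7..0 = B. Built from bytes, so it does not depend on host
// endianness.
static std::uint32_t xrgb8888_at(const std::vector<std::uint8_t>& bgra, std::size_t p)
{
    return  static_cast<std::uint32_t>(bgra[p * 4 + 0])
         | (static_cast<std::uint32_t>(bgra[p * 4 + 1]) << 8)
         | (static_cast<std::uint32_t>(bgra[p * 4 + 2]) << 16)
         | (static_cast<std::uint32_t>(bgra[p * 4 + 3]) << 24);
}
```

Masking the word with `0x00FFFFFF` gives the same R, G, and B bits whatever the top byte holds, which is why an opaque alpha byte is harmless for this format.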
Close #25550
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request