opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

3rdparty: NDSRVP - A New 3rdparty Library with Optimizations Based on RISC-V P Extension v0.5.2 - Part 1: Basic Functions #25167

Open Junyan721113 opened 2 months ago

Junyan721113 commented 2 months ago

Summary

Previous context

From PR #24556:

As a test outside of this PR, a 3rdparty component called ndsrvp was created, containing one piece of the non-dnn code (integral_SIMD), and it works very well. All the non-dnn code in this PR has been removed, so this PR can now focus on dnn optimizations. The HAL mechanism is quite suitable for rvp optimizations, and all the non-dnn code is expected to be moved into ndsrvp soon.
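
To make the HAL mechanism mentioned above more concrete, here is a minimal, self-contained sketch of the usual fallback convention: an accelerated function reports whether it handled the call, and the generic code runs otherwise. The names `HAL_OK`, `HAL_NOT_IMPLEMENTED`, `fast_absdiff8u` and `absdiff8u` are hypothetical stand-ins for illustration (OpenCV itself uses `CV_HAL_ERROR_OK` and `CV_HAL_ERROR_NOT_IMPLEMENTED`); this is not the ndsrvp code.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical status codes mirroring the "handled / not handled" convention.
constexpr int HAL_OK = 0;
constexpr int HAL_NOT_IMPLEMENTED = 1;

// Hypothetical accelerated entry point. As an illustrative restriction it only
// accepts widths that are a multiple of 8 and declines everything else.
int fast_absdiff8u(const std::uint8_t* a, const std::uint8_t* b,
                   std::uint8_t* dst, int width)
{
    if (width % 8 != 0)
        return HAL_NOT_IMPLEMENTED;
    for (int x = 0; x < width; ++x)
        dst[x] = static_cast<std::uint8_t>(std::abs(int(a[x]) - int(b[x])));
    return HAL_OK;
}

// Generic caller: try the accelerated path first, fall back to plain code.
// Returns true when the accelerated path handled the call.
bool absdiff8u(const std::uint8_t* a, const std::uint8_t* b,
               std::uint8_t* dst, int width)
{
    if (fast_absdiff8u(a, b, dst, width) == HAL_OK)
        return true;
    for (int x = 0; x < width; ++x)
        dst[x] = static_cast<std::uint8_t>(a[x] > b[x] ? a[x] - b[x] : b[x] - a[x]);
    return false;
}
```

The point of the design is that a 3rdparty library can cover only the cases it accelerates well and decline the rest without changing results.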

Progress

Part 1 (This PR)

Part 2 (Next PR)

Rough Estimate. Todo List May Change.

Performance Tests

The optimization does not contain floating-point operations.

Absolute Difference

Geometric mean (ms)

Name of Test | opencv perf core Absdiff (first run) | opencv perf core Absdiff (second run) | second run vs first run (x-factor)
Absdiff::OCL_AbsDiffFixture::(640x480, 8UC1) 23.104 5.972 3.87
Absdiff::OCL_AbsDiffFixture::(640x480, 32FC1) 39.500 40.830 0.97
Absdiff::OCL_AbsDiffFixture::(640x480, 8UC3) 69.155 15.051 4.59
Absdiff::OCL_AbsDiffFixture::(640x480, 32FC3) 118.715 120.509 0.99
Absdiff::OCL_AbsDiffFixture::(640x480, 8UC4) 93.001 19.770 4.70
Absdiff::OCL_AbsDiffFixture::(640x480, 32FC4) 161.136 160.791 1.00
Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC1) 69.211 15.140 4.57
Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC1) 118.762 119.263 1.00
Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC3) 212.414 44.692 4.75
Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC3) 367.512 366.569 1.00
Absdiff::OCL_AbsDiffFixture::(1280x720, 8UC4) 285.337 59.708 4.78
Absdiff::OCL_AbsDiffFixture::(1280x720, 32FC4) 490.395 491.118 1.00
Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC1) 158.827 33.462 4.75
Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC1) 273.503 273.668 1.00
Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC3) 484.175 100.520 4.82
Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC3) 828.758 829.689 1.00
Absdiff::OCL_AbsDiffFixture::(1920x1080, 8UC4) 648.592 137.195 4.73
Absdiff::OCL_AbsDiffFixture::(1920x1080, 32FC4) 1116.755 1109.587 1.01
Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC1) 648.715 134.875 4.81
Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC1) 1115.939 1113.818 1.00
Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC3) 1944.791 413.420 4.70
Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC3) 3354.193 3324.672 1.01
Absdiff::OCL_AbsDiffFixture::(3840x2160, 8UC4) 2594.585 553.486 4.69
Absdiff::OCL_AbsDiffFixture::(3840x2160, 32FC4) 4473.543 4438.453 1.01
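
As a reading aid for the table above: the x-factor column is consistent with dividing the first timing by the second (for example, 23.104 / 5.972 ≈ 3.87), so values above 1 mean the second run was faster. The sketch below shows that ratio and a geometric mean over timing samples; the helper names `x_factor` and `geometric_mean` are illustrative and not part of OpenCV's perf tooling.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Ratio of the first timing to the second; above 1 means the second run was faster.
double x_factor(double first_ms, double second_ms)
{
    assert(second_ms > 0.0);
    return first_ms / second_ms;
}

// Geometric mean of strictly positive samples, accumulated in log space.
double geometric_mean(const std::vector<double>& samples)
{
    assert(!samples.empty());
    double log_sum = 0.0;
    for (double s : samples) {
        assert(s > 0.0);
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(samples.size()));
}
```

Applied to the 32F rows, the same ratio stays close to 1.00, which matches the note that the optimization does not touch floating-point code.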

Bitwise Operation

Geometric mean (ms)

Name of Test | opencv perf core Bit (first run) | opencv perf core Bit (second run) | second run vs first run (x-factor)
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC1) 22.542 4.971 4.53
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC1) 90.210 19.917 4.53
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC3) 68.429 15.037 4.55
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC3) 280.168 59.239 4.73
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 8UC4) 90.565 19.735 4.59
Bitwise_and::OCL_BitwiseAndFixture::(640x480, 32FC4) 374.695 79.257 4.73
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC1) 67.824 14.873 4.56
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC1) 279.514 59.232 4.72
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC3) 208.337 44.234 4.71
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC3) 851.211 182.522 4.66
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 8UC4) 279.529 59.095 4.73
Bitwise_and::OCL_BitwiseAndFixture::(1280x720, 32FC4) 1132.065 244.877 4.62
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC1) 155.685 33.078 4.71
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC1) 635.253 137.482 4.62
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC3) 474.494 100.166 4.74
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC3) 1907.340 412.841 4.62
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 8UC4) 635.538 134.544 4.72
Bitwise_and::OCL_BitwiseAndFixture::(1920x1080, 32FC4) 2552.666 556.397 4.59
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC1) 634.736 136.355 4.66
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC1) 2548.283 561.827 4.54
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC3) 1911.454 421.571 4.53
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC3) 7663.803 1677.289 4.57
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 8UC4) 2543.983 562.780 4.52
Bitwise_and::OCL_BitwiseAndFixture::(3840x2160, 32FC4) 10211.693 2237.393 4.56
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC1) 22.341 4.811 4.64
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC1) 89.975 19.288 4.66
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC3) 67.237 14.643 4.59
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC3) 276.324 58.609 4.71
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 8UC4) 89.587 19.554 4.58
Bitwise_not::OCL_BitwiseNotFixture::(640x480, 32FC4) 370.986 77.136 4.81
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC1) 67.227 14.541 4.62
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC1) 276.357 58.076 4.76
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC3) 206.752 43.376 4.77
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC3) 841.638 177.787 4.73
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 8UC4) 276.773 57.784 4.79
Bitwise_not::OCL_BitwiseNotFixture::(1280x720, 32FC4) 1127.740 237.472 4.75
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC1) 153.808 32.531 4.73
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC1) 627.765 129.990 4.83
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC3) 469.799 98.249 4.78
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC3) 1893.591 403.694 4.69
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 8UC4) 627.724 129.962 4.83
Bitwise_not::OCL_BitwiseNotFixture::(1920x1080, 32FC4) 2529.967 540.744 4.68
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC1) 628.089 130.277 4.82
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC1) 2521.817 540.146 4.67
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC3) 1905.004 404.704 4.71
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC3) 7567.971 1627.898 4.65
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 8UC4) 2531.476 540.181 4.69
Bitwise_not::OCL_BitwiseNotFixture::(3840x2160, 32FC4) 10075.594 2181.654 4.62
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC1) 22.566 5.076 4.45
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC1) 90.391 19.928 4.54
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC3) 67.758 14.740 4.60
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC3) 279.253 59.844 4.67
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 8UC4) 90.296 19.802 4.56
Bitwise_or::OCL_BitwiseOrFixture::(640x480, 32FC4) 373.972 79.815 4.69
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC1) 67.815 14.865 4.56
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC1) 279.398 60.054 4.65
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC3) 208.643 45.043 4.63
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC3) 850.042 180.985 4.70
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 8UC4) 279.363 60.385 4.63
Bitwise_or::OCL_BitwiseOrFixture::(1280x720, 32FC4) 1134.858 243.062 4.67
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC1) 155.212 33.155 4.68
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC1) 634.985 134.911 4.71
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC3) 474.648 100.407 4.73
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC3) 1912.049 414.184 4.62
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 8UC4) 635.252 132.587 4.79
Bitwise_or::OCL_BitwiseOrFixture::(1920x1080, 32FC4) 2544.471 560.737 4.54
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC1) 634.574 134.966 4.70
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC1) 2545.129 561.498 4.53
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC3) 1910.900 419.365 4.56
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC3) 7662.603 1685.812 4.55
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 8UC4) 2548.971 560.787 4.55
Bitwise_or::OCL_BitwiseOrFixture::(3840x2160, 32FC4) 10201.407 2237.552 4.56
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC1) 22.718 4.961 4.58
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC1) 91.496 19.831 4.61
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC3) 67.910 15.151 4.48
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC3) 279.612 59.792 4.68
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 8UC4) 91.073 19.853 4.59
Bitwise_xor::OCL_BitwiseXorFixture::(640x480, 32FC4) 374.641 79.155 4.73
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC1) 67.704 15.008 4.51
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC1) 279.229 60.088 4.65
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC3) 208.156 44.426 4.69
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC3) 849.501 180.848 4.70
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 8UC4) 279.642 59.728 4.68
Bitwise_xor::OCL_BitwiseXorFixture::(1280x720, 32FC4) 1129.826 242.880 4.65
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC1) 155.585 33.354 4.66
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC1) 634.090 134.995 4.70
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC3) 474.931 99.598 4.77
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC3) 1910.519 413.138 4.62
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 8UC4) 635.026 135.155 4.70
Bitwise_xor::OCL_BitwiseXorFixture::(1920x1080, 32FC4) 2560.167 560.838 4.56
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC1) 634.893 134.883 4.71
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC1) 2548.166 560.831 4.54
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC3) 1911.392 419.816 4.55
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC3) 7646.634 1677.988 4.56
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 8UC4) 2560.637 560.805 4.57
Bitwise_xor::OCL_BitwiseXorFixture::(3840x2160, 32FC4) 10227.044 2249.458 4.55
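
Unlike Absdiff, the bitwise tests show a similar speedup for 32F inputs as for 8U inputs. One plausible reading is that bitwise operations act on raw bytes regardless of element type, so an integer-only packed implementation covers them too. The sketch below illustrates that idea in portable C++ by processing 8 bytes per step; `and_bytes` is a hypothetical name and this is not the ndsrvp implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Bitwise AND over raw bytes, 8 bytes per step, with a scalar tail.
// memcpy performs the loads and stores so that no type-punned pointer
// is ever dereferenced.
void and_bytes(const unsigned char* a, const unsigned char* b,
               unsigned char* dst, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        const std::uint64_t r = wa & wb;
        std::memcpy(dst + i, &r, 8);
    }
    for (; i < n; ++i)
        dst[i] = static_cast<unsigned char>(a[i] & b[i]);
}
```

Because the function only sees bytes, the caller can pass 8U, 32F or any other element type by giving the row length in bytes.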

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

asmorkalov commented 2 months ago

@Junyan721113 Could you add more details on hardware configuration and how to reproduce the result?

Junyan721113 commented 2 months ago

@Junyan721113 Could you add more details on hardware configuration and how to reproduce the result?

No problem, here are the details for accuracy and performance tests.

RISC-V P Extension v0.5.2

Env

export RISCV=/opt/andes
export PATH=$PATH:/opt/andes/bin

Toolchain

Prebuilt Releases: Andes-Development-Kit

Suggested Version: v5_1_1

nds-gnu-toolchain

./build_linux_toolchain.sh

TARGET=riscv64-linux
PREFIX=/opt/andes
ARCH=rv64imafdcxandes
ABI=lp64d
CPU=andes-25-series
XLEN=64
BUILD=`pwd`/build-nds64le-linux-glibc-v5d

Qemu

qemu

shell ./build

../configure --prefix=/opt/andes --target-list=riscv32-linux-user,riscv64-linux-user --disable-werror --static

Board

The development board used for performance tests is TinkerV with Andes AX45.

Upload the installed toolchain's sysroot (at /opt/andes/sysroot), or the one from the prebuilt releases above, to the board.

/etc/ld.so.conf

include /etc/ld.so.conf.d/*.conf
/path/to/the/sysroot/library

shell

ldconfig -v

After that, the sysroot library path should appear in the output.

OpenCV Test

shell ./build

cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_INSTALL_PREFIX=/opt/andes -D BUILD_SHARED_LIBS=OFF -D CMAKE_TOOLCHAIN_FILE=../platforms/linux/riscv64-andes-gcc.toolchain.cmake ..

Qemu

shell ./build/bin

qemu-riscv64 -cpu andes-ax25 -L /opt/andes/sysroot opencv_test_core

Board

Upload the test binary to the board and run it directly; it should run properly.

Junyan721113 commented 1 month ago

Considering the Todo List of this PR might be too long, would it be better to divide this PR into smaller ones?

mshabunin commented 1 month ago

@Junyan721113, you can more or less finalize the current state (HAL integration, implementation of several core functions) and extend the list of supported functions in future PRs.

Junyan721113 commented 3 weeks ago

Considering the relation between HAL functions, this PR might be ready for review now. The optimizations mainly cover the following functions:

The rest of the HAL functions are related to convolution and are therefore left for another PR.

Junyan721113 commented 3 weeks ago

Besides, I've noticed that some optimizations could be better if several required functions were also exposed through the HAL interface, such as:

Meanwhile, I wonder how the HAL interface will change in the upcoming OpenCV 5.0. The changes may affect the next PR related to this 3rdparty library.

asmorkalov commented 3 weeks ago

@Junyan721113 Thanks a lot for the contribution!

Junyan721113 commented 3 weeks ago

@Junyan721113 Thanks a lot for the contribution!

  • AutoBuffer may be achieved by a simple combination of new and malloca. Not sure if we need to expose it.
  • Remap was added to HAL interface a week ago: New HAL API for remap #25399. You are welcome to contribute RISC-V implementation.

Thank you! This helps me a lot.
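
For context on the AutoBuffer remark quoted above, here is a minimal sketch of a scratch buffer that serves small requests from in-object storage and falls back to `new[]` for larger ones. It uses a fixed member array rather than a stack allocator, and `ScratchBuffer` is a hypothetical name, not an OpenCV or ndsrvp type.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical scratch buffer: in-object storage for small requests,
// heap allocation (new[]) for larger ones.
template <typename T, std::size_t FixedSize = 256>
class ScratchBuffer
{
public:
    explicit ScratchBuffer(std::size_t count)
        : size_(count),
          heap_(count > FixedSize ? new T[count] : nullptr),
          ptr_(heap_ ? heap_.get() : fixed_)
    {
    }

    // Copying would leave ptr_ pointing into another object's storage.
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;

    T* data() { return ptr_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }

private:
    std::size_t size_;
    std::unique_ptr<T[]> heap_;  // owns the heap block, if any
    T* ptr_;                     // points at heap_ or fixed_
    T fixed_[FixedSize];         // uninitialized small-buffer storage
};
```

A per-row temporary in a HAL function could then be declared as `ScratchBuffer<int, 1024> tmp(width);` without a heap allocation for typical widths.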

Junyan721113 commented 3 weeks ago

@Junyan721113 Thanks a lot for the contribution!

  • AutoBuffer may be achieved by a simple combination of new and malloca. Not sure if we need to expose it.
  • Remap was added to HAL interface a week ago: New HAL API for remap #25399. You are welcome to contribute RISC-V implementation.

The mentioned PR contains cv_hal_remap32f. How about adding cv_hal_remap8u, cv_hal_remap8s, cv_hal_remap16u and cv_hal_remap16s? A float32 interface might not be helpful for RVP.

@Junyan721113, you can more or less finalize the current state (HAL integration, implementation of several core functions) and extend the list of supported functions in future PRs.

Meanwhile, the to-do list of "Part 1" is finished; other new features will go into "Part 2". This PR is ready for review now.

asmorkalov commented 3 weeks ago

The 32f suffix means that mapx and mapy are floats rather than fixed point. The source and destination may be any OpenCV-supported type. Sorry for the confusion.
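
To illustrate the distinction, here is a small sketch in which the maps are float while the image data is 8-bit. It uses nearest-neighbour sampling and zero fill for out-of-range coordinates; `remap_nearest_8u` and its parameters are hypothetical and do not reproduce the signature of cv_hal_remap32f.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-neighbour remap of an 8-bit single-channel image using float maps.
// Destination pixels whose source coordinates fall outside the image stay 0.
std::vector<std::uint8_t> remap_nearest_8u(const std::vector<std::uint8_t>& src,
                                           int src_w, int src_h,
                                           const std::vector<float>& mapx,
                                           const std::vector<float>& mapy,
                                           int dst_w, int dst_h)
{
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dst_w) * dst_h, 0);
    for (int y = 0; y < dst_h; ++y) {
        for (int x = 0; x < dst_w; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * dst_w + x;
            const long sx = std::lround(mapx[i]);  // float map -> integer column
            const long sy = std::lround(mapy[i]);  // float map -> integer row
            if (sx >= 0 && sx < src_w && sy >= 0 && sy < src_h)
                dst[i] = src[static_cast<std::size_t>(sy) * src_w + sx];
        }
    }
    return dst;
}
```

Only the coordinate lookup touches floating point here; the pixel copy itself is integer work.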

mshabunin commented 2 weeks ago

Currently there are several warnings regarding strict aliasing in the new HAL library (warpAffine and warpPerspective). Are they serious issues or not? Can we somehow avoid these constructions (maybe with some reinterpret intrinsics)?

/work/opencv/3rdparty/ndsrvp/src/warpAffine.cpp: In member function 'virtual void cv::NdsrvpWarpAffineInvoker::operator()(const cv::Range&) const':
/opencv/3rdparty/ndsrvp/src/warpAffine.cpp:58:76: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
   58 |                             *(uint16x4_t*)(xy + x1 * 2) = __nds__v_pkbb16(*(uint16x4_t*)&vY, *(uint16x4_t*)&vX);
      |                                                                            ^~~~~~~~~~~~~~~~
/opencv/3rdparty/ndsrvp/src/warpAffine.cpp:58:95: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
   58 |                             *(uint16x4_t*)(xy + x1 * 2) = __nds__v_pkbb16(*(uint16x4_t*)&vY, *(uint16x4_t*)&vX);
      |                                                                                               ^~~~~~~~~~~~~~~~
/opencv/3rdparty/ndsrvp/src/warpAffine.cpp:82:76: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
   82 |                             *(uint16x4_t*)(xy + x1 * 2) = __nds__v_pkbb16(*(uint16x4_t*)&vy, *(uint16x4_t*)&vx);
      |                                                                            ^~~~~~~~~~~~~~~~
/opencv/3rdparty/ndsrvp/src/warpAffine.cpp:82:95: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
   82 |                             *(uint16x4_t*)(xy + x1 * 2) = __nds__v_pkbb16(*(uint16x4_t*)&vy, *(uint16x4_t*)&vx);
      |                                                                                               ^~~~~~~~~~~~~~~~

Junyan721113 commented 1 week ago

Currently there are several warnings regarding strict aliasing in the new HAL library (warpAffine and warpPerspective). Are they serious issues or not? Can we somehow avoid these constructions (maybe with some reinterpret intrinsics)?

It was a mistake. They've been replaced with safer explicit type conversions.
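
The thread does not show the actual fix, so the following is only a sketch of one common way to avoid dereferencing a type-punned pointer: copy the object representation with `memcpy`. The struct `u16x4` is a hypothetical stand-in for a packed 4 x 16-bit vector type, and `as_u16x4` / `as_u64` are illustrative helper names.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 4 x 16-bit packed vector type.
struct u16x4 { std::uint16_t lane[4]; };

// Reinterpret the bits of a 64-bit integer as four 16-bit lanes.
// memcpy copies the object representation, so no pointer to an
// incompatible type is dereferenced.
u16x4 as_u16x4(std::uint64_t bits)
{
    static_assert(sizeof(u16x4) == sizeof(std::uint64_t), "size mismatch");
    u16x4 v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Inverse conversion: pack four 16-bit lanes back into a 64-bit integer.
std::uint64_t as_u64(const u16x4& v)
{
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return bits;
}
```

Compilers typically reduce such fixed-size copies to a plain register move, so this form usually costs nothing compared with the pointer cast.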

Junyan721113 commented 1 day ago

Strict-aliasing warnings have been fixed. Are there any other suggested changes?