opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

core (opencl): CLBlast integration via dynamic loading #25568

Open fengyuentau opened 1 month ago

fengyuentau commented 1 month ago

The second commit contains only auto-generated code.

Usage

Get CLBlast:

git clone https://github.com/CNugteren/CLBlast
cmake -B build -S CLBlast -DCMAKE_INSTALL_PREFIX=build/install
cmake --build build --target install -j8

Test with this patch:

git clone https://github.com/fengyuentau/opencv
cd opencv
git checkout clblast_integration

export CLBLAST_INSTALL_DIR=/abs/path/to/CLBLAST-build/install
cmake -B build -DWITH_OPENCL=ON .
cmake --build build --target opencv_test_core opencv_perf_core -j8

export LD_LIBRARY_PATH=/abs/path/to/CLBLAST-build/install/lib # Use DYLD_LIBRARY_PATH on macOS
./build/bin/opencv_test_core --gtest_filter="*OCL_*Gemm*"
./build/bin/opencv_perf_core --gtest_filter="*OCL_GemmFixture_Gemm*"

Performance

Usage example:

python opencv/modules/ts/misc/summary.py opencv_perf_core.gtx1080ti.xml opencv_perf_core.gtx1080ti.clblast.xml

Khadas VIM4 (8GB mem, 32GB disk space) with Mali G52 r1p0

Geometric mean (ms)

                        Name of Test                            opencv            opencv                opencv
                                                                 perf              perf                  perf
                                                             core.mali-g52 core.mali-g52.clblast core.mali-g52.clblast
                                                                                                          vs
                                                                                                        opencv
                                                                                                         perf
                                                                                                     core.mali-g52
                                                                                                      (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      40.510            24.351                 1.66
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      99.486            160.065                0.62
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               42.447            23.615                 1.80
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               102.918           100.531                1.02
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      43.153            24.388                 1.77
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      103.611           99.365                 1.04
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               42.378            25.265                 1.68
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               102.890           156.734                0.66
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      38.043            21.727                 1.75
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      94.870            150.405                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               36.955            21.274                 1.74
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               93.829            153.478                0.61
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    290.018           147.040                1.97
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                    776.815           592.293                1.31
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)             294.465           146.519                2.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)             784.642           588.987                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)    295.935           145.909                2.03
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)    788.559           590.310                1.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)             294.613           148.811                1.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)             784.563           594.052                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)    280.617           137.701                2.04
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)    758.959           571.672                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             278.827           136.011                2.05
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)             755.590           567.531                1.33
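
The x-factor column in these tables appears to be the baseline time divided by the CLBlast time, so values above 1.00 mean the CLBlast build is faster. A minimal Python sketch that reproduces a few x-factors from the Mali G52 rows above (the helper names `x_factor` and `geomean` are illustrative and are not part of `summary.py`):

```python
import math

# (baseline ms, CLBlast ms) copied from three Mali G52 rows above.
ROWS = {
    "640x640, 0, 32FC1": (40.510, 24.351),
    "640x640, 0, 32FC2": (99.486, 160.065),
    "1280x1280, 0, 32FC1": (290.018, 147.040),
}


def x_factor(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline; > 1 means the candidate is faster."""
    return baseline_ms / candidate_ms


def geomean(values):
    """Geometric mean, a common way to average ratios such as speedups."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


factors = {name: x_factor(b, c) for name, (b, c) in ROWS.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}")
print(f"geometric mean over these rows: {geomean(factors.values()):.2f}")
```

The 640x640 32FC2 row comes out below 1, which matches the slowdown visible in the table for that configuration.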

Macbook Air M1 (16GB mem, 512GB disk space)

Accuracy problem with scale >= 1280, but it is OK with scale = 1024.

Geometric mean (ms)

                        Name of Test                         opencv      opencv          opencv
                                                              perf        perf            perf
                                                             core.m1 core.m1.clblast core.m1.clblast
                                                                                           vs
                                                                                         opencv
                                                                                          perf
                                                                                         core.m1
                                                                                       (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                    2.248       2.257           1.00
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                    9.272       9.889           0.94
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)             2.438       2.714           0.90
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)             9.434       9.708           0.97
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)    2.910       2.764           1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)   10.068       8.795           1.14
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)             2.585       2.812           0.92
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)             9.563       9.202           1.04
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)    2.756       2.568           1.07
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)    9.506       9.080           1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)             2.887       2.640           1.09
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)             9.897       9.642           1.03
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                 25.201      23.861           1.06
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                 107.464     107.136          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)          26.242      26.826           0.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)          108.138     108.599          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1) 27.284      27.497           0.99
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2) 107.704     108.396          0.99
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)          26.712      26.136           1.02
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)          108.275     108.282          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1) 26.257      27.556           0.95
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2) 109.048     109.098          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)          25.408      25.929           0.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)          108.337     107.886          1.00

PC with i7-12700K (64GB mem, 1T disk space) with Intel(R) UHD Graphics 770

Accuracy problem with complex (type CV_32FC2).

Geometric mean (ms)

                        Name of Test                           opencv          opencv              opencv
                                                                perf            perf                perf
                                                             core.uhd770 core.uhd770.clblast core.uhd770.clblast
                                                                                                     vs
                                                                                                   opencv
                                                                                                    perf
                                                                                                 core.uhd770
                                                                                                 (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      1.191           1.185               1.01
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      9.739           9.740               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               1.522           1.525               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               9.859           9.851               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      3.854           3.866               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      9.948           9.919               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               1.522           1.522               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               9.863           9.803               1.01
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      1.536           1.529               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      9.819           9.810               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               1.177           1.178               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               9.735           9.735               1.00
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    9.300           9.314               1.00
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                   77.424          77.427               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)            12.225          12.245               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)            78.307          78.342               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)   30.315          30.172               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)   78.971          79.028               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)            11.066          10.987               1.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)            78.249          78.211               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)   11.065          11.014               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)   78.147          78.144               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             9.368           9.342               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)            77.360          77.323               1.00

PC with GTX 1080 Ti (12GB gpu mem, CUDA 12.3)

Geometric mean (ms)

                        Name of Test                             opencv             opencv                 opencv
                                                                  perf               perf                   perf
                                                             core.gtx1080ti core.gtx1080ti.clblast core.gtx1080ti.clblast
                                                                                                             vs
                                                                                                           opencv
                                                                                                            perf
                                                                                                       core.gtx1080ti
                                                                                                         (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                       0.338              0.310                   1.09
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                       0.650              0.483                   1.34
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)                0.443              0.308                   1.44
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)                0.822              0.484                   1.70
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)       0.545              0.287                   1.90
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)       0.976              0.517                   1.89
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)                0.435              0.292                   1.49
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)                0.819              0.499                   1.64
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)       0.399              0.294                   1.35
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)       0.756              0.503                   1.50
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)                0.337              0.309                   1.09
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)                0.659              0.482                   1.37
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                     2.211              1.349                   1.64
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                     4.375              3.551                   1.23
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)              2.413              0.979                   2.47
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)              4.838              3.054                   1.58
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)     2.531              1.203                   2.10
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)     5.295              3.501                   1.51
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)              2.352              1.380                   1.70
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)              4.795              3.876                   1.24
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)     2.294              1.391                   1.65
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)     4.892              3.904                   1.25
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)              2.031              1.202                   1.69
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)              4.235              3.545                   1.19

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

force_builders=Linux OpenCL

fengyuentau commented 1 month ago

Observed problems:

  1. On Intel i7-12700K with Intel(R) UHD Graphics 770: CLBlast has an accuracy problem with complex data (type CV_32FC2).
  2. On Apple M1: CLBlast has an accuracy problem if scale >= 1280, but it is OK with scale = 1024.
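
One way to detect accuracy problems like these is to compare the accelerated GEMM output against a plain reference product under a relative tolerance. The pure-Python sketch below is only an illustration: `gemm_ref`, `max_rel_error`, and the `1e-4` tolerance are assumptions and do not reproduce OpenCV's actual test code. Python's `complex` type stands in for CV_32FC2 elements.

```python
def gemm_ref(a, b):
    """Naive reference product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def max_rel_error(result, reference):
    """Largest absolute difference, scaled by the largest reference magnitude."""
    diff = max(abs(r - e) for rr, er in zip(result, reference)
               for r, e in zip(rr, er))
    scale = max(abs(e) for er in reference for e in er)
    return diff / scale if scale else diff


# Complex entries stand in for CV_32FC2 data.
a = [[1 + 2j, 3 - 1j], [0.5j, 2 + 0j]]
b = [[2 - 1j, 0j], [1j, 1 + 1j]]
reference = gemm_ref(a, b)

# Hypothetical "device" output: the reference with one corrupted element.
suspect = [row[:] for row in reference]
suspect[0][0] += 0.5

TOLERANCE = 1e-4  # assumed threshold for this sketch
print("reference ok:", max_rel_error(reference, reference) <= TOLERANCE)
print("suspect ok:  ", max_rel_error(suspect, reference) <= TOLERANCE)
```

Scaling by the largest reference magnitude keeps the check meaningful as the matrix size (and therefore the magnitude of the sums) grows.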

vpisarev commented 1 month ago

@fengyuentau, from the patch I conclude that we need only a small portion of CLBlast. Can we extract a subset of CLBlast, put it into opencv/3rdparty, and link it to OpenCV (i.e. not use dynamic loading, which is much less convenient for end users)? Also, I believe we need to solve the problems on Mac and Intel somehow. I remember you said (and I also see from the performance charts) that the current Intel version of gemm in OpenCV is faster than CLBlast; maybe we should keep the Intel version.

asmorkalov commented 1 month ago

@fengyuentau Thanks a lot for the effort! The PR was discussed at the OpenCV Core team meeting, and the conclusion is the following:

fengyuentau commented 1 month ago

> we have troubles with the most popular platforms: Intel and Apple ARM.

I have run several tests on the CLBlast accuracy problem. It turns out that CLBlast with the tuning results for these platforms gives incorrect results; after reverting those tuning results, it gives correct results. See my repo for testing: https://github.com/fengyuentau/test-clblast.

asmorkalov commented 2 weeks ago

@fengyuentau What is the PR status? What are the next steps here?

fengyuentau commented 1 week ago

> we have troubles with the most popular platforms: Intel and Apple ARM.

I have run several tests on the CLBlast accuracy problem. It turns out that CLBlast with the tuning results for these platforms gives incorrect results; after reverting those tuning results, it gives correct results. See my repo for testing: https://github.com/fengyuentau/test-clblast.

Upstream has fixed the accuracy problem both on Intel GPU and Apple M1. Performance results are updated.

fengyuentau commented 1 week ago

> @fengyuentau What is the PR status? What are the next steps here?

@asmorkalov We may need to discuss once more whether the integration should use dynamic loading, since the library itself is updated quite often with tuned parameters for different platforms. It has a stable API, and if the integration is done via dynamic loading, users only need to upgrade CLBlast and do not need to rebuild OpenCV.
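
The trade-off described here can be illustrated with a small runtime-loading sketch. It uses Python's `ctypes` purely for illustration (the PR itself is C++). `try_load` and `pick_gemm_backend` are hypothetical helpers, and `CLBlastSgemm` is assumed to be a symbol exported by CLBlast's C API. The library is looked up when the program runs, so upgrading the shared library needs no rebuild, and a missing library falls back to the built-in path.

```python
import ctypes
import ctypes.util


def try_load(name):
    """Return a handle to the shared library `name`, or None if it cannot be loaded."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def pick_gemm_backend(lib_name="clblast", symbol="CLBlastSgemm"):
    """Choose "clblast" only if the library loads and exports the assumed symbol."""
    lib = try_load(lib_name)
    # hasattr triggers a symbol lookup; a missing symbol raises AttributeError.
    if lib is not None and hasattr(lib, symbol):
        return "clblast"
    return "builtin"


# Prints "clblast" when the library is discoverable at run time, else "builtin".
print(pick_gemm_backend())
```

How `find_library` searches is platform dependent, so this sketch may not pick up a library that is only reachable through `LD_LIBRARY_PATH` or `DYLD_LIBRARY_PATH`.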

fengyuentau commented 6 days ago

Decided to drop dynamic loading. I will submit a new pull request to build OpenCV with CLBlast.