opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

dnn: parallelize nary elementwise forward implementation & enable related conformance tests #25630

Open fengyuentau opened 1 month ago

fengyuentau commented 1 month ago

This PR introduces the following changes:

Performance

i7-12700K, RAM 64GB, Ubuntu 22.04

Geometric mean (ms)

                Name of Test                     opencv        opencv        opencv
                                                  perf          perf          perf
                                              core.x64.0606 core.x64.0606 core.x64.0606
                                                                               vs
                                                                             opencv
                                                                              perf
                                                                          core.x64.0606
                                                                           (x-factor)
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           16.116        11.161         1.44
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        17.469        11.446         1.53
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        17.531        11.469         1.53
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      28.653        13.682         2.09
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    21.899        13.422         1.63
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       21.738        13.185         1.65
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        16.172        11.473         1.41
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       16.309        11.565         1.41
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        16.166        11.454         1.41
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        16.157        11.443         1.41
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU        163.459       15.234         10.73
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    10.880        10.868         1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    10.947        11.058         0.99
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    10.948        10.910         1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    10.874        10.871         1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    10.971        10.920         1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        17.546        11.462         1.53
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        16.175        11.475         1.41
NHWC_C::Layer_NaryEltwise::OCV/CPU               11.339        11.333         1.00
NHWC_H::Layer_NaryEltwise::OCV/CPU               16.154        11.102         1.46
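
The x-factor column in the table above matches the first timing column divided by the second, so values above 1 mean the second build is faster. A minimal sketch of that arithmetic, using three rows copied from the i7-12700K table (the helper name `x_factor` is illustrative, not part of OpenCV's tooling):

```python
def x_factor(baseline_ms: float, patched_ms: float) -> float:
    """Speedup of the second build: first-column time divided by second-column time."""
    return baseline_ms / patched_ms


# (test name, first column ms, second column ms) copied from the i7-12700K table
rows = [
    ("NCHW_C_sum", 16.116, 11.161),
    ("NCHW_NCHW_pow", 163.459, 15.234),
    ("NCHW_NCHW_ref_max", 10.947, 11.058),
]

for name, base, patched in rows:
    print(f"{name:20s} {x_factor(base, patched):.2f}")
```

Rounded to two decimals, these reproduce the 1.44, 10.73 and 0.99 entries of the table.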

Apple M1, RAM 16GB, macOS 14.4.1

Geometric mean (ms)

                Name of Test                     opencv          opencv             opencv      
                                                  perf            perf               perf       
                                              core.m1.0606 core.m1.0606.patch core.m1.0606.patch
                                                                                      vs        
                                                                                    opencv      
                                                                                     perf       
                                                                                 core.m1.0606   
                                                                                  (x-factor)    
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           28.418          3.768               7.54       
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        6.942           5.679               1.22       
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        5.822           5.653               1.03       
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      5.751           5.628               1.02       
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    5.797           5.599               1.04       
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       7.272           5.578               1.30       
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        5.777           5.562               1.04       
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       5.819           5.559               1.05       
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        5.830           5.574               1.05       
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        5.759           5.567               1.03       
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       342.260          74.655              4.58       
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    8.338           8.280               1.01       
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    8.359           8.309               1.01       
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    8.412           8.295               1.01       
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    8.380           8.297               1.01       
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    8.356           8.323               1.00       
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        6.818           5.561               1.23       
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        5.805           5.570               1.04       
NHWC_C::Layer_NaryEltwise::OCV/CPU               3.834           4.817               0.80       
NHWC_H::Layer_NaryEltwise::OCV/CPU               28.402          3.771               7.53
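
The tables are labelled "Geometric mean (ms)", which presumably means each cell aggregates the timing samples of one test with a geometric rather than an arithmetic mean. A small sketch of the difference, using hypothetical sample timings that are not taken from this PR:

```python
import math
from statistics import geometric_mean

# Hypothetical per-iteration timings in ms for one test case (not from the PR);
# the fourth sample stands in for an outlier caused by system noise.
samples_ms = [11.2, 11.4, 11.1, 19.8, 11.3]

gmean = geometric_mean(samples_ms)
amean = sum(samples_ms) / len(samples_ms)

# Equivalent definition: exponential of the mean of the logarithms.
gmean_manual = math.exp(sum(math.log(s) for s in samples_ms) / len(samples_ms))

print(f"geometric mean: {gmean:.3f} ms, arithmetic mean: {amean:.3f} ms")
```

The geometric mean is pulled less by the single slow sample than the arithmetic mean, which is one reason it is a common choice for noisy benchmark timings.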

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

asmorkalov commented 2 weeks ago

My results with Jetson TK1 (armv7+neon):

ubuntu@jetson1:~/Projects/perf-dnn$ python3 ../opencv/modules/ts/misc/summary.py ./4.x-1.xml ./patched-1.xml | grep NaryEltwise
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          65.891   43.371      1.52   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       79.287   81.868      0.97   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                      187.457   187.657     1.00   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     88.643   96.376      0.92   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   88.694   96.035      0.92   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      88.716   90.298      0.98   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       84.722   83.976      1.01   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      92.757   81.105      1.14   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       84.285   84.010      1.00   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       78.594   78.574      1.00   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      3407.037 3475.724     0.98   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                  189.651   189.454     1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   87.859   87.771      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   87.915   88.053      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   84.077   84.063      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   85.160   84.625      1.01   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       86.368   79.089      1.09   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       89.897   78.993      1.14   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              77.220   71.425      1.08   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              67.494   42.832      1.58
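
Rows like the ones above (test name, baseline ms, patched ms, x-factor) are easy to post-process when looking for regressions. The following is a rough sketch, assuming whitespace-separated columns as in the output above; the `parse_row` helper and the 0.95 threshold are illustrative choices, not part of `summary.py`:

```python
def parse_row(line: str):
    """Parse 'name baseline patched x-factor'; return None for non-data lines."""
    parts = line.split()
    if len(parts) != 4:
        return None
    try:
        base, patched = float(parts[1]), float(parts[2])
    except ValueError:
        return None
    # Recompute the factor from the timings instead of trusting the rounded column.
    return parts[0], base, patched, base / patched


# Three rows copied from the Jetson TK1 output above.
output = """\
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU     79.287   81.868      0.97
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU   88.643   96.376      0.92
NHWC_H::Layer_NaryEltwise::OCV/CPU            67.494   42.832      1.58
"""

THRESHOLD = 0.95  # hypothetical cut-off for "noticeably slower"
parsed = [row for row in map(parse_row, output.splitlines()) if row is not None]
regressions = [row for row in parsed if row[3] < THRESHOLD]

for name, base, patched, factor in regressions:
    print(f"{name}: {base} -> {patched} ms (x{factor:.2f})")
```

With this threshold only the `equal` row is flagged; the `add` row at about 0.97 falls within what could plausibly be measurement noise.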

asmorkalov commented 2 weeks ago

My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          24.193   17.846      1.36   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       24.026   23.313      1.03   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                       27.370   23.279      1.18   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     35.025   23.254      1.51   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   32.455   23.260      1.40   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      32.509   23.321      1.39   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       23.997   23.262      1.03   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      24.038   23.270      1.03   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       23.977   23.269      1.03   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       23.927   23.279      1.03   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      320.598   98.029      3.27   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                   24.507   24.488      1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   24.484   24.477      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   24.500   24.471      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   24.486   24.482      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   24.472   24.476      1.00   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       23.953   23.281      1.03   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       23.992   23.274      1.03   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              18.260   18.489      0.99   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              24.182   17.829      1.36

fengyuentau commented 2 weeks ago

Thank you @asmorkalov for adding more performance results :)

fengyuentau commented 2 weeks ago

Any review comments?

asmorkalov commented 1 week ago

The patch leads to a significant degradation of OpenCL pipelines, e.g.:

VIT_B_32::DNNTestNetwork::OCV/CPU   149.576     191.409     0.78
VIT_B_32::DNNTestNetwork::OCV/OCL   104.428     445.013     0.23
VIT_B_32::DNNTestNetwork::OCV/OCL_FP16  102.505     442.994     0.23 

I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or another inference optimization. I am looking into the details to check whether this is really caused by the PR.

fengyuentau commented 1 week ago

Ok, I will take a look at the problem.

fengyuentau commented 5 days ago

@asmorkalov The performance "degradation" is due to a very out-of-date code base (>450 commits behind 4.x). I have updated the code base. Performance tests (on Intel UHD 770) look fine on my side. Feel free to retest on yours.

On the positive side, those commits brought a substantial performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add an OCL backend for this layer later :)
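
The ~4x and ~1.3x figures can be checked against the VIT_B_32 numbers reported earlier in this thread, assuming the slower column was effectively measuring the stale base rather than the patch itself. A quick sketch of that arithmetic (the dictionary layout is only for illustration):

```python
# (up-to-date 4.x ms, stale-base build ms) from the VIT_B_32 rows reported earlier
vit_b_32 = {
    "OCV/CPU": (149.576, 191.409),
    "OCV/OCL": (104.428, 445.013),
    "OCV/OCL_FP16": (102.505, 442.994),
}

# How much faster the up-to-date code base is than the stale one.
gains = {backend: stale / current for backend, (current, stale) in vit_b_32.items()}

for backend, gain in gains.items():
    print(f"{backend:14s} {gain:.2f}x faster on the up-to-date code base")
```

These ratios are simply the reciprocals of the 0.78 and 0.23 x-factors in the earlier report.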

asmorkalov commented 15 hours ago

perf-dnn.zip

The OpenCL-related degradation has disappeared. Perf numbers for the updated PR on the Core i5-2500:

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU  24.142  17.999  1.34
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU   23.860  23.265  1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU   27.383  23.282  1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU     39.056  23.292  1.68
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU   32.489  23.290  1.39
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU  32.435  23.257  1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU   23.966  23.269  1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU  23.992  23.276  1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU   23.951  23.273  1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU   23.862  23.272  1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU   320.265     97.879  3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU   24.491  24.487  1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU   24.463  24.464  1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU   24.472  24.465  1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU   24.460  24.453  1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU   24.463  24.530  1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU   23.870  23.271  1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU   23.964  23.764  1.01
NHWC_C::Layer_NaryEltwise::OCV/CPU  18.083  18.458  0.98
NHWC_H::Layer_NaryEltwise::OCV/CPU  24.140  17.857  1.35