Do I have to use `CV_OVERRIDE` and `CV_FINAL`? I presume they were added for portability, but now that both `final` and `override` are keywords in C++11, should they still be used?
Can I use `std::shared_ptr` instead of `cv::Ptr`? There isn't a `make_shared` equivalent, and `makePtr` doesn't do what `std::make_shared` does.
Is it fine to force push occasionally when there isn't any dependent stuff like reviews in between?
CV_OVERRIDE and CV_FINAL

They are used to avoid excessive merge conflicts with the 3.4 branch. Since your code is in the master branch only, this problem does not apply, so you can use the C++ keywords/modifiers.
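For reference, a minimal sketch of what these macros amount to on a C++11 compiler (simplified from the definitions in OpenCV's cvdef.h; treat the exact guards as an assumption):

```cpp
// Simplified sketch: on C++11-capable compilers the macros map directly to
// the language keywords, so using the keywords in master-only code is safe.
#ifndef CV_OVERRIDE
#define CV_OVERRIDE override
#endif
#ifndef CV_FINAL
#define CV_FINAL final
#endif
```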
use std::shared_ptr instead of cv::Ptr

Feel free to use `std::shared_ptr` (but it is not supported by the bindings generator, so be careful with the public API).
makePtr doesn't do what std::make_shared does.

In the master branch it is just a wrapper, so it should do the same thing.
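A minimal sketch of what the master-branch wrapper amounts to (my simplification, not the actual header):

```cpp
#include <memory>
#include <utility>

// Sketch: in master, cv::Ptr is an alias for std::shared_ptr ...
template <typename T> using Ptr = std::shared_ptr<T>;

// ... and makePtr simply forwards to std::make_shared
template <typename T, typename... Args>
Ptr<T> makePtr(Args&&... args) {
    return std::make_shared<T>(std::forward<Args>(args)...);
}
```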
Is it fine to force push

It is OK. Also, rebasing is preferred over "merge" commits (it is easy to do with one squashed commit: squash first, then rebase).
Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined `std::iterator_traits<T>::iterator_category` for non-iterators in C++11.
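To make the concern concrete, here is a sketch of the kind of detection at stake (a C++11 void_t-style trait of my own; whether the pre-C++17 library makes the failure SFINAE-friendly is exactly the open question):

```cpp
#include <iterator>
#include <type_traits>

// Detects whether std::iterator_traits<T>::iterator_category names a type.
template <typename T, typename = void>
struct is_iterator : std::false_type {};

template <typename T>
struct is_iterator<T, typename std::conditional<false,
        typename std::iterator_traits<T>::iterator_category, void>::type>
    : std::true_type {};

static_assert(is_iterator<int*>::value, "pointers are iterators");
// In C++17, std::iterator_traits<int> is required to be an empty struct, so
// is_iterator<int>::value is simply false; before C++17, instantiating it
// could be a hard error rather than a substitution failure.
```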
First the good (or great): I have done some tests, and I can build OpenCV and successfully run both test and perf on my workstation with a "Pascal" generation card (compute capability 6.1) and CUDA 10.1 with gcc-8 as the compiler. It does generate quite a few warnings like:
In file included from /usr/local/cuda/include/cuda_fp16.h:2524:0, from /mnt/storage/src/opencv/opencv-cudadnn-buildtest/modules/dnn/src/cuda/activations.cu:6: /usr/local/cuda/include/cuda_fp16.hpp:279:6: warning: "__CUDA_ARCH__" is not defined, evaluates to 0 [-Wundef]
#if (__CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)) && !defined(__CUDA_NO_HALF2_OPERATORS__)
The included cuda_fp16.h and cuda_fp16.hpp have a lot of `__CUDA_ARCH__` ifdef'ery around fp16 datatype support for different compute capabilities. So somehow nvcc doesn't seem to have that macro defined while compiling the .cu files. However, if I read the manual, it states:
5.7.4. Virtual Architecture Identification Macro

The architecture identification macro `__CUDA_ARCH__` is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy. This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
It looks like it should have been defined, as this is device and not host code? However, it doesn't seem to impact the build for this card generation / compute capability.
And now the somewhat less good: trying to build on an NVIDIA Jetson Nano (compute capability 5.3) fails; NVIDIA currently only provides an image with CUDA 10.0, cuDNN 7.3.1, and gcc-7 as the compiler. So I'm stuck on that for the moment.
I have attached the CMakeVars.txt and the complete build logs for both machines.
Pascal: opencv-cudadnn-buildtest-pascal-gcc.txt, CMakeVars.txt
Nano: opencv-cudadnn-buildtest-jetsonnano-gcc.txt, CMakeVars.txt
@applied-machinelearning That's surprising. I have two tests failing on my PC (and I expect them to fail on any PC).
The warnings are emitted by the CUDA headers, which I think are broken. That's what I can infer from this NVIDIA DevTalk post.
The PR uses the tensor transform API for adding asymmetric padding (required for the `same` padding mode) in the convolution layer. This API was added in cuDNN starting from version 7.5.0 (Release Notes). Hence, the build is failing on the nano.
The minimum version of cuDNN required is 7.5.0. The CMake configuration doesn't reflect this yet [TODO].
@davisking what should be the minimum version of cuDNN supported? The latest version is 7.6.2 and the current minimum is 7.5.0.
@alalek Is it possible to run the accuracy and performance tests on CI? The CUDA build doesn't seem to run the tests.
There are two warnings from an unrelated module:
/build/precommit_custom_linux/opencv_contrib/modules/cudalegacy/src/cuda/NCVBroxOpticalFlow.cu(1124): warning: variable "p_threads" was declared but never referenced
/build/precommit_custom_linux/opencv_contrib/modules/cudalegacy/src/cuda/NCVBroxOpticalFlow.cu(958): warning: variable "dThreads" was declared but never referenced
@applied-machinelearning That's surprising. I have two tests failing on my PC (and I expect them to fail on any PC).
Sorry for not being precise: some individual tests are failing, but the test and perf scripts finish without segmentation faults or other grave errors.
The minimum version of cuDNN required is 7.5.0. The CMake doesn't reflect this yet [TODO].
Thank you for clearing that up, although it's a pity, since at the moment that would rule out the Jetson embedded boards, for which NVIDIA also doesn't support OpenCL. Unfortunately, NVIDIA doesn't provide individual downloads/releases of its libraries for aarch64. Hopefully they will release new images with CUDA 10.1 and an updated cuDNN anytime soon.
@YashasSamaga Great progress!
Is it possible to run the accuracy and performance tests on CI
No, we don't have CUDA GPUs in CI. One of the problems is the prohibition on installing NVIDIA drivers for normal GPUs.
two warnings from an unrelated module
just ignore them (or feel free to prepare a separate PR which suppresses/eliminates these warnings, like #15267)
I think requiring 7.5.0 is fine.
NOTE: the devices used in the test are low-end mobile devices
CPU: i7 7700HQ
GPU: NVIDIA GTX 1050 Mobile
BLAS Library: MKL 2019.0.4
CUDA Version: 10.1
cuDNN: 7.6.2
Warmup Runs: 3 (forward pass is performed three times before benchmarks)
Benchmark Runs: 10 (the average of ten forward passes is reported)
Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
---|---|---|---|
GoogLeNet | 7.2447ms | 10.4981ms | 17.9176ms |
DenseNet121 | 12.6324ms | 19.1823ms | 48.0628ms |
EAST Text Detection | 18.8281ms | 49.0508ms | 88.9429ms |
ENet | 11.5014ms | Exception | 62.5854ms |
FastNeuralStyle StaryNight | 27.498ms | 178.309ms | 160.359ms |
Inception 5h | 7.8546ms | 22.2789ms | 20.3255ms |
Inception v2 FasterRCNN | 112.736ms | Exception | 374.26ms |
MobileNet SSD | 58.4751ms | 9.2896ms | 27.3061ms |
OpenCV Face Detector | 6.9831ms | 8.3981ms | 17.6683ms |
OpenPose Pose MPI | 160.561ms | 509.446ms | 838.161ms |
Resnet 50 | 11.3603ms | 28.1529ms | 50.2752ms |
SqueezeNet | 2.4084ms | 3.2918ms | 5.476ms |
VGG16 SSD | 70.4117ms | 249.725ms | 360.207ms |
Yolo v3 | 57.9822ms | 214.629ms | 296.806ms |
Yolo v2 | 51.5784ms | 193.453ms | 260.19ms |
Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
---|---|---|---|
GoogLeNet | 35.7556ms | 108.946ms | 225.928ms |
DenseNet121 | 74.9241ms | 295.105ms | 650.924ms |
EAST Text Detection | 149.58ms | 536.946ms | 1273.93ms |
FastNeuralStyle StaryNight | 283.173ms | 1966.5ms | 2175.3ms |
Inception 5h | 36.6225ms | 180.429ms | 233.276ms |
MobileNet SSD | 277.753ms | 111.872ms | 316.063ms |
OpenCV Face Detector | 52.4366ms | 95.7866ms | 202.657ms |
OpenPose Pose MPI | 628.617ms | 5650.05ms | 10683.5ms |
Resnet 50 | 74.283ms | 230.817ms | 541.308ms |
SqueezeNet | 15.8144ms | 35.4915ms | 69.4122ms |
VGG16 SSD | 594.286ms | 2796.23ms | 4661.51ms |
Yolo v3 | 488.704ms | 2419.8ms | 4209.74ms |
Yolo v2 | 491.414ms | 2185.47ms | 3788.34ms |
Model | CUDA FP32 | OpenCV OpenCL NVIDIA |
---|---|---|
GoogLeNet | 7.5951ms | 56.218ms |
DenseNet121 | 12.9375ms | 110.564ms |
EAST Text Detection | 19.1325ms | 309.341ms |
ENet | 11.8922ms | 38.8476ms |
FastNeuralStyle StaryNight | 29.69ms | 346.566ms |
Inception 5h | 8.8545ms | 57.4015ms |
Inception v2 FasterRCNN | 114.535ms | 2244.24ms |
MobileNet SSD | 57.6893ms | 148.459ms |
OpenCV Face Detector | 6.9666ms | 59.9923ms |
OpenPose Pose MPI | 162.01ms | 2377.14ms |
Resnet 50 | 11.9307ms | 176.066ms |
SqueezeNet | 2.4413ms | 14.6637ms |
VGG16 SSD | 70.8822ms | 1288.96ms |
Yolo v3 | 58.133ms | 1168.71ms |
Yolo v2 | 53.5697ms | 1016.73ms |
Model | OpenCV OpenCL IG | OpenCV OpenCL FP16 IG | OpenCV OpenCL NVIDIA |
---|---|---|---|
GoogLeNet | 15.5681ms | 11.7769ms | 56.218ms |
DenseNet121 | 49.4344ms | 56.3869ms | 110.564ms |
EAST Text Detection | 86.0381ms | 80.949ms | 309.341ms |
ENet | 27.4152ms | Exception | 38.8476ms |
FastNeuralStyle StaryNight | 105.712ms | 132.263ms | 346.566ms |
Inception 5h | 17.4537ms | 14.6988ms | 57.4015ms |
Inception v2 FasterRCNN | 358.018ms | 374.585ms | 2244.24ms |
MobileNet SSD | 20.5701ms | 21.3236ms | 148.459ms |
OpenCV Face Detector | 21.3481ms | 26.4779ms | 59.9923ms |
OpenPose Pose MPI | 888.518ms | 870.852ms | 2377.14ms |
Resnet 50 | 33.1333ms | 25.5099ms | 176.066ms |
SqueezeNet | 5.877ms | 5.865ms | 14.6637ms |
VGG16 SSD | 425.423ms | 353.651ms | 1288.96ms |
Yolo v3 | 339.913ms | 338.573ms | 1168.71ms |
Yolo v2 | 446.899ms | 314.487ms | 1016.73ms |
After I found an updated image for the Jetson Nano (CUDA 10.0, cuDNN 7.5.0), I have been able to compile OpenCV with cudadnn.
Unfortunately most tests fail with:
unknown file: Failure C++ exception with description "OpenCV(4.1.1-dev) /mnt/storage/src/opencv/opencv-cudadnn-buildtest/modules/dnn/src/cuda4dnn/csl/memory.hpp:263: error: (-217:Gpu API call) operation not supported in function 'MemoryLockGuard' " thrown in the test body.
I have seen this exception on my x86 / GTX 1060 as well when trying to run some Python scripts (some work, some fail with this error).
CMakeVars.txt buildlog.txt opencv-dnn-test-log.txt
BTW: very impressive benchmark results you posted!
This has something to do with Jetson not supporting the ability to page-lock already allocated memory. `MemoryLockGuard` essentially page-locks host memory. This boosts the host-to-device memory transfer bandwidth (it almost doubles on my PC) and also allows the transfer to happen asynchronously.
This also improves inference time when the network has layers which do not have CUDA implementations. It may be insignificant for large networks though.

But yes, it doesn't seem nice to rule out all the Jetson devices. I am currently thinking of adding a build option or maybe a runtime option (disabled by default) to prevent page-locking.

An alternative solution is to allocate page-locked memory up front instead of having to lock it later. But this is mostly not possible, as the host memory is allocated by the DNN backbone code, which is independent of the CUDA backend.
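For context, a minimal sketch of the two approaches using the standard CUDA runtime API (independent of this PR's actual `MemoryLockGuard` implementation):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;

    // (a) pin an existing allocation after the fact: this is what a
    // MemoryLockGuard-style wrapper does, and it is the step that can fail
    // with "operation not supported" on platforms such as the Jetson boards
    void* host = std::malloc(bytes);
    if (cudaHostRegister(host, bytes, cudaHostRegisterDefault) == cudaSuccess)
        cudaHostUnregister(host);
    std::free(host);

    // (b) allocate page-locked memory up front instead, the alternative ruled
    // out above because the DNN backbone owns the allocation
    void* pinned = nullptr;
    if (cudaMallocHost(&pinned, bytes) == cudaSuccess)
        cudaFreeHost(pinned);
    return 0;
}
```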
Yes, there is a test failing on my x86 PC due to `MemoryLockGuard` attempting to lock memory which is already locked. I will be rolling out a fix for this soon. I am speculating that this is the error you are also facing.
Do you know what situation causes the error? What's different in the codes that work and that don't?
But yes, it doesn't seem nice to rule out all the Jetson devices. I am currently thinking of adding a build option or maybe a runtime-option (disabled by default) to prevent page-locking.
Could this perhaps be a build option auto-enabled on the ARM architecture, as the referenced post seems to suggest it is a problem on ARM only?
An alternate solution is to allocate page-locked memory during allocation instead of having to later lock it. But this is mostly not possible as the host memory is allocated by the DNN backbone code which is independent of the CUDA backend.
Yes, there is a test failing on my x86 PC due to `MemoryLockGuard` attempting to lock memory which is already locked. I will be rolling out a fix for this soon. I am speculating that this is the error you are also facing. Do you know what situation causes the error? What's different in the codes that work and that don't?
No unfortunately I haven't been able to figure it out. All the hunches I had, I incorporated in the simple test script, but that doesn't fail.
I won't be adding new code to the PR anymore. The commits from now on will be limited to bug fixes and refactoring. I will create separate PRs and improve upon this PR in the coming months.
@alalek @dkurt PR is ready for review.
Got some test and perf results for the jetson nano. opencv-dnn-perf-log.txt opencv-dnn-test-log.txt
The Deconvolution3D test is failing because the DNN backbone code has attempted to allocate an internal blob with zero size, which has been caught by an assertion in the CUDA backend. If I remember correctly, this is a recent issue which wasn't happening until I rebased a week ago.
/cc @alalek I am not sure if this is a bug. Deconvolution3D appears to request a zero-sized internal blob.
[TODO BUG] The CUDA backend has its own system of managing internal blobs and hence I shouldn't even be allocating these internal blobs. I'll skip the `wrap()` call for the CUDA backend. This should free up some (maybe significant) precious GPU memory.
The SSD networks aren't failing on my PC. The outputs certainly look wrong though. Can you try running the tests on your GTX 1060 and check if they fail? @applied-machinelearning
The SSD networks aren't failing on my PC. The outputs certainly look wrong though. Can you try running the tests on your GTX 1060 and check if they fail? @applied-machinelearning
You are correct, they aren't failing on my other machine. opencv-dnn-test-log-pascal.txt
Main differences (Jetson vs my workstation with a Pascal card):
- CUDA: 10.0 vs 10.1
- cuDNN: 7.5.0 vs 7.6.2
- compute capability: 5.3 vs 6.0 / 6.1 (Maxwell vs Pascal; note that normal Maxwell is 5.0 / 5.2, while the Jetson Nano's 5.3 seems to have different handling around the FP16 stuff)
- arch: aarch64 vs x86 (thinking out loud: an issue with something like endianness when handing over to CPU layers?)
Another thing I noticed: I tried to compile with compute capability 5.0 / 5.2 on my workstation (to rule out whether the compute capability has anything to do with the above problem), but that fails on the __half stuff, which seems to be necessary only for CUDA_FP16. If I remember correctly, CUDA_FP16 only makes sense performance-wise for later cards, so perhaps the whole CUDA_FP16 path should depend on the compute capability being at least Pascal? That would probably make all these compile errors go away. See the build log: buildlog.txt
Good news: commit e4e6759 seems to have fixed the last failing test: Test_ONNX_layers.Deconvolution3D/0, where GetParam() = CUDA/CUDA
I have done some more digging into the Python problem. After the commit "ignore memory lock failures" it now fails on: cv2.error: OpenCV(4.1.1-dev) /mnt/storage/opencv/opencv-cudadnn/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) initialization error in function 'ManagedPtr'.
So I instrumented that with cudaMemGetInfo and the requested malloc size. When using the simple Python script it functions properly:
cudaMemGetInfo free: 6103695360 total: 6373179392 Malloc size: 2076672
But when trying the more involved script it fails:
cudaMemGetInfo free: 0 total: 0 Malloc size: 2076672
So it seems CUDA isn't initialized properly. That is probably due to the more involved script being multiprocess / multithreaded. I will see if I can make a minimal Python script that exhibits the problem.
Hello, thank you for your efforts.
I got successful tests on the platforms below with my 300x300px trained Inception v2 SSD TensorFlow model:
Test Platform # 1 : GTX-1050-TI & i7-7700HQ & Win10 : ~22 fps on CPU (DNN_BACKEND_OPENCV & DNN_TARGET_CPU), ~49 fps on CPU with IE (DNN_BACKEND_INFERENCE_ENGINE & DNN_TARGET_CPU), ~46 fps on CUDA GPU (DNN_BACKEND_CUDA & DNN_TARGET_CUDA)
Test Platform # 2 : GTX-1050-Mobile & i5-8300H & Win10 : ~9 fps on CPU (DNN_BACKEND_OPENCV & DNN_TARGET_CPU), ~21 fps on CPU with IE (DNN_BACKEND_INFERENCE_ENGINE & DNN_TARGET_CPU), ~41 fps on CUDA GPU (DNN_BACKEND_CUDA & DNN_TARGET_CUDA)
@YashasSamaga Awesome work!
Any idea why CUDA MobileNet SSD performs badly? Probably some missing layers in the CUDA backend?
What could be the reasons for the bad performance of the OpenCL backend on NVIDIA? It performs worse than OpenCL on Intel IG.
Is it because the OpenCL drivers on NVIDIA are suboptimal? Is it because the kernels are tuned for Intel IG? A combination of both, maybe?
Any idea why CUDA MobileNet SSD performs badly? Probably some missing layers in the CUDA backend?
cuDNN performs very poorly for depthwise convolutions. It launches thousands of kernels. Hopefully, this will be fixed in a future version of cuDNN.
The only missing layer is DetectionOutputLayer, which doesn't take a toll on the performance as it appears at the end of the network. In fact, it's generally faster on the CPU than on a CUDA device (it's also partly because of my inability to write a kernel which can outperform the CPU).
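For reference, a depthwise convolution reaches cuDNN as a grouped convolution; a hedged sketch of the relevant descriptor setup (cuDNN 7 API):

```cpp
#include <cudnn.h>

// A depthwise convolution is a grouped convolution with one group per
// channel; the slowness discussed above is in how cuDNN executes such
// configurations (e.g. one kernel launch per group), not in this setup.
void configure_depthwise(cudnnConvolutionDescriptor_t conv, int channels) {
    cudnnSetConvolutionGroupCount(conv, channels);
}
```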
What could be the reasons of the bad performance of OpenCL backend on Nvidia? It performs worse than OpenCL on Intel IG. Is it because the OpenCL drivers on Nvidia are suboptimal? Is it because the kernels are tuned for Intel IG? A combination of both maybe?
The OpenCV backend's OCL implementation frequently uses the CPU target as a fallback. This target switch is very cheap as the IG and CPU share the same memory.
NVIDIA devices have their own dedicated graphics memory. Every time a fallback is used, you'll have to transfer the memory from the device to the host. This is very costly. So costly that any benefits that are to be gained are completely outweighed by the cost of the intermediate memory transfers.
OpenCL limits the ability to exploit the full capability of CUDA devices. It offers far less control than what a CUDA backend could have.
CPU: 2x Intel Xeon E5-2640 v4 (40 logical cores)
GPU: 1x NVIDIA GTX 1080 Ti (11 GB)
CUDA Version: 10.0
cuDNN: 7.6.2
Warmup Runs: 3 (forward pass is performed three times before benchmarks)
Benchmark Runs: 10 (the average of ten forward passes is reported)
Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
Batch size = 1:

Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 4.8824ms | 14.2981ms |
DenseNet121 | 6.4555ms | 57.8244ms |
EAST Text Detection | 5.901ms | 67.4301ms |
ENet | 4.5979ms | 30.2767ms |
FastNeuralStyle StaryNight | 5.3193ms | 51.3313ms |
Inception 5h | 4.9487ms | 16.0048ms |
Inception v2 FasterRCNN | 82.0298ms | 179.245ms |
MobileNet SSD | 70.9177ms | 23.9348ms |
OpenCV Face Detector | 4.9288ms | 15.4205ms |
OpenPose Pose MPI | 30.5954ms | 246.747ms |
Resnet 50 | 4.5968ms | 45.1153ms |
SqueezeNet | 1.0888ms | 3.6492ms |
VGG16 SSD | 23.5926ms | 194.976ms |
Yolo v3 | 18.0002ms | 141.861ms |
Yolo v2 | 12.1279ms | 111.642ms |
Batch size = 10:

Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 10.149ms | 75.9591ms |
DenseNet121 | 20.269ms | 312.426ms |
EAST Text Detection | 32.1556ms | 402.16ms |
FastNeuralStyle StaryNight | 49.1025ms | 461.095ms |
Inception 5h | 9.9721ms | 67.9308ms |
MobileNet SSD | 96.2898ms | 110.783ms |
OpenCV Face Detector | 22.7501ms | 77.8742ms |
OpenPose Pose MPI | 118.858ms | 2321.89ms |
Resnet 50 | 18.4139ms | 229.599ms |
SqueezeNet | 4.4893ms | 22.3049ms |
VGG16 SSD | 194.181ms | 1319.67ms |
Yolo v3 | 122.603ms | 1044.11ms |
Yolo v2 | 104.072ms | 819.177ms |
Batch size = 128:

Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 90.3755ms | 775.769ms |
DenseNet121 | 199.516ms | 3536.38ms |
EAST Text Detection | 376.458ms | 7685.72ms |
FastNeuralStyle StaryNight | 801.778ms | 6607.15ms |
Inception 5h | 93.4188ms | 771.575ms |
MobileNet SSD | 1028.93ms | 1110.37ms |
OpenCV Face Detector | 276.992ms | 977.997ms |
OpenPose Pose MPI | 1279.26ms | 32159.3ms |
Resnet 50 | 200.789ms | 1719.92ms |
SqueezeNet | 55.6244ms | 255.397ms |
VGG16 SSD | 2969.05ms | 17201ms |
Yolo v3 | 1564.78ms | 13699.2ms |
Yolo v2 | 1362.84ms | 11254.9ms |
Throughput in images per second:

Model | batch size = 1 | batch size = 10 | batch size = 128 |
---|---|---|---|
GoogLeNet | 204 | 985 | 1416 |
DenseNet121 | 154 | 493 | 641 |
EAST Text Detection | 169 | 311 | 340 |
ENet | 217 | Not Applicable | Not Applicable |
FastNeuralStyle StaryNight | 188 | 204 | 160 |
Inception 5h | 202 | 1002 | 1370 |
Inception v2 FasterRCNN | 12 | Not Applicable | Not Applicable |
MobileNet SSD | 14 | 104 | 124 |
OpenCV Face Detector | 202 | 440 | 462 |
OpenPose Pose MPI | 33 | 84 | 100 |
Resnet 50 | 217 | 540 | 637 |
SqueezeNet | 918 | 2228 | 2301 |
VGG16 SSD | 42 | 52 | 43 |
Yolo v3 | 55 | 82 | 81 |
Yolo v2 | 82 | 96 | 93 |
Batch size = 1:

Model | CUDA FP32 | TensorFlow |
---|---|---|
ResNet-50 | 4.5968ms | 7.1163ms |
EAST Text Detection | 5.901ms | 8.6890ms |
Batch size = 10:

Model | CUDA FP32 | TensorFlow |
---|---|---|
ResNet-50 | 18.4139ms | 22.3665ms |
EAST Text Detection | 32.1556ms | 39.4857ms |
Batch size = 128:

Model | CUDA FP32 | TensorFlow |
---|---|---|
ResNet-50 | 200.789ms | 216.3923ms |
EAST Text Detection | 376.458ms | 421.8292ms |
@YashasSamaga Here is a simple Python script that shows initialization problems with CUDA and multiprocessing in Python. I have no idea if it also applies to C++ etc.
It also causes an error when setting the backend to CPU in the other process.
https://gist.github.com/applied-machinelearning/9462e1368065fd7bf93334b0130a6ba0
Starting detection on CPU from mainprocess Results of detection on CPU from mainprocess [[0.01141005 0.00488871 0.02028161 ... 0. 0. 0. ] ... [0.98707837 0.98927605 0.13942897 ... 0. 0. 0. ]]
Starting detection on CUDA from mainprocess Results of detection on CUDA from mainprocess [[0.01141005 0.00488872 0.02028161 ... 0. 0. 0. ] ... [0.98707837 0.98927605 0.13942908 ... 0. 0. 0. ]]
Starting detection on CPU from multiprocessing terminate called after throwing an instance of 'cv::dnn::cuda4dnn::csl::cublas::cuBLASException' what(): OpenCV(4.1.1-dev) /mnt/storage/opencv/opencv-cudadnn-really-working/modules/dnn/src/cuda4dnn/csl/cublas.hpp:63: error: (-217:Gpu API call) CUBLAS_STATUS_NOT_INITIALIZED in function 'UniqueHandle'
Starting detection on CUDA from multiprocessing terminate called after throwing an instance of 'cv::dnn::cuda4dnn::csl::cublas::cuBLASException' what(): OpenCV(4.1.1-dev) /mnt/storage/opencv/opencv-cudadnn-really-working/modules/dnn/src/cuda4dnn/csl/cublas.hpp:63: error: (-217:Gpu API call) CUBLAS_STATUS_NOT_INITIALIZED in function 'UniqueHandle'
multiprocessing in python

Perhaps we need to block `fork()` calls from Python: #5150

`multiprocessing.set_start_method('spawn')` should help.
@alalek Just tested, and that works for me!
Any ideas on: https://github.com/opencv/opencv/pull/14827#issuecomment-522737374 ?
@applied-machinelearning I have tested on several mobile and desktop GPUs but have not been able to reproduce the failures. It might be specific to Jetson but I don't have access to one (at least for now).
Can you try recloning (or maybe a hard reset?) the repositories (opencv_extra and my fork) and running the tests again?
EDIT: I have verified with someone else. The tests are failing on Jetson Nano (with CUDA 10.0 and cuDNN 7.5.0).
The issue is not with CUDA 10.0 or CUDA 10.1 as I have tested both on desktop GPUs. I haven't used cuDNN other than 7.6.2. So it's either cuDNN or something wrong in my code which breaks only in Jetson Nano.
I also did some testing with the Jetson Nano; here are my results.
It appears that the SSD networks are failing in the tests.
@YashasSamaga I retested with pristine cloned trees and downloads and still got these failing tests:
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v2_TensorFlow/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.SSD_VGG16/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.Inception_v2_SSD_TensorFlow/0, where GetParam() = CUDA/CUDA
[ FAILED ] Test_TensorFlow_nets.Inception_v2_SSD/0, where GetParam() = CUDA/CUDA
I previously tried to compile for the Maxwell generation on my workstation (with a Pascal card), but that failed on the lack of fp16 support with sm_50 and sm_52; I forgot to test with sm_53. Now I did, and that compiles. So with the current code and no way to disable FP16 support, sm_53 is the lowest compute capability that compiles. I haven't checked yet what this means for the other (older) Jetson boards. EDIT: just checked, Jetson TK1 = 3.2, Jetson TX1 = Maxwell (5.3), Jetson TX2 = Pascal, Jetson Xavier = Volta. So only the old TK1 would be a problem on the embedded side of things. So either separating FP16 support and making it optional, or just requiring sm_53 in CMake, would seem sensible?
Running the tests with OpenCV compiled only for sm_53 on the Pascal card on x86 (CUDA 10.1, cuDNN 7.6.2) gives no failing tests, so the compute capability doesn't seem to matter.
So it's either something in cuDNN 7.5.0, in the NVIDIA CUDA / cuDNN libraries on arm64, or in the OpenCV cudadnn code on arm64. Which makes me wonder: what is so special about these failed models that they fail while all other models and tests succeed on the Jetson Nano? Any special layers? Any special layers which are CPU-only?
@YashasSamaga thank you for your great contribution! May I ask if you know whether Mask R-CNN will also work using your CUDA backend? It works using the OpenCV backend. Will your CUDA/cuDNN backend automatically fall back to another backend for layers that aren't supported? I added the pbtxt in case you're interested in what layers the network uses:
@tompollok
The CUDA backend uses OpenCV CPU backend as a fallback for unsupported layers or layer configurations. The fallbacks are quite costly though.
I used your .pbtxt and the .pb from here.
Inception v2 Mask RCNN
OCV CPU Time: 3280ms
CUDA Total Time: 407ms
Relative Error >> Total: 0.787171, Average: 1.94363e-07, Max: 1.65757e-06
7700HQ and GTX 1050 were the devices used in the test. Every output value from the CUDA backend is compared against the corresponding output from the OpenCV CPU backend. The relative error is calculated as:
error = abs(x - y) / max(max(abs(x), abs(y)), eps) where eps = 1e-7
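Expressed as code (a direct transcription of the formula above):

```cpp
#include <algorithm>
#include <cmath>

// Relative error between a CUDA-backend output x and the CPU reference y;
// eps guards against division by zero when both values are close to zero.
double relative_error(double x, double y, double eps = 1e-7) {
    return std::abs(x - y) / std::max({std::abs(x), std::abs(y), eps});
}
```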
If you use the CUDA backend in a debug build, it will print the layers for which fallback is used. The CUDA backend uses OCV CPU fallback for the following layers:
[ INFO:0] global E:\Repositories\opencv\modules\dnn\src\dnn.cpp (1820) cv::dnn::dnn4_v20190621::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type __NetInputLayer__
[ INFO:0] global E:\Repositories\opencv\modules\dnn\src\dnn.cpp (1820) cv::dnn::dnn4_v20190621::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "detection_out" of type DetectionOutput
[ INFO:0] global E:\Repositories\opencv\modules\dnn\src\dnn.cpp (1820) cv::dnn::dnn4_v20190621::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "CropAndResize" of type CropAndResize
[ INFO:0] global E:\Repositories\opencv\modules\dnn\src\dnn.cpp (1820) cv::dnn::dnn4_v20190621::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "detection_out_final" of type DetectionOutput
[ INFO:0] global E:\Repositories\opencv\modules\dnn\src\dnn.cpp (1820) cv::dnn::dnn4_v20190621::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "CropAndResize_1" of type CropAndResize
I will be adding the CropAndResize layer and NetInputLayer soon, which should bring down the inference time considerably. Adding support for the detection output layer is a bit tricky because of NMS. I have tried but haven't been able to beat the CPU version of NMS. I'll hopefully get it working someday.
If you use the CUDA backend in a debug build, it will print the layers for which fallback is used.
I think that info would also be helpful for non-debug builds (users of released distro versions). Could it be an idea to add the id of the backend used to getPerfProfile() (or have its own function to get that info per layer)?
info would also be helpful for non-debug builds

The environment variable `OPENCV_LOG_LEVEL=INFO` should help with messages in release builds.
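Alternatively (an assumption on my part, not something stated in this thread), the log level can be raised programmatically via the logging utilities in recent OpenCV versions:

```cpp
#include <opencv2/core/utils/logger.hpp>

int main() {
    // equivalent in effect to OPENCV_LOG_LEVEL=INFO for this process
    cv::utils::logging::setLogLevel(cv::utils::logging::LOG_LEVEL_INFO);
    // ... build the network and call forward(); fallback messages now appear
    return 0;
}
```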
I will be adding CropAndResize layer and NetInputLayer soon which should bring down the inference time considerably. Adding support for the detection output layer is a bit tricky because of NMS. I have tried but haven't been able to beat the CPU version of NMS. I'll hopefully get it working someday.
That's great news! Is there a way to list the single, average, or accumulated forwarding time in ms per layer when forwarding, to see where bottlenecks may occur?
Currently, there is no simple way. The numbers returned by `getPerfProfile` are not accurate for the CUDA backend.

The backend has NVTX integration, which marks regions of layers and allows you to identify exactly the time taken by each layer in NVIDIA's profiling tools. It can be enabled by defining the `CUDA4DNN_ENABLE_NVTX` preprocessor symbol while building the DNN module.

The timings can be computed for the layers within the code by adding events at the beginning and the end of every layer. This would allow `getPerfProfile` to return accurate timings. I need to investigate the performance impact (mostly negligible) of using CUDA's event API for timing and then decide what to do.
`forwardAsync` for the CUDA backend does not do what it does for IE.

For the CUDA backend, it dumps the operations to the device and returns immediately so that the calling thread can continue. It's not possible to call `forwardAsync` until the previous operations finish.

This overloads the meaning of `forwardAsync`. Any user can mimic this `forwardAsync` behaviour of the CUDA backend on their own, so I don't see why it should be a part of the CUDA backend.

I think I should revert https://github.com/opencv/opencv/pull/14827/commits/1154b9da9da07e9b52f8a81bdcea48cf31c56f70 or make it behave like IE `forwardAsync` (which isn't trivial).
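As a rough illustration of "any user can mimic this on their own", a sketch (my own, not code from the PR) that launches the blocking `forward()` on a worker thread:

```cpp
#include <future>
#include <opencv2/dnn.hpp>

// Runs the synchronous forward() on another thread so the caller can
// continue, roughly the behaviour described above. The same caveat applies:
// don't submit another request before the future is ready.
std::future<cv::Mat> forward_on_worker(cv::dnn::Net& net) {
    return std::async(std::launch::async, [&net] { return net.forward(); });
}
```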
`forwardAsync` for CUDA backend does not do what it does for IE. For the CUDA backend, it dumps the operations to the device and returns immediately so that the calling thread can continue. It's not possible to call `forwardAsync` until the previous operations finish.

The calling thread can call `forwardAsync` again, but you mean that if they do that they will get the wrong results, right?
This overloads the meaning of `forwardAsync`. Any user can mimic this `forwardAsync` behaviour of the CUDA backend on their own, so I don't see why it should be a part of the CUDA backend. I think I should revert [1154b9d]

Assuming the above is all correct, reverting this and making `forwardAsync` just not supported for the CUDA backend seems like the right thing to do, since it's important for methods that claim to implement an interface (i.e. `forwardAsync`) to actually conform to the interface's contract.
@davisking Yes, if the calling thread calls `forwardAsync` before the previous request has been completed, they will get wrong results.
getPerfProfile()

I have tried using events to allow `getPerfProfile` to report timings accurately. The AlexNet benchmark time shot up by 1ms (6ms from 5ms) when events were added at the beginning and the end of every layer. Hence, I think it's not a good idea to add it unless there is a way to optionally enable/disable profiling in the `cv::Net` interface.
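For reference, a sketch of the event-based per-layer timing being discussed (plain CUDA runtime API; the ~1ms overhead above comes from inserting something like this around every layer):

```cpp
#include <cuda_runtime.h>

// Times an arbitrary span of work submitted to `stream` using CUDA events.
// The synchronize forces the host to wait, which is part of the overhead.
float time_span_ms(cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    // ... enqueue the layer's kernels on `stream` here ...
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```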
CPU: 7700HQ
GPU: GTX 1050 Mobile
The timings are in milliseconds.
Depthwise convolutions in cuDNN are insanely slow. So bad that the CPU takes 0.9ms and the GPU takes 20ms in `conv13/dw`. I have seen it launching a huge number of kernels. I suspect it launches one kernel per group, which is inefficient.
DetectionOutput is slow because it's performed on the CPU, which requires the data to be transferred from the GPU to the CPU.

The priorbox, concat, and permute operations are slower on the GPU because the operands of those operations are too small. Many of these operations use just a one- or two-digit number of cores even though my GPU has 768 cores.
/cc @catree might be relevant as you were interested to know why MobileNet performs so badly
I'm getting the following build errors:
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(23): error: identifier "hexp" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(24): error: identifier "h2exp" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(29): error: identifier "hexp" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(30): error: identifier "h2exp" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(53): error: identifier "hlog" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(54): error: identifier "h2log" is undefined
/home/peter/Downloads/opencv/modules/dnn/src/cuda/math.hpp(59): error: more than one conversion function from "half" to a built-in type applies: function "half::operator float() const" function "half::operator short() const" function "half::operator unsigned short() const" function "half::operator int() const" function "half::operator unsigned int() const" function "half::operator long long() const" function "half::operator unsigned long long() const" function "half::operator nv_bool() const"
@pfeatherstone Can you upload your CMakeCache.txt? What device do you have?
@YashasSamaga I'm running Ubuntu 18, CUDA 10.1, cuDNN 7, and I have two Titan X cards. CMakeCache.txt
The CUDA backend provides a half-precision target (`DNN_TARGET_CUDA_FP16`) to further accelerate DNN inference. It makes use of half-precision intrinsics which are supported only on devices with CC 5.3 and above.
@pfeatherstone Your GPU's compute capability is 6.1 which is good enough. (Source: https://developer.nvidia.com/cuda-gpus#compute)
In your CMakeCache.txt, the `CUDA_ARCH_BIN` option appears to have been set to `3.0 3.5 3.7 5.0 5.2 6.0 6.1 7.0 7.5` (which I think is the default). Please change this option to `6.1` or a list of your choice where all architectures are CC 5.3+.
After building you can try running the tests:
[build dir]/bin/opencv_test_dnn
[build dir]/bin/opencv_perf_dnn
[TODO DONE] throw an error while configuring CMake for unsupported compute capabilities
This CC limitation is due only to the half-precision support. It's possible to have a build option to enable or disable half-precision support in the CUDA backend, but I wonder if this is of any use. I don't think people still use very old GPUs, and hence such an option might not be very useful.
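For context, a sketch of the kind of guard involved (half2 intrinsics such as `__hmul2` only exist on devices of CC 5.3+, which is why sm_50/sm_52 builds fail):

```cpp
#include <cuda_fp16.h>

// Scales an array of half2 pairs; the body compiles away on architectures
// below sm_53, where the half-precision arithmetic intrinsics are absent.
__global__ void scale_half2(__half2* data, __half2 factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 530
    if (i < n) data[i] = __hmul2(data[i], factor);
#endif
}
```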
Hi @YashasSamaga ,
Thanks for your reply earlier on. I have changed CUDA_ARCH_BIN
as per your suggestion.
I am trying to compile on windows 10. Still getting errors.
@Avrohom Can you try building once again from a fresh clone? You seem to have errors from many modules.
I am able to build on my PC with VS17. What compiler are you using?
Hi @YashasSamaga,
Many thanks for your kind help. I will do a fresh clone. I was using VS2019, which I had used to successfully build the official OpenCV master.
@YashasSamaga !
Nope. Not working. Did a fresh clone. Tried with VS2017.
@Avrohom
opencv_world.vcxproj
Why are you building in opencv_world mode? CUDA is not well supported in this mode.
@alalek Well, I managed to get the official master branch built in opencv_world mode, and it included CUDA.
Looks like it blames `std::max`, a common Windows problem: https://stackoverflow.com/questions/13416418/define-nominmax-using-stdmin-max/13420838

Try adding `#define NOMINMAX` somewhere (precomp.hpp, or via CMake with `add_definitions(/DNOMINMAX)`).
Hi @alalek, @YashasSamaga,

Many thanks. Yes, I had to fix that `std::max` issue. I didn't know where to put `#define NOMINMAX` though (I hadn't noticed your reply then), so I basically changed it to `(std::max)(...)`.
Also, I manually changed the code in modules/gapi/include/opencv2/gapi/infer.hpp to apply the fix described at https://github.com/opencv/opencv/commit/212f0fb5093ff8353cce602084d60225061f79f1#diff-e56cd60011ddca49858f11ce28fd3c31.
I am working with the 'opencv-cuda4dnn-csl-low' branch because it looked to me to be the most up to date. Am I missing something? To which branch has the above fix been merged?
I can confirm now that I have been able to build that repository successfully on windows including the opencv_world option.
Interestingly though, when testing the debug-mode build for Mask R-CNN inference, it takes approx. 1,150ms per frame and I get a few messages about the library 'falling back' onto the CPU for various operations, whereas when running the release version I do not get any of those 'fallback' messages and processing takes about 350ms per frame. Such a big difference is quite strange.
@Avrohom
I have run into the `std::max`/`std::min` issue after rebasing. I think a fix for MSVC should be added globally in CMake.
The debug build prevents many compiler optimizations which would improve the inference time. IMO, such a large difference in timings is normal.
The fallback messages are not displayed in release builds, but the fallbacks are still used. Refer to this comment to enable the messages in non-debug builds.
What are the next steps on this PR? I see we are waiting for reviews from @alalek and @dkurt. Is there anything holding up those reviews yet to be completed by @YashasSamaga?
More up-to-date info available here (unofficial)
How to build and use the CUDA backend?
How to use multiple GPUs?
Benchmarks
Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM
Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d
Support Matrix for this PR
## Current Support Matrix: (not updated)

Blip | Meaning
---- | ---------
✔️ | supports all the configurations that are supported by all the existing backends (and might support more than what's currently supported)
🔵 | partially supported (fallback to CPU for unsupported configurations)
:x: | not supported (fallback to CPU)

Layer | Status | Constraints | Notes
----- | ------ | ----------- | -----
Activations | ✔️ | |
Batch Normalization | ✔️ | |
Blank Layer | ✔️ | |
Concat Layer | ✔️ | |
Const Layer | ✔️ | |
Convolution 2d | ✔️ | | asymmetric padding is disabled in layer constructor but the backend supports it
Convolution 3d | ✔️ | | asymmetric padding is disabled in layer constructor but the backend supports it
Crop and resize | :x: | |
Crop Layer | ✔️ | | forwarded to Slice Layer
Detection Output Layer | :x: | |
Deconvolution 2d | 🔵 | padding configuration should not lead to extra uneven padding |
Deconvolution 3d | 🔵 | padding configuration should not lead to extra uneven padding |
Elementwise Layers | ✔️ | |
Eltwise Layer | ✔️ | |
Flatten Layer | ✔️ | |
Fully Connected Layer | ✔️ | |
Input Layer | :x: | |
Interp Layer | ✔️ | |
Local Response Normalization | ✔️ | |
Max Unpooling 2d | ✔️ | |
Max Unpooling 3d | ✔️ | |
MVN Layer | :x: | |
Normalize Layer | 🔵 | Only L1 and L2 norm supported |
Padding Layer | ✔️ | |
Permute Layer | ✔️ | |
Pooling 2d | 🔵 | Only max and average pooling supported | supports asymmetric padding
Pooling 3d | 🔵 | Only max and average pooling supported | supports asymmetric padding
Prior Box Layer | ✔️ | |
Proposal Layer | :x: | |
Region Layer | ✔️ | NMS performed using CPU |
Reorg Layer | ✔️ | |
Reshape Layer | ✔️ | |
Resize Layer | ✔️ | |
Scale Layer | ✔️ | |
Shift Layer | ✔️ | | forwarded to Scale Layer
Shuffle Channel Layer | ✔️ | |
Slice Layer | ✔️ | |
Softmax Layer | ✔️ | |
Split Layer | ✔️ | |
LSTM Layer | :x: | |

Known issues:
References: #14585
Results: