opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

When running C++ OpenCV code on RK3588 hardware, the program runs normally at first but occasionally crashes. The log shows that the error occurs in the upload function called from cv::Mat::copyTo #24489

Open JackIRose opened 1 year ago

JackIRose commented 1 year ago

System Information

  • OpenCV version: 4.7.0
  • Operating System / Platform: Ubuntu 20.04
  • Compiler & compiler version: GCC 11.3

Detailed description

We are working on a 360° panoramic stitching project. In the code, we call cv::Mat::copyTo from multiple threads to copy the images obtained from the cameras from the CPU to the GPU, converting them from cv::Mat to cv::UMat. The program runs normally when deployed on the bus, but crashes after about a week. This has happened three times. The logs point to the upload function that copyTo calls inside OpenCV. We cannot reproduce the problem in the laboratory, where the program always runs normally. We therefore suspect a problem in OpenCV's copyTo function.

The detailed log is as follows

0x00000071b287968 in cv::error (cv::Exception const&) () from /mnt/system/mvsystem/lib/libKeylabAVM.so
0x0000007fb28b8b9c in cv::error (int, Cxx11::basic_string<char, std::char_traits, std::allocator > const&, char const, char const, int) () from /mnt/system/mvsystem/lib/libKeylabAVM.so
0x0000007fb284b47c in cv::ocl::OpenCLAllocator::upload (cv::UMatData, void const, int, unsigned long unsigned long const, unsigned long const, () from /mnt/system/mvsystem/lib/libKeylabAVM.so
0x0000007fb2762988 in cv::Mat::copyTo (cv::_OutputArray const&) from /mnt/system/mvsystem/lib/libKeylabAVM.so
0x00000071b26791a8 in nameKeylabAVM::KeylabAVMOnline::Inference RectifyBird

Steps to reproduce

[Screenshot: 278255766-509eae80-0715-4b6e-b7f7-ac2482f44675]

Issue submission checklist

asmorkalov commented 1 year ago

Please attach reproducer code with images and full log with error.

Kumataro commented 1 year ago

Sorry, this is only advice on how to write the bug report; it does not fix the problem directly.

  1. This is a duplicate of https://github.com/opencv/opencv/issues/24455 . Did any condition and/or result change?
  2. Could you please update to OpenCV 4.8.1 or 4.x? OpenCV 4.7.0 is not the latest. It seems that you checked "I updated to the latest OpenCV version and the issue is still there".
  3. Could you please build with the debug configuration and find out where cv::error() is called?
  4. Could you please share the program log as text instead of pictures/images? We want to search it by keyword.

I also guess that this may not be an OpenCV bug. If OpenCV detects any trouble or wrong condition, it calls cv::error() to tell you. However, this log contains no information about which condition failed, so it is hard to comment or help.

(e.g. I suspect that the memory allocated to the GPU is being exhausted.)
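For context on the point above: cv::error() reports a failure by throwing cv::Exception, which derives from std::exception, and an exception that nothing catches terminates the process. Below is a minimal sketch of a guard that catches the exception and records its message, so the failing condition ends up in the log. The helper name `guardedCall` is hypothetical, and the operation is passed in as a generic callable so that the sketch compiles without OpenCV.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of OpenCV): runs `op` and reports any
// exception instead of letting it escape and terminate the process.
// Returns true on success; on failure returns false and stores the
// exception message in `what`.
bool guardedCall(const std::function<void()>& op, std::string& what) {
    try {
        op();
        return true;
    } catch (const std::exception& e) {  // cv::Exception derives from std::exception
        what = e.what();
        std::cerr << "operation failed: " << what << '\n';
        return false;
    }
}
```

In the application, the callable would presumably wrap the upload itself, for example `[&] { src.copyTo(dst); }` with a `cv::Mat src` and a `cv::UMat dst`. Catching the exception does not repair the underlying condition, it only makes the full message available.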

JackIRose commented 1 year ago

Screenshot from 2023-11-06 13-51-27 (1) ![PHOTO_20231106_135825980](https://github.com/opencv/opencv/assets/39620243/c47bb951-95d8-482c-b14c-cfc8cc9937b4)

Figure 1 is the main function, Figure 2 is the code that causes the program to crash (marked in red).

The txt file below is the complete log report. 1026_gdb_bt_full.txt

Kumataro commented 1 year ago

I'm sorry, but it seems you want to say that copyTo() is the cause, and there isn't enough evidence to confirm that.

The backtrace contains no information about it.

#8  0x0000007f9da691fc in cv::error(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, char const*, int) ()
   from /mnt/system/mvsystem/lib/libKeylabAVM.so
No symbol table info available.
#9  0x0000007f9d9fbadc in cv::ocl::OpenCLAllocator::upload(cv::UMatData*, void const*, int, unsigned long const*, unsigned long const*, unsigned long const*, unsigned long const*) const ()
   from /mnt/system/mvsystem/lib/libKeylabAVM.so
No symbol table info available.
#10 0x0000007f9d912fe8 in cv::Mat::copyTo(cv::_OutputArray const&) const () from /mnt/system/mvsystem/lib/libKeylabAVM.so
No symbol table info available.

So we cannot know where the exception is raised inside cv::ocl::OpenCLAllocator::upload(). Only you can identify the line at which the exception occurs. Could you please determine it? Perhaps there is a reason.

Code: https://github.com/opencv/opencv/blob/4.7.0/modules/core/src/ocl.cpp#L6228-L6230

Method:

  • building OpenCV with the debug option
  • adding verbose logging

I think it will be difficult to help you further unless new information is shared with us.

zirid commented 12 months ago

From a quick look at the code, it seems that nothing ensures that ImageSrcRoi was successfully created. Check that the image was created before copying. I would also check the ROI rect on the 3rd line.

JackIRose commented 12 months ago

> I'm sorry, but it seems you want to say that copyTo() is the cause, and there isn't enough evidence to confirm that.
>
> The backtrace contains no information about it.
>
> #8  0x0000007f9da691fc in cv::error(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, char const*, int) ()
>    from /mnt/system/mvsystem/lib/libKeylabAVM.so
> No symbol table info available.
> #9  0x0000007f9d9fbadc in cv::ocl::OpenCLAllocator::upload(cv::UMatData*, void const*, int, unsigned long const*, unsigned long const*, unsigned long const*, unsigned long const*) const ()
>    from /mnt/system/mvsystem/lib/libKeylabAVM.so
> No symbol table info available.
> #10 0x0000007f9d912fe8 in cv::Mat::copyTo(cv::_OutputArray const&) const () from /mnt/system/mvsystem/lib/libKeylabAVM.so
> No symbol table info available.
>
> So we cannot know where the exception is raised inside cv::ocl::OpenCLAllocator::upload(). Only you can identify the line at which the exception occurs. Could you please determine it? Perhaps there is a reason.
>
> Code: https://github.com/opencv/opencv/blob/4.7.0/modules/core/src/ocl.cpp#L6228-L6230
>
> Method:
>
>   • building OpenCV with the debug option
>   • adding verbose logging
>
> I think it will be difficult to help you further unless new information is shared with us.

After review, we located the specific place where OpenCV raises the error: it is inside cv::ocl::OpenCLAllocator::upload. However, we still cannot work out what caused the crash or what this line of the function does. The detailed log and screenshot are as follows.

Screenshot from 2023-11-08 09-25-21

Log: what(): OpenCV(4.7.0) /home/hisense/桌面/SoftTokit/opencv-4.7.03/modules/core/src/ocl.cpp:6330: error: (-220:Unknown error code -220) OpenCL error CL_INVALID_COMMAND_QUEUE (-36) during call: clEnqueueWriteBuffer(q, handle=0x7f3c01e010, CL_TRUE, offset=0, sz=2188800, data=0x7f3c9386c0, 0, 0, 0) in function 'upload'
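The message above packs three useful fields into one line: the OpenCL status name, its numeric code, and the OpenCL call that failed. As a rough sketch of how such lines could be pulled out of long-running logs for keyword triage, the following helper extracts those fields with a regular expression. The struct `OpenCLFailure`, the function `parseOpenCLFailure`, and the pattern are illustrative assumptions based only on the wording of this one message, not on any OpenCV API.

```cpp
#include <regex>
#include <string>

// Fields extracted from an OpenCV "OpenCL error ..." message (hypothetical type).
struct OpenCLFailure {
    std::string name;  // e.g. "CL_INVALID_COMMAND_QUEUE"
    int code = 0;      // e.g. -36
    std::string call;  // e.g. "clEnqueueWriteBuffer"
};

// Returns true and fills `out` if `message` contains text of the form
// "OpenCL error NAME (CODE) during call: FUNCTION(".
bool parseOpenCLFailure(const std::string& message, OpenCLFailure& out) {
    static const std::regex pattern(
        R"(OpenCL error (\w+) \((-?\d+)\) during call: (\w+)\()");
    std::smatch m;
    if (!std::regex_search(message, m, pattern)) {
        return false;
    }
    out.name = m[1].str();
    out.code = std::stoi(m[2].str());
    out.call = m[3].str();
    return true;
}
```

Applied to the log line above, this would yield the name CL_INVALID_COMMAND_QUEUE, the code -36, and the call clEnqueueWriteBuffer.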


Kumataro commented 12 months ago

Thank you for your investigation! It's quite a deep topic, so my help may not be enough. Sorry.

https://github.com/opencv/opencv/blob/725e440d278aca07d35a5e8963ef990572b07316/modules/core/src/ocl.cpp#L6325-L6332

Umm... (sorry if my understanding is wrong) it seems that CL_INVALID_COMMAND_QUEUE means q is invalid.

https://man.opencl.org/clEnqueueWriteBuffer.html

CL_INVALID_COMMAND_QUEUE if command_queue is not a valid command-queue.

q comes from here.

https://github.com/opencv/opencv/blob/725e440d278aca07d35a5e8963ef990572b07316/modules/core/src/ocl.cpp#L6262-L6264

I believe this source code shows that it usually returns the same valid queue. queue_ is initialized once and reused many times.

https://github.com/opencv/opencv/blob/725e440d278aca07d35a5e8963ef990572b07316/modules/core/src/ocl.cpp#L3408-L3418

https://github.com/opencv/opencv/blob/725e440d278aca07d35a5e8963ef990572b07316/modules/core/src/ocl.cpp#L1050-L1054

https://github.com/opencv/opencv/blob/725e440d278aca07d35a5e8963ef990572b07316/modules/core/src/ocl.cpp#L897-L907

(Sorry, it's a low possibility.) If possible, could you please check whether there is a return at L3417? A case where the queue becomes invalid should not be expected here...

JackIRose commented 12 months ago

> (Sorry, it's a low possibility.) If possible, could you please check whether there is a return at L3417? A case where the queue becomes invalid should not be expected here...

Thank you very much for your answer. After the program crashes, the system also writes an error to the kernel log, reporting a page allocation failure. Could this affect the validity of q? The log is as follows:

[25186.354250] avm360Main: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[25186.354291] CPU: 1 PID: 1775 Comm: avm360Main Not tainted 5.10.110 #83
[25186.354295] Hardware name: Rockchip RK3588 EVB7 LP4 V10 Board (DT)
[25186.354300] Call tr
2023-11-06 12:36:48 ##G3_DEBUG[g3module.c,147]Send AT: AT+CSQ
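As a reading aid for the kernel message: in the Linux page allocator, `order:N` denotes a request for 2^N physically contiguous pages. The small sketch below converts an order to a byte count. The helper name `orderToBytes` is hypothetical, and the 4 KiB default page size is an assumption that should be checked on the target (for example with `getconf PAGESIZE`).

```cpp
#include <cstdint>

// Size in bytes of a page-allocator request of the given order:
// 2^order contiguous pages. The 4096-byte default page size is an
// assumption, not something stated in the log.
constexpr std::uint64_t orderToBytes(unsigned order,
                                     std::uint64_t pageSize = 4096) {
    return (std::uint64_t{1} << order) * pageSize;
}

// With 4 KiB pages, order 7 is 128 pages, i.e. 512 KiB of contiguous memory.
static_assert(orderToBytes(7) == 512 * 1024, "order 7 with 4 KiB pages");
```

Under that assumption, the failed request was for 512 KiB of contiguous memory. A failure at this order would be consistent with memory shortage or fragmentation after long uptime, though the log alone does not show whether it is related to the invalid queue.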

If you have encountered such a problem, do you have any good solutions? This has bothered us for a long time. We would be grateful for any advice.