intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Deadlock when changing GPU frequency during upload #654

Open russelltg opened 1 year ago

russelltg commented 1 year ago

Pretty minor bug here, but a bug nonetheless. We have an OpenCL-accelerated application using OpenCV, and when I ran sudo intel_gpu_frequency -m while it was running, it locked up inside the OpenCL driver. It doesn't happen every time, so I theorize that it only deadlocks if the frequency change lands during an upload.

Here's the stack:

    frame #0: 0x00007fffdb4e0cab libc.so.6`__sched_yield at syscall-template.S:120
    frame #1: 0x00007ffeed55410e libigdrcl.so`NEO::CommandStreamReceiver::baseWaitFunction(unsigned int volatile*, NEO::WaitParams const&, unsigned int) at gthr-default.h:693:32
    frame #2: 0x00007ffeed470330 libigdrcl.so`NEO::CommandStreamReceiverHw<NEO::TGLLPFamily>::waitForTaskCountWithKmdNotifyFallback(unsigned int, unsigned long, bool, NEO::QueueThrottle) at command_stream_receiver_hw_base.inl:861:47
    frame #3: 0x00007ffeed0c5186 libigdrcl.so`NEO::CommandQueue::waitUntilComplete(unsigned int, NEO::Range<NEO::CopyEngineState>, unsigned long, bool, bool, bool) at command_queue.cpp:259:91
    frame #4: 0x00007ffeed0c8c33 libigdrcl.so`NEO::CommandQueue::waitForAllEngines(bool, NEO::PrintfHandler*, bool) at command_queue.cpp:1044:46
    frame #5: 0x00007ffeed2526a1 libigdrcl.so`NEO::CommandQueueHw<NEO::TGLLPFamily>::finish() at command_queue.h:218:39
    frame #6: 0x00007ffeed0ca1db libigdrcl.so`NEO::CommandQueue::cpuDataTransferHandler(NEO::TransferProperties&, NEO::EventsRequest&, int&) at cpu_data_transfer_handler.cpp:97:23
    frame #7: 0x00007ffeed253902 libigdrcl.so`NEO::CommandQueueHw<NEO::TGLLPFamily>::enqueueReadWriteBufferOnCpuWithMemoryTransfer(unsigned int, NEO::Buffer*, unsigned long, unsigned long, void*, unsigned int, _cl_event* const*, _cl_event**) at command_queue_hw_base.inl:64:27
    frame #8: 0x00007ffeed2ac73d libigdrcl.so`NEO::CommandQueueHw<NEO::TGLLPFamily>::enqueueReadBuffer(NEO::Buffer*, unsigned int, unsigned long, unsigned long, void*, NEO::GraphicsAllocation*, unsigned int, _cl_event* const*, _cl_event**) at enqueue_read_buffer.h:62:65
    frame #9: 0x00007ffeed0990e2 libigdrcl.so`clEnqueueReadBuffer at api.cpp:2309:50
    frame #10: 0x00007fffecbe85d4 libopencv_core4d.so.407`cv::ocl::OpenCLAllocator::download(this=0x000055555fc62140, u=0x00007ffc5415b010, dstptr=0x00007ffc72750040, dims=2, sz=0x00007ffcab27f660, srcofs=0x00007ffcab27f560, srcstep=0x00007ffcab27fc40, dststep=0x00007ffcab27f3d0) const at ocl.cpp:6194:17
    frame #11: 0x00007fffecc9639d libopencv_core4d.so.407`cv::UMat::copyTo(this=0x00007ffcab27fc00, _dst=0x00007ffcab27f9d0) const at umatrix.cpp:1184:23
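
Frames #0 through #2 suggest the thread is spinning in a yield loop, waiting for a completion value that never arrives. The following is a minimal sketch of that general pattern with a timeout added, not the driver's actual code: `waitForTag`, its parameters, and the use of `std::atomic` (the real frame takes an `unsigned int volatile*`) are assumptions made for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a driver-style wait loop (illustration only).
// Spins, yielding the CPU between polls, until `tag` reaches `expected`
// or `timeout` elapses. Returns true on completion, false on timeout.
bool waitForTag(const std::atomic<uint32_t>& tag, uint32_t expected,
                std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (tag.load(std::memory_order_acquire) < expected) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // the writer never advanced the tag in time
        }
        std::this_thread::yield();  // corresponds to sched_yield in frame #0
    }
    return true;
}
```

Without the deadline check, a loop of this shape never returns if the value it polls is never updated, which would look like the hang in the trace above.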

Setup: Ubuntu 22.04, intel-opencl-icd=22.14.22890-1, kernel 6.3.8-arch1-1 (Ubuntu is running in a Docker container, but that shouldn't affect any of this).

JablonskiMateusz commented 1 year ago

Hi @russelltg, please try with a more recent driver.

tazz4843 commented 5 months ago

Can still reproduce today. Currently debugging this and will share more information when I have a chance (feel free to ping me if I forget).

tazz4843 commented 5 months ago

Looks like what I had only looked similar. I opened a new issue since it's unrelated: #706