intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

segfault in NEO::MapOperationsHandler::findInfoForHostPtr #640

Closed: richy759 closed this issue 1 year ago

richy759 commented 1 year ago

I'm running an intensive OpenCL workload through OpenCV DNN on Gen9 integrated graphics. This is on Gentoo with compute-runtime version 23.13.26032.17 (older versions behaved the same). The network works and runs for many thousands of calls, then segfaults. GDB stack trace below. Any help would be appreciated, thanks.


```
[Switching to Thread 0x7ffebf05b6c0 (LWP 27798)]
0x00007ffff4fd8680 in pthread_mutex_lock () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007ffff4fd8680 in pthread_mutex_lock () at /lib64/libc.so.6
#1  0x00007fffa489bb66 in __gthread_mutex_lock(__gthread_mutex_t*) (__mutex=0x7ff9c69c7484)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/11/include/g++-v11/x86_64-pc-linux-gnu/bits/gthr-default.h:749
#2  0x00007fffa489bcb8 in std::mutex::lock() (this=0x7ff9c69c7484)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/11/include/g++-v11/bits/std_mutex.h:100
#3  0x00007fffa4970f6a in std::lock_guard<std::mutex>::lock_guard(std::mutex&)
    (this=0x7ffebf058b30, __m=...)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/11/include/g++-v11/bits/std_mutex.h:229
#4  0x00007fffa49f21bf in NEO::MapOperationsHandler::findInfoForHostPtr(void const*, unsigned long, NEO::MapInfo&) (this=0x7ff9c69c7464, ptr=0x7ffe89bf13c0, size=270720, outMapInfo=...)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/mem_obj/map_operations_handler.cpp:65
#5  0x00007fffa49f2641 in NEO::MapOperationsStorage::getInfoForHostPtr(void const*, unsigned long, NEO::MapInfo&) (this=0x7fffb88b6908, ptr=0x7ffe89bf13c0, size=270720, outInfo=...)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/mem_obj/map_operations_handler.cpp:109
#6  0x00007fffa497f5f6 in NEO::Context::tryGetExistingMapAllocation(void const*, unsigned long, NEO::GraphicsAllocation*&)
    (this=0x7fffb88b6770, ptr=0x7ffe89bf13c0, size=270720, allocation=@0x7ffebf058cf8: 0x0)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/context/context.cpp:133
#7  0x00007fffa497f401 in NEO::Context::tryGetExistingHostPtrAllocation(void const*, unsigned long, unsigned int, NEO::GraphicsAllocation*&, InternalMemoryType&, bool&)
    (this=0x7fffb88b6770, ptr=0x7ffe89bf13c0, size=270720, rootDeviceIndex=0, allocation=@0x7ffebf058cf8: 0x0, memoryType=@0x7ffebf058d34: NOT_SPECIFIED, isCpuCopyAllowed=@0x7ffebf058d30: true)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/context/context.cpp:102
#8  0x00007fffa4afe443 in NEO::CommandQueueHw<NEO::Gen9Family>::enqueueReadBuffer(NEO::Buffer*, unsigned int, unsigned long, unsigned long, void*, NEO::GraphicsAllocation*, unsigned int, _cl_event* const*, _cl_event**) (this=0x7fffb80110f0, buffer=0x7ffe89e23b00, blockingRead=1, offset=0, size=270720, ptr=0x7ffe89bf13c0, mapAllocation=0x0, numEventsInWaitList=0, eventWaitList=0x0, event=0x0)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/command_queue/enqueue_read_buffer.h:54
#9  0x00007fffa4881a80 in clEnqueueReadBuffer(cl_command_queue, cl_mem, cl_bool, size_t, size_t, void*, cl_uint, cl_event const*, cl_event*)
    (commandQueue=0x7fffb8011100, buffer=0x7ffe89e23b10, blockingRead=1, offset=0, cb=270720, ptr=0x7ffe89bf13c0, numEventsInWaitList=0, eventWaitList=0x0, event=0x0)
    at /usr/src/debug/dev-libs/intel-compute-runtime-23.13.26032.17/compute-runtime-23.13.26032.17/opencl/source/api/api.cpp:2347
#10 0x0000000000ce18dc in cv::ocl::OpenCLAllocator::download(cv::UMatData*, void*, int, unsigned long const*, unsigned long const*, unsigned long const*, unsigned long const*) const ()
#11 0x0000000000d131b9 in cv::UMat::copyTo(cv::_OutputArray const&) const ()
#12 0x0000000000dfafa2 in cv::dnn::dnn4_v20221220::Net::Impl::getBlob(cv::dnn::dnn4_v20221220::detail::LayerPin const&) const ()
#13 0x0000000000e13761 in cv::dnn::dnn4_v20221220::Net::Impl::forward(cv::_OutputArray const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) ()
#14 0x0000000000df4905 in cv::dnn::dnn4_v20221220::Net::forward(cv::_OutputArray const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) ()
```

JablonskiMateusz commented 1 year ago

Hi @richy759, thanks for reporting the issue. It looks like a missing lock in the driver. A fix is in progress.
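
For readers unfamiliar with this class of bug, here is a minimal sketch of how a missing lock can produce a crash like the one in the backtrace, where the fault lands inside `pthread_mutex_lock` on a mutex owned by a per-object handler. This is not the driver's code and not the actual fix. `Handler`, `Storage`, and `stress` are hypothetical names, and the assumption (not confirmed in this thread) is that a handler can be removed by one thread while another thread is still looking it up. The sketch shows the guarded pattern: the storage-level lock is held for the whole lookup, so a handler cannot be destroyed mid-call.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a per-buffer handler that owns its own mutex.
struct Handler {
    std::mutex mtx;
    std::vector<const void*> mappedPtrs;

    bool find(const void* ptr) {
        // If *this had already been destroyed, this lock would touch freed
        // memory, which is one way to fault inside pthread_mutex_lock.
        std::lock_guard<std::mutex> lock(mtx);
        for (const void* p : mappedPtrs) {
            if (p == ptr) return true;
        }
        return false;
    }
};

// Hypothetical storage that maps buffer ids to handlers.
class Storage {
public:
    void add(int id, const void* ptr) {
        std::lock_guard<std::mutex> lock(storageMtx);
        handlers[id].mappedPtrs.push_back(ptr);
    }

    void remove(int id) {
        std::lock_guard<std::mutex> lock(storageMtx);
        handlers.erase(id);
    }

    bool contains(const void* ptr) {
        // Holding the storage lock across the iteration keeps remove() from
        // destroying a handler while find() is running on it. Omitting this
        // lock is the kind of race sketched here.
        std::lock_guard<std::mutex> lock(storageMtx);
        for (auto& entry : handlers) {
            if (entry.second.find(ptr)) return true;
        }
        return false;
    }

private:
    std::mutex storageMtx;
    std::unordered_map<int, Handler> handlers;
};

// One thread adds and removes handlers while the caller performs lookups.
// Returns how many lookups found the pointer; the count varies run to run.
int stress(int iterations) {
    Storage storage;
    int dummy = 0;
    std::thread writer([&storage, &dummy, iterations] {
        for (int i = 0; i < iterations; ++i) {
            storage.add(i % 8, &dummy);
            storage.remove(i % 8);
        }
    });
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (storage.contains(&dummy)) ++hits;
    }
    writer.join();
    assert(hits >= 0 && hits <= iterations);
    return hits;
}
```

Calling `stress(100000)` from a small `main` and running it under ThreadSanitizer (`-fsanitize=thread`) is one way to check that a pattern like this is race-free. Removing the lock in `contains` should make the race show up in such a run.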

JablonskiMateusz commented 1 year ago

The change has been merged: https://github.com/intel/compute-runtime/commit/04afb637177f8e9a9c83fca40d4e73e3f4faa6cb. @richy759, could you confirm that it works for you?

richy759 commented 1 year ago

I haven't been able to reproduce this problem, so I'd say it's sorted. Thanks for the speedy fix!