intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Segmentation fault during Level Zero library exit on CentOS 7.4 #675

Closed bosheng1 closed 10 months ago

bosheng1 commented 10 months ago

I ran xpu-smi in a dGPU environment and hit a segmentation fault. The issue occurs on CentOS 7.4; Ubuntu 20.04 works well.

Level Zero source:
repository: https://github.com/intel/compute-runtime
branch: releases/23.22
revision: e75654a07f269d49d74bd8e32a08ded38da0955e

xpu-smi discovery

+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Flex 140                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0000-9e34-11e0b30e7c0a                                           |
|           | PCI BDF Address: 0000:9e:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Data Center GPU Flex 140                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0000-f033-d4dbd6c46f8f                                           |
|           | PCI BDF Address: 0000:a2:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
Segmentation fault (core dumped)

gdb backtrace

#0  0x00007ffff6873598 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007ffff72dda74 in std::string::append(std::string const&) () from /lib64/libstdc++.so.6
#2  0x00007ffff1c8eb38 in driverHandleDestructor () at /home/media/compute-runtime/level_zero/core/source/linux/driver_teardown.cpp:31

    void __attribute__((destructor)) driverHandleDestructor() {
        std::string loaderLibraryName = "lib" + L0::loaderLibraryFilename + ".so.1";
        L0::setDriverTeardownHandleInLoader(loaderLibraryName);
        L0::globalDriverTeardown();
    }

After triage, I found that the variable L0::loaderLibraryFilename has already been released by the time driverHandleDestructor runs, which causes the segmentation fault. It works when I use std::string loaderLibraryName = "libze_loader.so.1"; instead. Maybe registering an atexit callback would be better.
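For readers hitting the same crash, here is a minimal sketch of one way to avoid reading a destroyed global from a destructor function: keep the library base name in a constexpr character array, which has no destructor, and build the full name locally. This is not necessarily what commit 4f68822 does. The namespace `sketch`, the helper `buildLoaderLibraryName`, and the function `teardownSketch` are hypothetical names for illustration, and the value "ze_loader" is inferred from the workaround string above.

```cpp
#include <cassert>
#include <string>

namespace sketch {  // hypothetical namespace, not the driver's real L0 namespace

// A constexpr char array lives in static storage and has no destructor,
// so it is still readable while the process is exiting.
// "ze_loader" is inferred from the reporter's workaround string.
constexpr char loaderLibraryFilename[] = "ze_loader";

// Builds the versioned library name from the constant above.
std::string buildLoaderLibraryName() {
    std::string name = "lib";
    name += loaderLibraryFilename;
    name += ".so.1";
    return name;
}

}  // namespace sketch

#if defined(__GNUC__)
// Runs at process exit or library unload (GCC/Clang extension).
// It only touches a local std::string and the constexpr array,
// never a global object with a non-trivial destructor.
__attribute__((destructor)) void teardownSketch() {
    const std::string name = sketch::buildLoaderLibraryName();
    assert(name == "libze_loader.so.1");
    // A real driver would hand `name` to its teardown logic here.
}
#endif
```

Because the only global involved is trivially destructible, the result does not depend on the order in which destructor functions and global object destructors run on a given platform.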

JablonskiMateusz commented 10 months ago

Hi @bosheng1. Thanks for reporting the issue. Could you please check whether commit https://github.com/intel/compute-runtime/commit/4f68822a78d8426e67921587366e4333c08e1a4a fixes it?

bosheng1 commented 10 months ago

@JablonskiMateusz Thanks for the quick fix! With commit https://github.com/intel/compute-runtime/commit/4f68822a78d8426e67921587366e4333c08e1a4a picked up, the segmentation fault no longer occurs.