oneapi-src / level-zero

oneAPI Level Zero Specification Headers and Loader
https://spec.oneapi.com/versions/latest/elements/l0/source/index.html
MIT License

Finding the cause for ZE_RESULT_ERROR_UNINITIALIZED #140

Open maleadt opened 5 months ago

maleadt commented 5 months ago

I'm working on oneAPI.jl, which provides Julia support for Intel GPUs through Level Zero. Occasionally, users report an opaque ZE_RESULT_ERROR_UNINITIALIZED when we call zeInit while loading oneAPI.jl. This error is unhelpful, and it makes it impossible to use the Level Zero APIs to figure out what's actually happening. For example, I've run into:

  • users not having a (supported) GPU
  • restrictive permissions on /dev/dri
  • conflicting library versions picked up (e.g. redistributed libze_loader vs system libze_tracing_layer)

Apart from the last one, I wouldn't expect the loader to fail to initialize, but rather to still allow iterating drivers (why else have this abstraction?), and ideally to make it possible to determine why there are no devices. Currently, we typically find this out only after a painstaking remote debugging session using strace or LD_DEBUG.
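
Missing device nodes and restrictive permissions on /dev/dri are among the things that can be checked from user space before zeInit is called. A minimal sketch, assuming a Linux system and the usual renderD* naming of render nodes; the function name `check_dri_nodes` is hypothetical and not part of Level Zero or oneAPI.jl:

```python
import glob
import os


def check_dri_nodes(dri_dir="/dev/dri"):
    """Return a list of (path, problem) tuples; an empty list means nothing suspicious.

    Only inspects file-system state; it does not talk to Level Zero.
    """
    if not os.path.isdir(dri_dir):
        return [(dri_dir, "directory missing (no DRM driver loaded?)")]

    # Assumption: the compute runtime opens the renderD* nodes.
    nodes = sorted(glob.glob(os.path.join(dri_dir, "renderD*")))
    if not nodes:
        return [(dri_dir, "no renderD* nodes (no supported GPU?)")]

    problems = []
    for node in nodes:
        # os.access evaluates permissions for the current (real) user.
        if not os.access(node, os.R_OK | os.W_OK):
            problems.append((node, "not readable and writable by current user"))
    return problems


if __name__ == "__main__":
    for path, problem in check_dri_nodes():
        print(f"{path}: {problem}")
```

An empty result does not prove that zeInit will succeed; it only rules out these two file-system level causes.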

Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (CUDA_ERROR_NO_DEVICE, CUDA_ERROR_DEVICE_UNAVAILABLE, CUDA_ERROR_DEVICE_NOT_LICENSED, etc.).


Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: https://github.com/JuliaGPU/oneAPI.jl/issues/399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.
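
As a complement to LD_DEBUG, the set of Level Zero libraries that a process has actually mapped can be read from /proc/<pid>/maps. The following is a rough sketch, assuming Linux and that the relevant libraries are named libze_*; both helper names are hypothetical:

```python
import os
import re
import sys


def mapped_ze_libraries(maps_text):
    """Return the sorted, de-duplicated paths of mapped libze_* shared objects.

    maps_text is the content of /proc/<pid>/maps.
    """
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(r"/libze_[^/]*\.so", fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)


def library_directories(paths):
    """Return the distinct directories; more than one hints at a mixed install."""
    return sorted({os.path.dirname(p) for p in paths})


if __name__ == "__main__":
    # Usage: python ze_maps.py <pid>   (defaults to this process)
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/maps") as f:
        libs = mapped_ze_libraries(f.read())
    print("\n".join(libs))
    if len(library_directories(libs)) > 1:
        print("warning: Level Zero libraries are loaded from several directories")
```

Libraries coming from more than one directory are not necessarily a problem, but they are worth a closer look when a redistributed loader is combined with system-installed components.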

I've found out about some environment variables to increase logging, but the output isn't very helpful:

❯ ZE_ENABLE_LOADER_DEBUG_TRACE=1 julia ...

ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
❯ NEOReadDebugKeys=1 PrintDebugMessages=1 PrintXeLogs=1 julia ...
...
INFO: System Info query failed!
WARNING: Failed to request OCL Turbo Boost
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
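
When collecting such traces from users, it may help to separate libraries that could not be loaded at all from drivers that loaded but then failed zeInit. A small sketch; the two patterns are derived only from the trace lines shown above and are an assumption about the format, which may differ between loader versions:

```python
import re

# Patterns copied from the trace output in this thread (assumed format).
LOAD_FAILED = re.compile(
    r"ZE_LOADER_DEBUG_TRACE:Load Library of (\S+) failed with (.*)"
)
INIT_FAILED = re.compile(
    r"ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on (\S+) , "
    r"driver will be removed\. zeInit failed with (\S+)"
)


def summarize_trace(text):
    """Map library name to failure reason, split by failure kind."""
    summary = {"load_failed": {}, "init_failed": {}}
    for line in text.splitlines():
        m = LOAD_FAILED.search(line)
        if m:
            summary["load_failed"][m.group(1)] = m.group(2)
            continue
        m = INIT_FAILED.search(line)
        if m:
            summary["init_failed"][m.group(1)] = m.group(2)
    return summary
```

An entry under `init_failed` means the library was found and loaded, so the problem lies inside the driver's own initialization rather than in library discovery.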

Any other suggestions on how to debug this would be much appreciated.

eero-t commented 5 months ago

> Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: https://github.com/JuliaGPU/oneAPI.jl/issues/399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.

Is this with the very latest kernel?

I.e. does this help:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48

See: https://github.com/intel/compute-runtime/issues/710
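
To try the two variables without changing the user's shell configuration, a failing command can be launched with a modified environment. A minimal sketch; the helper name `neo_debug_env` and the script name in the usage comment are hypothetical, and the variable names and values are exactly those given above:

```python
import os
import subprocess
import sys


def neo_debug_env(base=None):
    """Return a copy of the environment with the two variables set."""
    env = dict(os.environ if base is None else base)
    env["NEOReadDebugKeys"] = "1"
    env["OverrideGpuAddressSpace"] = "48"
    return env


if __name__ == "__main__":
    # Usage: python run_with_override.py <command> [args...]
    if len(sys.argv) < 2:
        sys.exit("usage: run_with_override.py <command> [args...]")
    result = subprocess.run(sys.argv[1:], env=neo_debug_env())
    sys.exit(result.returncode)
```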

eero-t commented 5 months ago

Having debugged several of these issues, I think this is a rather important bug...

> For example, I've run into:
>
>   • users not having a (supported) GPU

Or there being some mismatch between user-space and kernel driver:

>   • restrictive permissions on /dev/dri
>   • conflicting library versions picked up (e.g. redistributed libze_loader vs system libze_tracing_layer)

Or the frontend implementing zesInit() while the backend is an older one that does not implement it (as is the case in Ubuntu 23.10): https://github.com/intel/compute-runtime/issues/650

> Apart from the last one, I wouldn't expect the loader to fail to initialize, but rather to still allow iterating drivers (why else have this abstraction?), and ideally to make it possible to determine why there are no devices. Currently, we typically find this out only after a painstaking remote debugging session using strace or LD_DEBUG.
>
> Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (CUDA_ERROR_NO_DEVICE, CUDA_ERROR_DEVICE_UNAVAILABLE, CUDA_ERROR_DEVICE_NOT_LICENSED, etc.).

Judging by the current Level Zero frontend sources, the frontend returns the "uninitialized" error for zesInit() regardless of whether zesInit() support is missing from the backend or the backend function returned some error (e.g. because there was no GPU).
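
One way to tell these two cases apart from the outside is to probe the backend library for the symbol. This is only a sketch: the library name is taken from the loader trace in this thread, and it is an assumption that the backend exports zesInit directly under that name rather than only through a dispatch table, so "symbol missing" should be read as a hint, not as proof. The helper name `probe_symbol` is hypothetical:

```python
import ctypes


def probe_symbol(library, symbol):
    """Return 'library not loadable', 'symbol missing' or 'symbol present'."""
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return "library not loadable"
    # Attribute lookup on a CDLL resolves the symbol; a missing symbol
    # raises AttributeError, which hasattr() reports as False.
    return "symbol present" if hasattr(lib, symbol) else "symbol missing"


if __name__ == "__main__":
    # Assumed backend library name, as seen in the loader trace.
    print(probe_symbol("libze_intel_gpu.so.1", "zesInit"))
```

Note that loading the library this way runs its initializers in the probing process, so it is best done in a throwaway interpreter.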

> Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: JuliaGPU/oneAPI.jl#399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened. ... Any other suggestions on how to debug this would be much appreciated.

Using the -k (stack trace) option of strace can give some additional clues on where things fail.
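
A possible invocation, sketched as a small command builder. The choice of traced syscalls (openat, ioctl), the output file name and the example Julia command are illustrative assumptions, and -k only works if strace was built with stack unwinding support:

```python
import shlex


def strace_command(program_args, output="ze_init.strace"):
    """Build an strace command line that records stack traces for the given program."""
    return [
        "strace",
        "-k",                        # print a stack trace after each syscall
        "-f",                        # follow child processes and threads
        "-e", "trace=openat,ioctl",  # assumed to be the calls of interest
        "-o", output,                # write the trace to a file
        *program_args,
    ]


if __name__ == "__main__":
    # Hypothetical example command; replace with the failing program.
    print(shlex.join(strace_command(["julia", "-e", "using oneAPI"])))
```

The stack trace printed after the last failing ioctl or openat before zeInit returns is usually the place to start reading.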