Open maleadt opened 5 months ago
> Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing `ZE_RESULT_ERROR_UNINITIALIZED`: https://github.com/JuliaGPU/oneAPI.jl/issues/399. `LD_DEBUG` reveals that the correct libraries are found, and `strace` shows that `/dev/dri` nodes are successfully discovered and opened.
Is this with the very latest kernel? I.e. does this help:

```
export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
```
Having debugged several of these issues, I think this is a rather important bug...

For example, I've run into:
- users not having a (supported) GPU

Or there being some mismatch between user-space and kernel driver:
- restrictive permissions on `/dev/dri`
- conflicting library versions picked up (e.g. redistributed `libze_loader` vs system `libze_tracing_layer`)

Or the frontend implementing `zesInit()`, but the backend being an older one that does not implement it (as is the case in Ubuntu 23.10): https://github.com/intel/compute-runtime/issues/650
Looking at the current Level-Zero frontend sources, it returns an "uninitialized" error for `zesInit()` regardless of whether `zesInit()` support is missing from the backend, or the backend function returned some error (e.g. because there was no GPU).
> Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing `ZE_RESULT_ERROR_UNINITIALIZED`: JuliaGPU/oneAPI.jl#399. `LD_DEBUG` reveals that the correct libraries are found, and `strace` shows that `/dev/dri` nodes are successfully discovered and opened. ... Any other suggestions on how to debug this would be much appreciated.
Using the `-k` (stack trace) option for `strace` can give some additional clues as to where things fail.
I'm working on oneAPI.jl, which provides Julia support for Intel GPUs through Level Zero. Occasionally, we run into users reporting an opaque `ZE_RESULT_ERROR_UNINITIALIZED` when we call `zeInit` during loading of oneAPI.jl. This is an unhelpful error, and it makes it impossible to use the Level Zero APIs to figure out what's actually happening. For example, I've run into:
- users not having a (supported) GPU
- restrictive permissions on `/dev/dri`
- conflicting library versions picked up (e.g. redistributed `libze_loader` vs system `libze_tracing_layer`)
- the frontend implementing `zesInit()`, but the backend being an older one that does not implement it (as is the case in Ubuntu 23.10): https://github.com/intel/compute-runtime/issues/650

Apart from the last one, I wouldn't expect the loader to fail to initialize, but still allow iterating drivers (why else this abstraction?) and ideally make it possible to determine why there are no devices. Currently, we typically find this out after a painstaking remote debugging session using `strace` or `LD_DEBUG`.

Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (`CUDA_ERROR_NO_DEVICE`, `CUDA_ERROR_DEVICE_UNAVAILABLE`, `CUDA_ERROR_DEVICE_NOT_LICENSED`, etc).

Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing `ZE_RESULT_ERROR_UNINITIALIZED`: https://github.com/JuliaGPU/oneAPI.jl/issues/399. `LD_DEBUG` reveals that the correct libraries are found, and `strace` shows that `/dev/dri` nodes are successfully discovered and opened.

I've found out about some environment variables to increase logging, but the output isn't very helpful:
Any other suggestions on how to debug this would be much appreciated.