intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

SYCL: the level_zero backend is not detected #6342

Open fcharras opened 2 years ago

fcharras commented 2 years ago

Describe the bug

After following the Getting Started guide and related documentation, I can build simple-sycl-app.exe. It runs fine, but not with the level_zero backend:

root@5aad1915b0db:/project# SYCL_DEVICE_FILTER="level_zero:gpu" ./simple-sycl-app.exe 
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  No device of requested type available. -1 (CL_DEVICE_NOT_FOUND)

Aborted (core dumped)
root@5aad1915b0db:/project# SYCL_DEVICE_FILTER="opencl:gpu" ./simple-sycl-app.exe 
The results are correct!

Several issues to report:

    > product: Celeron N3350/Pentium N4200/Atom E3900 Series Integrated Graphics Controller
    > capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
    > configuration: driver=i915 latency=0

To Reproduce

I built the sycl branch following the instructions in https://intel.github.io/llvm-docs/GetStartedGuide.html#install-low-level-runtime from within the ubuntu2004_base container, then compiled simple-sycl-app.exe with clang++ -fsycl simple-sycl-app.cpp -o simple-sycl-app.exe.

I've run the app in various environments:

with the same outcome. I also tried the dpctl Python package: dpctl.get_devices() does show the opencl gpu backend but not the level_zero backend.

I'm not sure whether to suspect a hardware compatibility issue with level_zero or a problem with the Intel runtime, or how to gather more information.

Environment (please complete the following information):

AlexeySachkov commented 2 years ago

Hi @fcharras,

when a device is not found, the app exit with a segfault rather than a clean exit with an error code.

Yeah, that is a result of missing error handling in our simple-app example: in SYCL, errors are reported through exceptions, and since there is no "global" try..catch block in the sample, every error causes a crash due to an unhandled exception.

I'm not sure what is wrong with the level_zero backend here. Is it at build time (is it mandatory to pass the l0 headers flag to get level_zero support ?), an issue with the level_zero runtime, or could it be that my device is not compatible ? how can I figure out ?

There are several layers where the problem could occur: it could be that the L0 PI plugin (an interface between the SYCL runtime and the L0 runtime) is not built, not found, or not properly loaded; it could be that the L0 runtime itself can't be loaded, etc.

For starters, I would suggest launching your app with the SYCL_PI_TRACE=1 environment variable set. It will print info about the loaded plugins in the following form:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 10.12.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 10.12.1 ]

Depending on the output, we should be able to better understand where to proceed with further investigation.
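For anyone who wants to check such a trace programmatically, here is a minimal Python sketch. It assumes the trace text has already been captured into a string, and that the "Plugin found and successfully loaded" wording matches the lines shown above; the function names loaded_plugins and level_zero_loaded are hypothetical helpers, not part of any SYCL or dpctl API.

```python
import re

# Matches trace lines such as:
#   SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 10.12.1 ]
# The trailing "[ PluginVersion: ... ]" part is optional and ignored.
LOADED_RE = re.compile(r"Plugin found and successfully loaded:\s*(\S+)")


def loaded_plugins(trace_text):
    """Return the plugin library names reported as loaded, in order of appearance."""
    return [match.group(1) for match in LOADED_RE.finditer(trace_text)]


def level_zero_loaded(trace_text):
    """True if the Level Zero PI plugin appears among the loaded plugins."""
    return "libpi_level_zero.so" in loaded_plugins(trace_text)
```

If libpi_level_zero.so is absent from the result, the plugin was not loaded at all, which points at the plugin or its dependencies rather than at the device.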

is it mandatory to pass the l0 headers flag to get level_zero support ?

The L0 plugin should be built by default. If you don't specify both L0_INCLUDE_DIR and L0_LIBRARY, then our cmake script should automatically download the Level Zero headers and loader from the corresponding GitHub repos.

fcharras commented 2 years ago

Hello @AlexeySachkov thank you for the support

To this day I'm still unable to run the level_zero backend, so I've moved forward using the opencl backend. I'm still interested in using the level_zero backend to compare its performance with opencl, and I'll gladly share more information if you need it.

I've been working inside a Docker container built on ghcr.io/intel/llvm/ubuntu2004_intel_drivers. The full Dockerfile can be found here: https://github.com/soda-inria/sklearn-numba-dpex/blob/main/docker/Dockerfile and the image is available to pull at jjerphan/numba_dpex_dev:latest. It must be run with --device=/dev/dri to enable GPU passthrough.

Other users of this container report the same issue, a missing level_zero backend within the container, across several different computers.

The SYCL_PI_TRACE output only shows that opencl is loaded, with either the gpu or the cpu device selected depending on the filter:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so
SYCL_PI_TRACE[all]: Selected device ->
SYCL_PI_TRACE[all]:   platform: Intel(R) OpenCL
SYCL_PI_TRACE[all]:   device: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
SYCL_PI_TRACE[all]: Selected device ->
SYCL_PI_TRACE[all]:   platform: Intel(R) OpenCL HD Graphics
SYCL_PI_TRACE[all]:   device: Intel(R) UHD Graphics [0x9a60]

Querying for level_zero backend just returns no device (using dpctl:

  File "dpctl/_sycl_device_factory.pyx", line 359, in dpctl._sycl_device_factory.select_default_device
dpctl._sycl_device.SyclDeviceCreationError: Default device is unavailable.

)

I do have the level_zero libraries and headers installed in the container:

root@6d3f8d1418a6:/# locate level_zero
/opt/intel/oneapi/compiler/2022.1.0/linux/include/sycl/CL/sycl/backend/level_zero.hpp
/opt/intel/oneapi/compiler/2022.1.0/linux/include/sycl/CL/sycl/detail/backend_traits_level_zero.hpp
/opt/intel/oneapi/compiler/2022.1.0/linux/include/sycl/ext/oneapi/backend/level_zero.hpp
/opt/intel/oneapi/compiler/2022.1.0/linux/include/sycl/ext/oneapi/backend/level_zero_ownership.hpp
/opt/intel/oneapi/compiler/2022.1.0/linux/lib/libpi_level_zero.so
/opt/venv/lib/libpi_level_zero.so

in the same folder as the opencl library:

/opt/intel/oneapi/compiler/2022.1.0/linux/lib/libpi_opencl.so
/opt/venv/lib/libpi_opencl.so

with /opt/venv/lib being in the LD_LIBRARY_PATH.

General information about my current host system:

AlexeySachkov commented 2 years ago

@fcharras, about SYCL_PI_TRACE: I misled you a bit. Could you please launch your app with SYCL_PI_TRACE=-1 and without setting the SYCL_DEVICE_FILTER env variable?

Trace level -1 should also print the libraries that the runtime attempted to load but failed to. Could you please also run ldd on that libpi_level_zero.so to see whether it has any unresolved dependencies?
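Since dpctl is already in use here, the same kind of load failure can also be probed from Python. The sketch below relies only on the standard ctypes module, which surfaces the dynamic loader's error message as an OSError; probe_library is a hypothetical helper name, and the library names you pass to it are whatever you want to test. Note that a successful probe really does load the library into the Python process.

```python
import ctypes


def probe_library(name):
    """Try to load a shared library by name or path.

    Returns None if the dynamic loader succeeds, otherwise the loader's
    error text (for example, which dependency could not be opened).
    """
    try:
        ctypes.CDLL(name)
    except OSError as err:
        return str(err)
    return None
```

For example, probe_library("libpi_level_zero.so") returning a message that names another library would suggest that the plugin itself is present but one of its dependencies is not.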

fcharras commented 2 years ago

Here's the stderr with SYCL_PI_TRACE=-1:

...
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so
SYCL_PI_TRACE[-1]: dlopen(libpi_level_zero.so) failed with <libze_loader.so.1: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_level_zero.so
SYCL_PI_TRACE[-1]: dlopen(libpi_cuda.so) failed with <libpi_cuda.so: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_cuda.so
SYCL_PI_TRACE[-1]: dlopen(libpi_hip.so) failed with <libpi_hip.so: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_hip.so
SYCL_PI_TRACE[-1]: dlopen(libpi_esimd_emulator.so) failed with <libpi_esimd_emulator.so: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_esimd_emulator.so
...

And indeed, it turns out libze_loader.so.1 is not found:

root@6d3f8d1418a6:/opt/venv/lib# ldd libpi_level_zero.so 
    linux-vdso.so.1 (0x00007fff73bdf000)
    libze_loader.so.1 => not found
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6ef69f8000)
    libsvml.so => /opt/venv/lib/libsvml.so (0x00007f6ef4996000)
    libirng.so => /opt/venv/lib/libirng.so (0x00007f6ef462c000)
    libimf.so => /opt/venv/lib/libimf.so (0x00007f6ef3f9e000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6ef3e4f000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6ef3e34000)
    libintlc.so.5 => /opt/venv/lib/libintlc.so.5 (0x00007f6ef3bbc000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6ef3bb6000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6ef39c4000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f6ef6b24000)
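To pick unresolved entries out of ldd output like the above without reading it by eye, a small Python sketch follows. It assumes the usual "name => target" layout shown in this listing and that the output was captured as text; missing_dependencies is a hypothetical helper name.

```python
def missing_dependencies(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines without " => " (e.g. linux-vdso, the ld-linux loader) are skipped.
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing
```

Applied to the listing above, the only entry whose target is "not found" is libze_loader.so.1.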

Here's what does exist in the container referring to libze:

root@6d3f8d1418a6:/opt/venv/lib# locate libze       
/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1
/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.23599
/var/lib/libze_intel_gpu
/var/lib/libze_intel_gpu/pci_bind_status_file
/var/lib/libze_intel_gpu/wedged_file

This is probably the culprit! How should I install this library? Is there a reason it's not pre-installed in the base ghcr.io/intel/llvm/ubuntu2004_intel_drivers image?

fcharras commented 2 years ago

This is probably the culprit ! how should I install this library, is there a reason it's not pre-installed in the base ghcr.io/intel/llvm/ubuntu2004_intel_drivers image ?

In the meantime, installing https://github.com/oneapi-src/level-zero/releases/tag/v1.8.5 does indeed enable the level_zero backend!

AlexeySachkov commented 2 years ago

is there a reason it's not pre-installed in the base ghcr.io/intel/llvm/ubuntu2004_intel_drivers image?

It most likely happened due to human error. My understanding is that those images were provided on a voluntary basis by one of our colleagues, and I'm not sure that we test them thoroughly, so here we are.

how should I install this library

This is probably the culprit ! how should I install this library, is there a reason it's not pre-installed in the base ghcr.io/intel/llvm/ubuntu2004_intel_drivers image ?

In the meantime, installing https://github.com/oneapi-src/level-zero/releases/tag/v1.8.5 does indeed enable the level_zero backend !

Glad you figured it out! I was just going to paste the same link

AlexeySachkov commented 2 years ago

is there a reason it's not pre-installed in the base ghcr.io/intel/llvm/ubuntu2004_intel_drivers image?

It is most likely happened due to human error. My understanding is that those images were provided on a voluntary basis by one of our colleagues and I'm not sure that we test them thoroughly, so here we are.

Ok, I was wrong here, we actually have a workflow which automatically updates those docker images here: https://github.com/intel/llvm/actions/workflows/sycl_containers.yaml

From what I see, we simply install an intel/compute-runtime release, such as 22.31.23852. According to the release description, the Level Zero libraries should be there, but apparently something went wrong.

KornevNikita commented 6 months ago

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@AlexeySachkov, could I ask you to take one of the following actions? :)

Thanks!

github-actions[bot] commented 4 months ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@AlexeySachkov, could you please take one of the following actions:

Thanks!
