intel / intel-extension-for-pytorch


Import Error: "libmkl_sycl_blas.so.4" and "libze_loader.so.1" cannot open shared object file: No such file or directory. #651

Closed: brendondgr closed this issue 4 months ago

brendondgr commented 5 months ago

Describe the issue

The issue I am having: when attempting to import this library, I am getting the following error:

import intel_extension_for_pytorch as ipex

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[7], line 1
----> 1 import intel_extension_for_pytorch as ipex

File ~/intelpython3/envs/dl/lib/python3.9/site-packages/intel_extension_for_pytorch/__init__.py:95
     91                 raise err
     93     kernel32.SetErrorMode(prev_error_mode)
---> 95 from .utils._proxy_module import *
     96 from .utils.utils import has_cpu, has_xpu
     98 if has_cpu():

File ~/intelpython3/envs/dl/lib/python3.9/site-packages/intel_extension_for_pytorch/utils/_proxy_module.py:2
      1 import torch
----> 2 import intel_extension_for_pytorch._C
      5 # utils function to define base object proxy
      6 def _proxy_module(name: str) -> type:

ImportError: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory

Instructions I followed for download: https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu&os=linux%2fwsl2&package=pip

Here are the versions, in case you don't want to read through it:

This may be overkill, but I added the following paths to my .bashrc:

export PATH="/home/bdgr/intel/oneapi/2024.1/bin:$PATH"
export PATH="/home/bdgr/intel/oneapi/compiler/latest/bin:$PATH"
export PATH="/home/bdgr/intel/oneapi/mkl/latest/bin:$PATH"
export PATH="/home/bdgr/intel/oneapi:$PATH"
export PATH="/home/bdgr/intel/oneapi/ccl/latest/bin:$PATH"
export PATH="/home/bdgr/intel/oneapi/mpi/latest/bin:$PATH"
export PATH="/home/bdgr/intel/oneapi/compiler/latest/:$PATH"
export PATH="/home/bdgr/intel/oneapi/mkl/latest/:$PATH"
export PATH="/home/bdgr/intel/oneapi/ccl/latest/:$PATH"
export PATH="/home/bdgr/intel/oneapi/mpi/latest:$PATH"

Other solutions I have attempted: I have also activated the oneAPI environment as well as the conda environment prior to attempting the import, but this results in a different error. Here is the order of commands (note: the conda env "dl" contains all of the mentioned libraries):

source /home/bdgr/intel/oneapi/setvars.sh
conda activate dl
python -c "import torch; import intel_extension_for_pytorch as ipex;"

This import (after activating oneAPI) results in the following error:

/home/bdgr/intelpython3/envs/dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libjpeg.so.8: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/bdgr/intelpython3/envs/dl/lib/python3.9/site-packages/intel_extension_for_pytorch/__init__.py", line 95, in <module>
    from .utils._proxy_module import *
  File "/home/bdgr/intelpython3/envs/dl/lib/python3.9/site-packages/intel_extension_for_pytorch/utils/_proxy_module.py", line 2, in <module>
    import intel_extension_for_pytorch._C
ImportError: libze_loader.so.1: cannot open shared object file: No such file or directory

I don't know what this error means. I knew the other one had something to do with oneAPI, but any help would be appreciated! I have also tried to check whether a GPU/XPU was visible in a Jupyter notebook, but none was detected.
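
(For reference, a minimal XPU visibility check along the lines of the sanity check suggested later in this thread, assuming the imports themselves succeed:)

import torch
import intel_extension_for_pytorch as ipex  # importing IPEX registers the xpu device

# True and a non-zero count mean the Arc GPU is visible to PyTorch/IPEX
print(torch.xpu.is_available())
print(torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(torch.xpu.get_device_properties(i))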

Specifications:
Operating System: Fedora Workstation 40
Graphics Card: Intel Arc A770
CPU: AMD Ryzen 5 3600
Motherboard: MSI B450-A Pro MAX (updated to the latest kernel version to support Resizable BAR)

brendondgr commented 5 months ago

Addendum - Windows 10: I wanted to add that when I follow the Windows 10 instructions and create a conda environment with all of the requirements (the proper torch, torchvision and torchaudio versions), it seems to work fine. I had also installed the versions of these that correspond to the most recent release of Intel Extension for PyTorch. However, whenever I add a new library into the mix (such as torchmetrics, monai, etc.), it seems to either downgrade, upgrade, or completely remove the previously mentioned requirements.

Here is the error after I add the new libraries; it specifically seems to point to the import of torch. As I stated, everything was working fine until I decided to install matplotlib:

import warnings
warnings.filterwarnings("ignore")

# Import Torch, MONAI and other libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.transforms import Compose, Normalize, ToTensor
from torch.utils.data import DataLoader
# from torch.utils.tensorboard import SummaryWriter
from monai.transforms import Compose, ScaleIntensity, EnsureChannelFirst, Resize, EnsureChannelFirstd
from monai.networks.layers import Norm
from monai.networks.nets import UNet
from monai.losses import DiceLoss
from monai.inferers import sliding_window_inference
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set random seed
torch.manual_seed(0)

# Import custom classes/functions
from cg_Dataset import load_data

Specifically, the error is the following:

OSError: [WinError 127] The specified procedure could not be found. Error loading "..\anaconda3\envs\dl_bu\lib\site-packages\torch\lib\backend_with_compiler.dll" or one of its dependencies.
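
(A quick, minimal way to see whether a later install changed the PyTorch/Intel stack is to compare the relevant package versions before and after adding the new library; these are plain conda commands, nothing specific to IPEX:)

# list the PyTorch / Intel runtime packages and their versions
conda list torch
conda list mkl
conda list intel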

brendondgr commented 5 months ago

My Windows Solution (So far): I'm still working out the kinks on this, so as I come up with solutions, I will share them. To get everything working on Windows, I had to do things in the following order.

  1. Create the environment using the following commands, replacing 3.x with the version you want (3.10 personally worked best for me):

     conda create -n general intelpython3_core python=3.x
     conda activate general
     conda install pkg-config libuv

  2. Now install each of the packages you want using conda.
  3. When installing matplotlib (or seaborn) there may be an error with "DLL load failed while importing _imaging"; this can be resolved with the following commands:

     python -m pip install --force-reinstall Pillow
     python -m pip install --force-reinstall matplotlib

  4. You may also need to reinstall (using conda) the library "mkl".
  5. Then, if need be, install any other libraries using conda; it will tell you if it is going to downgrade anything relating to Intel. If it does, avoid it like the plague, as this seems to mess up the environment in very odd ways...

So far the libraries that I haven't been able to get working are the following:

  1. torchmetrics (Downgrades any 2024.1.2 OneAPI item to 2023.2.4, as well as intelpython!)
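
(A minimal sketch of how to check, before committing, whether a package such as torchmetrics will touch the Intel/oneAPI components; these are standard conda/pip options, not anything IPEX-specific:)

# preview the transaction without applying it
conda install --dry-run torchmetrics

# keep already-installed packages fixed while solving
conda install --freeze-installed torchmetrics

# or bypass the conda solver and skip dependencies entirely
python -m pip install --no-deps torchmetrics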
vishnumadhu365 commented 5 months ago

@brendondgr any luck navigating through the dependency issue? The original issue seems related to this

brendondgr commented 5 months ago

@brendondgr any luck navigating through the dependency issue? The original issue seems related to this

It seems like their issue was also on the Windows distribution, which I had issues with initially. The error involving backend_with_compiler.dll was due to certain libraries (most likely torchmetrics in my case) downgrading or completely wiping out some of the Intel Python libraries.

Along with this, I noticed that I needed to install all of the other libraries before running the command that installs the proper PyTorch and IPEX versions; otherwise those versions would end up being replaced by the previously installed libraries as well.

So, currently: I am still having the issues from the original post, where I get the import errors for "libmkl_sycl_blas.so.4" and "libze_loader.so.1". Once I activate oneAPI, the "libmkl" error goes away, but it then results in the "libze" error. This error specifically occurs on Fedora Workstation 40.

vishnumadhu365 commented 5 months ago

The libze_loader error might be due to issues with the level_zero driver. Can you activate oneAPI (source /opt/intel/oneapi/setvars.sh), run the command 'sycl-ls', and check whether it properly lists the 'oneapi_level_zero:gpu' device?
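
(The exact formatting of sycl-ls output varies with the oneAPI version, but on a working setup there should be an entry mentioning a level_zero GPU backend, roughly along these lines, shown purely for illustration:)

[level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.xxxxx]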

brendondgr commented 5 months ago

The libze_loader error might be due to issues with the level_zero driver. Can you activate oneAPI (source /opt/intel/oneapi/setvars.sh), run the command 'sycl-ls', and check whether it properly lists the 'oneapi_level_zero:gpu' device?

When running the command sycl-ls I get the following output (After activating OneAPI with setvars.sh):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 3600 6-Core Processor               OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:cpu:2] Intel(R) OpenCL, AMD Ryzen 5 3600 6-Core Processor               OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [24.09.28717.17]
vishnumadhu365 commented 5 months ago

Seems like level_zero support is missing. Have you already tried installing the drivers for the A770?

vishnumadhu365 commented 5 months ago

Currently, for Linux distros, I see that the A770 drivers are validated only for Ubuntu 22.04. Possible to switch to Ubuntu?

If not, check if the RHEL install path works for Fedora --> https://dgpu-docs.intel.com/driver/installation.html#red-hat-enterprise-linux-package-repository

brendondgr commented 5 months ago

Seems like level_zero support is missing. Have you already tried installing the drivers for the A770?

These should be installed, given that running the command glxinfo | grep -e "OpenGL vendor" -e "OpenGL renderer" produces the following output:

OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) Arc(tm) A770 Graphics (DG2)

So the drivers in use are "Mesa". Running sudo intel_gpu_top also works properly. Either way, if there's no clear solution on Fedora I may have to switch back to Ubuntu, but this may take a few hours.

vishnumadhu365 commented 5 months ago

These should be installed, given that running the command glxinfo | grep -e "OpenGL vendor" -e "OpenGL renderer" produces the following output:

The OpenGL and Mesa drivers provide the media/graphics runtime, which seems to be installed fine. But IPEX needs the compute runtime, through the Level Zero driver.
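
(A quick way to check whether the Level Zero loader and compute runtime are actually present; a minimal sketch, with package names taken from the install commands later in this thread:)

# is the Level Zero loader visible to the dynamic linker? (distro-agnostic)
ldconfig -p | grep libze_loader

# list installed compute-runtime packages
# Fedora/RHEL:   rpm -qa | grep -iE 'level-zero|intel-opencl'
# Debian/Ubuntu: dpkg -l | grep -E 'intel-opencl-icd|intel-level-zero-gpu|level-zero'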

Either way, if there's no clear solution on Fedora I may have to switch back to Ubuntu, but this may take a few hours.

Sure, once on Ubuntu 22.04,

  1. follow the driver install guide (sections 3.1.3 and 3.1.4) --> https://dgpu-docs.intel.com/driver/client/overview.html
  2. install oneAPI Base Toolkit 2024.1.0
  3. Activate oneAPI, and verify sycl-ls lists the level_zero gpu device
brendondgr commented 5 months ago

Currently, for Linux distros, I see that the A770 drivers are validated only for Ubuntu 22.04. Possible to switch to Ubuntu?

So I switched over to Ubuntu 22.04 LTS and everything seems to be functioning properly. The only issue I am now having is that my Jupyter Environment is not detecting the graphics card. I have confirmed several times over that the graphics drivers are installed properly, so I don't believe this is an issue.

However, I was curious whether there is a reason why it is not being detected properly.

Activate oneAPI, and verify sycl-ls lists the level_zero gpu device

Here is my output for sycl-ls on Ubuntu:

(base) bdgr@bdgr-MS-7B86:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 3600 6-Core Processor               OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.43.027642]
[opencl:cpu:3] Intel(R) OpenCL, AMD Ryzen 5 3600 6-Core Processor               OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]

Is this what you were looking for or is there still an issue?

Edit: I realize that level_zero is not listed in that output. I am going to look into other ways to resolve this. If you or anyone has a suggestion, I am all ears.

vishnumadhu365 commented 5 months ago

Is this what you were looking for or is there still an issue?

Still can't see the level_zero gpu getting listed.

Are you able to successfully run the following on the CLI?

  1. Sanity check

    python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

     The imports should work, and the package versions and the GPU name should be printed

  2. Try running an IPEX sample here

brendondgr commented 5 months ago

Sanity check

The output is the following:

2.1.0.post2+cxx11.abi
2.1.30+xpu

This matches what I am finding in my Jupyter notebook.

Running an example, such as the ResNet50 example, results in the following error, which is just a kernel crash:

The Kernel crashed while executing code in the current cell or a previous cell. 
Please review the code in the cell(s) to identify a possible cause of the failure. 
Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. 
View Jupyter [log](command:jupyter.viewOutput) for further details.

When looking into the log output, all it says is:

17:06:17.391 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 
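
(A kernel crash in VS Code/Jupyter often hides the real error; one way to surface it is to run the same sample as a plain script from an oneAPI-activated terminal. Sketch only; the script name below is a placeholder for whichever IPEX example is being run, and the setvars.sh path depends on where oneAPI was installed:)

source /opt/intel/oneapi/setvars.sh
conda activate dl
python -u resnet50_example.py   # placeholder file name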
vishnumadhu365 commented 5 months ago

Sanity check The output is the following:
2.1.0.post2+cxx11.abi
2.1.30+xpu

It's not listing the GPU name.

One last try: check whether the current user is in the 'render' group (reference)

# check if a group named 'render' exists
stat -c "%G" /dev/dri/render*

# check if current user is in render group
groups ${USER}

# Add current user to render group and refresh terminal
sudo gpasswd -a ${USER} render  && newgrp render

Login to a new terminal >> activate oneapi >> try sycl-ls again

brendondgr commented 5 months ago

Login to a new terminal >> activate oneapi >> try sycl-ls again

I added my user to the render group and have confirmed that I was added. However, sycl-ls still looks the same, and running the "sanity check" does not print out the graphics card.

brendondgr commented 5 months ago

Okay, I believe I resolved it. I wasn't sure what was occurring, but I had a sneaking suspicion that it had something to do with the intel compute runtime not being installed correctly, so I ran the following commands to install the DKMS kernel modules and then installed the compute, media and display runtimes. First, I added the Intel graphics repository.

Graphics Repository:

sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/graphics/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo 'deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc' | sudo tee /etc/apt/sources.list.d/intel.gpu.jammy.list

Installing the Modules and Runtimes:

sudo apt-get install -y intel-platform-vsec-dkms intel-platform-cse-dkms intel-i915-dkms intel-fw-gpu
sudo apt-get install -y intel-opencl-icd intel-level-zero-gpu level-zero intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2

I then ran my Python code and it found the GPU immediately. Just in case, however, I also ran the following command to install the Intel compute runtime package by itself (but I did not check whether this alone resolved the issue...):

sudo apt-get install -y intel-opencl-icd

I will keep this open temporarily if there are any further issues or if there are any questions for me.

vishnumadhu365 commented 5 months ago

Oh that's great! Glad that it finally worked


brendondgr commented 4 months ago

I haven't done much deep learning coding since my last post, but when I have, it seems to be working properly. I'm going to close this now, since the previous steps seem to have worked so far.

Thank you @vishnumadhu365 for the assistance and ideas. :)