intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Apache License 2.0
1.57k stars 241 forks

RuntimeError: Number of dpcpp devices should be greater than zero! #287

Open axel588 opened 1 year ago

axel588 commented 1 year ago

Hello, I used the GPU configuration. oneAPI is installed correctly, and I am in a Python virtual environment (ai_tr). I have this issue with PyTorch. The two imports are at the top of the file:

import torch
import intel_extension_for_pytorch as ipex

I ran source ${ONEAPI_ROOT}/setvars.sh, with this output:

(ai_tr) axel@Artishima:~/ai_tr/cod$ source ${ONEAPI_ROOT}/setvars.sh

:: WARNING: setvars.sh has already been run. Skipping re-execution.
   To force a re-execution of setvars.sh, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.

usage: source setvars.sh [--force] [--config=file] [--help] [...]
  --force        Force setvars.sh to re-run, doing so may overload environment.
  --config=file  Customize env vars using a setvars.sh configuration file.
  --help         Display this help message and exit.
  ...            Additional args are passed to individual env/vars.sh scripts
                 and should follow this script's arguments.

  Some POSIX shells do not accept command-line options. In that case, you can pass
  command-line options via the SETVARS_ARGS environment variable. For example:

  $ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
  $ . path/to/setvars.sh

  The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.

With --force:

(ai_tr) axel@Artishima:~/ai_tr/cod$ source ${ONEAPI_ROOT}/setvars.sh --force

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

But this error keeps appearing whenever I try to run my training Python file:

xpu
/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()
/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py:985: UserWarning: dpcppSetDevice: device_id is out of range (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159.)
  return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
Traceback (most recent call last):
  File "/home/axel/ai_tr/cod/train.py", line 190, in <module>
    m = model.to(device)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Number of dpcpp devices should be greater than zero!

Everything related to MKL is installed correctly, and the paths are set correctly and working. I am on Ubuntu 22.04 using torch 1.13 under WSL 2 on Windows 11, with the Intel drivers installed on Windows 11, on an Arc A770 with an i9-13900K.

The error is triggered here:


#the line below is triggering the error
m = model.to(device)
m = ipex.optimize(m)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
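
Before moving the model, it can help to confirm that the extension actually sees a device. The following is a minimal sketch, assuming the XPU build of intel_extension_for_pytorch exposes torch.xpu.is_available() and torch.xpu.device_count(). The helper name pick_device is hypothetical, and falling back to the CPU is an assumption about what the script should do, not something stated in this thread.

```python
def pick_device():
    """Return "xpu" if an XPU device is visible, otherwise "cpu"."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
    except (ImportError, OSError) as exc:
        # Covers a missing package as well as a shared library that fails to load.
        print(f"XPU backend could not be imported: {exc}")
        return "cpu"

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available() and xpu.device_count() > 0:
        return "xpu"
    print("No XPU device visible; falling back to CPU.")
    return "cpu"


device = pick_device()
print("Using device:", device)
```

With a check like this, a zero device count shows up as an explicit message rather than as an exception raised inside model.to(device).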

Also, optimize is an odd name.

It does seem to be an out-of-range issue, and I have no idea how to solve it.

axel588 commented 1 year ago

I updated to PyTorch 1.13.1 instead of 1.13.0a0 ... the documentation should be updated.

Now I have this error:


Traceback (most recent call last):
  File "/home/axel/ai_tr/cod/train.py", line 2, in <module>
    import intel_extension_for_pytorch as ipex
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/__init__.py", line 2, in <module>
    from . import cpu
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/__init__.py", line 2, in <module>
    from . import runtime
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>
    from .multi_stream import MultiStreamModule, get_default_num_streams, \
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in <module>
    import intel_extension_for_pytorch._C as core
ImportError: /home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
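
The missing symbol demangles to torch::autograd::Node::name[abi:cxx11]() const. The B5cxx11 fragment is the mangled form of the cxx11 ABI tag, so one plausible reading (not confirmed in this thread) is that the extension library expects a PyTorch built with the C++11 ABI while the installed wheel was built without it. The sketch below checks a symbol name for that tag and, if torch is importable, prints torch.compiled_with_cxx11_abi(). The helper name needs_cxx11_abi is hypothetical.

```python
CXX11_ABI_TAG = "B5cxx11"  # Itanium mangling of the [abi:cxx11] tag


def needs_cxx11_abi(symbol):
    """Return True if a mangled symbol name carries the cxx11 ABI tag."""
    return CXX11_ABI_TAG in symbol


symbol = "_ZNK5torch8autograd4Node4nameB5cxx11Ev"
print("symbol requires CXX11 ABI:", needs_cxx11_abi(symbol))

try:
    import torch
    print("torch built with CXX11 ABI:", torch.compiled_with_cxx11_abi())
except ImportError:
    print("torch is not importable in this environment")
```

If the two values disagree, the extension and the torch wheel most likely come from incompatible builds.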

Even stranger are the packages it installed:

(ai_tr) axel@Artishima:~/ai_tr/cod$ python -m pip install torch==1.13.1 -f https://developer.intel.com/ipex-whl-stable-xpu
Looking in links: https://developer.intel.com/ipex-whl-stable-xpu
DEPRECATION: The HTML index page being used (https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu.html) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.5/887.5 MB 2.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 65.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 6.5 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 KB 46.9 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 18.9 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /home/axel/ai_tr/lib/python3.10/site-packages (from torch==1.13.1) (4.4.0)
Collecting wheel
  Downloading wheel-0.38.4-py3-none-any.whl (36 kB)
Requirement already satisfied: setuptools in /home/axel/ai_tr/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1) (59.6.0)
Installing collected packages: wheel, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.0a0+gitb1dde16
    Uninstalling torch-1.13.0a0+gitb1dde16:
      Successfully uninstalled torch-1.13.0a0+gitb1dde16
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 wheel-0.38.4
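
The nvidia-* dependencies in the log above suggest that pip resolved torch==1.13.1 to the stock CUDA wheel from PyPI rather than to a build from the Intel index. A minimal sketch for spotting that situation in an existing environment follows; it uses only importlib.metadata, and the "nvidia-" prefix test is an assumption based on the package names shown in this log.

```python
from importlib import metadata


def cuda_runtime_packages(names):
    """Return the package names that look like NVIDIA CUDA runtime wheels."""
    return sorted(n for n in names if n.lower().startswith("nvidia-"))


installed = [dist.metadata["Name"] for dist in metadata.distributions()]
installed = [name for name in installed if name]  # skip entries with broken metadata
print("CUDA runtime wheels found:", cuda_runtime_packages(installed))
```

An empty list is what one would expect in an environment meant for the XPU build.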

I reinstalled PyTorch from the wheel:

pip install torch-1.13.0a0+gitb1dde16-cp310-cp310-linux_x86_64.whl

Still the same error: Number of dpcpp devices ..

jingxu10 commented 1 year ago

@sanchitintel Would you help to check this issue?

gujinghui commented 1 year ago

Please install the torch 1.13.0 build provided on the page below. https://github.com/intel/intel-extension-for-pytorch/releases/tag/v1.13.10%2Bxpu

asirvaiy commented 1 year ago

@axel588 could you tell us what else you are importing before the line m = model.to(device)? If you are importing matplotlib.pyplot before it, try removing that import.

sanchitintel commented 1 year ago

Hi @axel588, can you please share your full script? Please ensure that the lines import torch and import intel_extension_for_pytorch appear before any other module's imports in your Python script. We encountered a similar issue and have been using this as a workaround.
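
The import-order workaround can be checked mechanically. This is a rough sketch using the standard ast module; the function names are hypothetical, and it only inspects top-level import statements, so imports hidden inside functions or other modules are not covered.

```python
import ast

REQUIRED_FIRST = ["torch", "intel_extension_for_pytorch"]


def top_level_imports(source):
    """List the modules imported at the top level of a script, in source order."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            names.append(node.module or "")
    return names


def ipex_imported_first(source):
    """Return True if torch and then IPEX are the first two top-level imports."""
    return top_level_imports(source)[:2] == REQUIRED_FIRST


good = "import torch\nimport intel_extension_for_pytorch as ipex\nimport matplotlib.pyplot as plt\n"
bad = "import matplotlib.pyplot as plt\nimport torch\nimport intel_extension_for_pytorch as ipex\n"
print(ipex_imported_first(good), ipex_imported_first(bad))
```

To check a real script, pass its contents, for example ipex_imported_first(open("train.py").read()).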

kns1966 commented 1 year ago

I've seen the same error under WSL2 + Ubuntu 22.04 (Arc A770) with the GPU installation. Most recently, I tried the official releases listed above, for Python 3.9.x and 3.10.x.

In addition, running on the CPU returns a warning about the number of dpcpp devices. It also just crashed my machine; perhaps available memory was exhausted.

nathanodle commented 1 year ago

Check that you have Resizable BAR enabled in the system BIOS.

charitarthchugh commented 1 year ago

I am facing the same issue. I have checked whether ReBAR is enabled using lspci -v. The code that I am using is the starter code for the GPU in the README, and I am also using the packages recommended in the README.

System details: Fedora 38, i7-12700H, Arc A370M. The appropriate torchvision package (0.14.0) was installed with --no-check-deps.

jingxu10 commented 1 year ago

Could you run https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py and share the outputs?

charitarthchugh commented 1 year ago

Sure! here is the output: https://pastebin.com/ZFcLesPB

jingxu10 commented 1 year ago

Please use oneAPI basekit 2023.0 with 1.13.10. Also, it seems that you don't have Level Zero installed; please install it as well. The driver version should preferably be 540, as shown in the installation guide.

DPCPP runtime version: 2023.1.0 <====================
MKL version: 2023.1.0 <===========================
GPU models and configuration: 

Intel OpenCL ICD version: 23.05.25593.18-1.fc38
Level Zero version: N/A <==========================

dnf/yum install -y intel-opencl level-zero intel-level-zero-gpu

By the way, we will have a new release soon, so you may be able to try the new version directly.
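
A quick way to see whether the Level Zero loader is visible to the dynamic linker, as a sketch only: it assumes the loader library is named libze_loader, which is the name used by the level-zero package, and it does not verify that a GPU driver is registered behind it.

```python
from ctypes.util import find_library


def level_zero_loader():
    """Return the Level Zero loader library name if the linker can find it, else None."""
    return find_library("ze_loader")


loader = level_zero_loader()
print("Level Zero loader:", loader or "not found")
```

A "not found" result would be consistent with the "Level Zero version: N/A" line in the collect_env output above.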

charitarthchugh commented 1 year ago

I am trying to install level-zero. There seem to be no repos for Fedora, so could I install the RHEL one instead?

jingxu10 commented 1 year ago

Since Fedora is not on our list of verified OSs, we unfortunately cannot provide support. Personally, though, I would recommend trying the RHEL one.

charitarthchugh commented 1 year ago

After installing the RHEL packages for level-zero, I am faced with the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 658, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 277, in __call__
    raise e
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 267, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "<eval_with_key>.2", line 5, in forward
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: could not create an engine

This happens when calling model(data).

jingxu10 commented 1 year ago

Would you try compiling from source with the latest code, which works with basekit 2023.1 and the latest driver, 602? https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh

charitarthchugh commented 1 year ago

Will try in the coming days.

qzx1013 commented 1 year ago

@jingxu10 I have installed everything required, but I get this output from https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py.

Collecting environment information...
PyTorch version: 1.13.0a0+git6c9b55e
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.120+xpu
IPEX commit: https://github.com/intel/intel-extension-for-pytorch/commit/c2a37012e9eeb20c317cfd00583da23ca538d2b2
Build type: Release

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Is XPU available: False
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:

Intel OpenCL ICD version: 23.05.25593.18-60122.04
Level Zero version: 1.3.25593.18-60122.04

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 4950.1948
CPU min MHz: 2200.0000
BogoMIPS: 7399.63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 6 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.120+xpu
[pip3] numpy==1.21.5
[pip3] torch==1.13.0a0+git6c9b55e
[pip3] torchvision==0.14.1a0+5e8e2f1
[conda] intel-extension-for-pytorch 1.13.120+xpu pypi_0 pypi
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 1.13.0a0+git6c9b55e pypi_0 pypi
[conda] torchvision 0.14.1a0+5e8e2f1 pypi_0 pypi

fredlarochelle commented 1 year ago

@qzx1013 Not from Intel, but I might be able to assist you.

Firstly, if your DPCPP runtime version and MKL version are showing as N/A, it implies that you haven't properly initialized your environment. You need to source both the DPCPP and MKL environment scripts as follows:

source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh

This needs to be done every time you intend to use IPEX; alternatively, you can append it to your .bashrc file. Another option, if you are working in a notebook, is to execute the following as your initial code block:

!source /opt/intel/oneapi/compiler/latest/env/vars.sh
!source /opt/intel/oneapi/mkl/latest/env/vars.sh
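
Note that in Jupyter each ! command runs in its own subshell, so variables set by source there may not reach the Python kernel. A small sketch for checking what the running process actually sees follows. It assumes that MKLROOT and CMPLR_ROOT are among the variables exported by the two vars.sh scripts; the variable list and the helper name are illustrative rather than taken from this thread.

```python
import os

# Assumed indicators that the MKL and compiler environment scripts were sourced.
EXPECTED_VARS = ("MKLROOT", "CMPLR_ROOT")


def missing_oneapi_vars(env, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty in env."""
    return [name for name in expected if not env.get(name)]


missing = missing_oneapi_vars(os.environ)
if missing:
    print("Not set in this process:", ", ".join(missing))
else:
    print("oneAPI variables are visible to this process.")
```

If variables are reported missing inside a notebook, starting Jupyter from a shell where the scripts were already sourced is a more reliable route.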

Secondly, and more significantly, your system doesn't appear to recognize any XPU. Even when I run the collect_env.py script on my system without first activating the environment, the script can still detect the A770.

If you try to run the following, can it detect your GPU?

sudo apt-get install clinfo
clinfo

Steve-Tech commented 1 year ago

Not sure if this is part of the same issue, but IPEX only works for me over SSH when there's a logged-in GNOME session; otherwise I get DPCPP Device count is zero!:

Full error:

```
/home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()
terminate called after throwing an instance of 'c10::Error'
  what():  dpcppSetDevice: device_id is out of range
Exception raised from dpcppSetDevice at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x99 (0x7fb3da1bff69 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd5 (0x7fb3da188cdf in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: xpu::dpcpp::dpcppSetDevice(signed char) + 0x114 (0x7fb30fb9e224 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #3: xpu::dpcpp::set_device(signed char) + 0x20 (0x7fb30fb59bb0 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #4: xpu::dpcpp::impl::DPCPPGuardImpl::uncheckedSetDevice(c10::Device) const + 0xd (0x7fb30fb5d77d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #5: at::AtenIpexTypeXPU::resize_impl(c10::TensorImpl*, c10::ArrayRef, c10::optional >, bool) + 0xb4a (0x7fb30fb8f63a in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #6: at::AtenIpexTypeXPU::impl::empty_strided_dpcpp(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions const&) + 0xcb (0x7fb318c68c2b in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #7: at::AtenIpexTypeXPU::empty_strided(c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0xe3 (0x7fb318c711c3 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #8: + 0x1817a50 (0x7fb30fc17a50 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #9: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0xf8 (0x7fb3c84fdec8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x22064ed (0x7fb3c88064ed in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::empty_strided::call(c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0x1a6 (0x7fb3c85440c6 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x1573730 (0x7fb3c7b73730 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::_to_copy(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0x112d (0x7fb3c7e7e73d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x237fe4d (0x7fb3c897fe4d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0xf8 (0x7fb3c82331c8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x22067e1 (0x7fb3c88067e1 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0xf8 (0x7fb3c82331c8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: + 0x343e17d (0x7fb3c9a3e17d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x343e610 (0x7fb3c9a3e610 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: at::_ops::_to_copy::call(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0x1e5 (0x7fb3c82b5df5 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #21: at::native::to(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) + 0x104 (0x7fb3c7e78234 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x24f7c63 (0x7fb3c8af7c63 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) + 0x1fa (0x7fb3c8412a3a in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #24: + 0x3a3ae9 (0x7fb3d35a3ae9 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x3a3f84 (0x7fb3d35a3f84 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #26: python() [0x4fde02]
frame #30: python() [0x5080ee]
frame #32: python() [0x5080ee]
frame #34: python() [0x5080ee]
frame #36: python() [0x508246]
frame #38: python() [0x5080ee]
frame #46: python() [0x592592]
frame #48: python() [0x5999cd]
frame #49: python() [0x4fceb4]
frame #54: python() [0x5b56ff]
frame #57: + 0x23a90 (0x7fb3dd223a90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #58: __libc_start_main + 0x89 (0x7fb3dd223b49 in /lib/x86_64-linux-gnu/libc.so.6)
frame #59: python() [0x5854ee]
```

Also, it seems that PyTorch imports (or imports of packages that import PyTorch) placed after the IPEX import can also trigger DPCPP Device count is zero!.
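
One possible explanation for the SSH behaviour, not confirmed in this thread, is that the user only gets access to the GPU device nodes while a desktop session is active. The sketch below checks read/write access to the DRM render nodes; the /dev/dri/renderD* location is an assumption about a native Linux setup (it does not apply to WSL2), and the helper name is hypothetical.

```python
import glob
import os


def render_node_access(pattern="/dev/dri/renderD*"):
    """Map each matching device node to whether this user can read and write it."""
    return {
        node: os.access(node, os.R_OK | os.W_OK)
        for node in sorted(glob.glob(pattern))
    }


access = render_node_access()
if not access:
    print("No render nodes found under /dev/dri.")
for node, ok in access.items():
    print(node, "accessible" if ok else "NOT accessible to this user")
```

Running this once over SSH with and once without the desktop session would show whether permissions differ between the two cases.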

clinfo.txt

collect_env.txt


Edit: With IPEX 2.0.110+xpu I get Segmentation fault (core dumped) with no other information in the same scenarios.

collect_env_torch2.txt

mhoffma commented 1 year ago

I'm hitting a similar problem.

Here is the output of https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py

tgt.info.txt

Here is the output of clinfo: clinfo.txt

Here is the captured output of python Intel_Extension_For_PyTorch_Hello_World.py > ~/hello_world.txt 2>&1

hello_world.txt

I get the same warning:

/opt/intel/oneapi/intelpython/latest/envs/pytorch/lib/python3.9/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.) _C._initExtension()

So I'm unsure whether things are working correctly for me. Please take a look, verify that the output is as expected, and tell me whether I'm utilizing the AI hardware correctly.

Thanks, Marc

jingxu10 commented 1 year ago

You can try explicitly exporting LD_PRELOAD, set either to the libstdc++.so path in your OS or to ${CONDA_PREFIX}/lib/libstdc++.so.
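
As a rough sketch of locating a library to preload: the function below looks under CONDA_PREFIX first and then at a system path. The system path shown is an assumption for Debian/Ubuntu-style layouts and will differ on other distributions; the helper name is hypothetical.

```python
import os

# Assumed Debian/Ubuntu location; other distributions use different paths.
SYSTEM_PATHS = ("/usr/lib/x86_64-linux-gnu/libstdc++.so.6",)


def libstdcxx_candidates(env, system_paths=SYSTEM_PATHS):
    """Return existing libstdc++ paths, preferring the active conda environment."""
    candidates = []
    prefix = env.get("CONDA_PREFIX")
    if prefix:
        candidates.append(os.path.join(prefix, "lib", "libstdc++.so"))
    candidates.extend(system_paths)
    return [path for path in candidates if os.path.exists(path)]


found = libstdcxx_candidates(os.environ)
if found:
    print(f"export LD_PRELOAD={found[0]}")
else:
    print("No libstdc++.so candidate found.")
```

The printed export line has to be run in the shell before starting Python, since LD_PRELOAD is read when the process starts.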

jingxu10 commented 1 year ago

Please verify if drivers are correctly installed. Your clinfo output doesn't show GPUs.
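
To make the clinfo check repeatable, the device list can be parsed from clinfo -l. This is a sketch only: it assumes the compact listing prints one "Device #N: name" entry per device, and the sample text below is illustrative rather than taken from the attached files.

```python
import re
import shutil
import subprocess

DEVICE_LINE = re.compile(r"Device #\d+:\s*(.+)")


def opencl_devices(listing):
    """Extract device names from the text printed by `clinfo -l`."""
    return [match.group(1).strip() for match in DEVICE_LINE.finditer(listing)]


def run_clinfo():
    """Return the device names reported by clinfo, or None if clinfo is not installed."""
    if shutil.which("clinfo") is None:
        return None
    result = subprocess.run(["clinfo", "-l"], capture_output=True, text=True, check=False)
    return opencl_devices(result.stdout)


# Illustrative listing (assumed format), not real output from this thread.
sample = "Platform #0: Intel(R) OpenCL Graphics\n `-- Device #0: Intel(R) Arc(TM) A770 Graphics\n"
print(opencl_devices(sample))
print(run_clinfo())
```

An empty list from run_clinfo() means the OpenCL runtime sees no devices at all, which points at the driver installation rather than at PyTorch or IPEX.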