RuntimeError: Number of dpcpp devices should be greater than zero!

axel588 commented 1 year ago

Hello, I followed the GPU configuration and oneAPI is installed correctly. I am in a Python virtual environment, ai_tr, and I have this issue with PyTorch. The two imports are at the top of the file: import torch, then import intel_extension_for_pytorch as ipex.

I ran source ${ONEAPI_ROOT}/setvars.sh, with this output:

(ai_tr) axel@Artishima:~/ai_tr/cod$ source ${ONEAPI_ROOT}/setvars.sh

:: WARNING: setvars.sh has already been run. Skipping re-execution.
   To force a re-execution of setvars.sh, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.

usage: source setvars.sh [--force] [--config=file] [--help] [...]
  --force        Force setvars.sh to re-run, doing so may overload environment.
  --config=file  Customize env vars using a setvars.sh configuration file.
  --help         Display this help message and exit.
  ...            Additional args are passed to individual env/vars.sh scripts
                 and should follow this script's arguments.

  Some POSIX shells do not accept command-line options. In that case, you can pass
  command-line options via the SETVARS_ARGS environment variable. For example:

  $ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
  $ . path/to/setvars.sh

  The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.

With --force :

(ai_tr) axel@Artishima:~/ai_tr/cod$ source ${ONEAPI_ROOT}/setvars.sh --force

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

But this error keeps appearing whenever I try to run my training Python file:

xpu
/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()
/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py:985: UserWarning: dpcppSetDevice: device_id is out of range (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159.)
  return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
Traceback (most recent call last):
  File "/home/axel/ai_tr/cod/train.py", line 190, in <module>
    m = model.to(device)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Number of dpcpp devices should be greater than zero!

Everything related to MKL is installed correctly, and the paths are set correctly and working. I am on Ubuntu 22.04 using torch 13.1 on WSL 2 on Windows 11, with the Intel drivers installed on Windows 11, on an Arc 770 with an i9 13900K.

The error is triggered here:

```
# the line below triggers the error
m = model.to(device)
m = ipex.optimize(m)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
```
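
Since the failure surfaces at `model.to(device)`, it can help to confirm that a device is actually visible before moving the model. The following is a minimal sketch, assuming the XPU build of the extension exposes `torch.xpu.device_count()`; `pick_device` and `count_xpu_devices` are hypothetical helper names, not part of either library.

```python
def pick_device(xpu_count: int) -> str:
    """Choose "xpu" only when at least one device was enumerated."""
    return "xpu" if xpu_count > 0 else "cpu"


def count_xpu_devices() -> int:
    """Return the number of visible XPU devices, or 0 if it cannot be queried."""
    try:
        import torch  # keep torch first, then the extension
        import intel_extension_for_pytorch  # noqa: F401  (assumed to register torch.xpu)
    except (ImportError, OSError):
        return 0
    xpu = getattr(torch, "xpu", None)
    return xpu.device_count() if xpu is not None else 0


device = pick_device(count_xpu_devices())
print("using device:", device)
```

Falling back to the CPU does not fix the missing device, but it turns the hard `RuntimeError` into an explicit, checkable condition.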

Also, optimize is an odd name.

It does seem to be an out-of-range issue, and I have no idea how to solve it.

axel588 commented 1 year ago

I updated to PyTorch 1.13.1 instead of 1.13.0a0 (the documentation should be updated).

Now I have this error:


Traceback (most recent call last):
  File "/home/axel/ai_tr/cod/train.py", line 2, in <module>
    import intel_extension_for_pytorch as ipex
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/__init__.py", line 2, in <module>
    from . import cpu
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/__init__.py", line 2, in <module>
    from . import runtime
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>
    from .multi_stream import MultiStreamModule, get_default_num_streams, \
  File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in <module>
    import intel_extension_for_pytorch._C as core
ImportError: /home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
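
The undefined symbol carries the `B5cxx11` ABI tag, which GCC adds to names compiled against the C++11 `std::string` ABI. That suggests the extension library expects a PyTorch build that uses the CXX11 ABI, which a generic wheel may not. A minimal sketch for checking both sides follows; `has_cxx11_abi_tag` is a hypothetical helper, while `torch.compiled_with_cxx11_abi()` is PyTorch's own function.

```python
def has_cxx11_abi_tag(mangled_symbol: str) -> bool:
    """Return True if a mangled C++ symbol carries the [abi:cxx11] tag."""
    # GCC encodes an abi_tag as "B" + length + tag name, hence "B5cxx11".
    return "B5cxx11" in mangled_symbol


missing = "_ZNK5torch8autograd4Node4nameB5cxx11Ev"
print("symbol needs CXX11 ABI:", has_cxx11_abi_tag(missing))

try:
    import torch
    print("installed torch uses CXX11 ABI:", torch.compiled_with_cxx11_abi())
except ImportError:
    print("torch is not installed in this environment")
```

If the two answers disagree, the installed torch wheel and the extension were most likely built against different ABIs.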

Even weirder are the packages it asked to install:

(ai_tr) axel@Artishima:~/ai_tr/cod$ python -m pip install torch==1.13.1 -f https://developer.intel.com/ipex-whl-stable-xpu
Looking in links: https://developer.intel.com/ipex-whl-stable-xpu
DEPRECATION: The HTML index page being used (https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu.html) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.5/887.5 MB 2.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 65.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 6.5 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 KB 46.9 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 18.9 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /home/axel/ai_tr/lib/python3.10/site-packages (from torch==1.13.1) (4.4.0)
Collecting wheel
  Downloading wheel-0.38.4-py3-none-any.whl (36 kB)
Requirement already satisfied: setuptools in /home/axel/ai_tr/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1) (59.6.0)
Installing collected packages: wheel, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.0a0+gitb1dde16
    Uninstalling torch-1.13.0a0+gitb1dde16:
      Successfully uninstalled torch-1.13.0a0+gitb1dde16
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 wheel-0.38.4

I reinstalled PyTorch from the wheel:

pip install torch-1.13.0a0+gitb1dde16-cp310-cp310-linux_x86_64.whl

Still the same error: number of dpcpp devices ..

jingxu10 commented 1 year ago

@sanchitintel Would you help to check this issue?

gujinghui commented 1 year ago

Please install the torch 1.13.0 build provided on the page below: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v1.13.10%2Bxpu

asirvaiy commented 1 year ago

@axel588 could you tell us what else you are importing before the line m = model.to(device)? If you are importing matplotlib.pyplot before it, try removing that import.

sanchitintel commented 1 year ago

Hi @axel588, can you please share your full script? Please ensure that the lines import torch and import intel_extension_for_pytorch come before any other module's imports in your Python script. We encountered a similar issue and have been using this workaround.
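
The import-order workaround can be checked mechanically. Here is a minimal sketch using only the standard library; `top_level_imports` and `ipex_imported_first` are hypothetical helper names, and the two sample sources are made up for illustration.

```python
import ast


def top_level_imports(source: str) -> list[str]:
    """Return module names imported at module level, in source order."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


def ipex_imported_first(source: str) -> bool:
    """True if torch and the extension are the first two imports, in that order."""
    return top_level_imports(source)[:2] == ["torch", "intel_extension_for_pytorch"]


# Hypothetical script headers used only to exercise the check.
good = "import torch\nimport intel_extension_for_pytorch as ipex\nimport matplotlib.pyplot as plt\n"
bad = "import matplotlib.pyplot as plt\nimport torch\nimport intel_extension_for_pytorch as ipex\n"
print(ipex_imported_first(good), ipex_imported_first(bad))
```

To check a real script, pass its contents instead, for example `ipex_imported_first(open("train.py").read())`.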

kns1966 commented 1 year ago

I've seen the same error under WSL2 + Ubuntu 22.04 (Arc A770) with the GPU installation. Most recently, I tried the official releases listed above, for Python 3.9.x and 3.10.x.

In addition, running on the CPU returns a warning about the number of dpcpp devices. It also just crashed my machine; perhaps available memory was exhausted.

nathanodle commented 1 year ago

Check that you have resizable BAR enabled in the system BIOS

charitarthchugh commented 1 year ago

I am facing the same issue. I have checked whether ReBAR is enabled using lspci -v. The code I am using is the GPU starter code from the README, and I am using the recommended packages from the README as well.

System details:
- Fedora 38
- i7-12700H
- Arc A370M
- Appropriate Torchvision package (0.14.0), installed with --no-check-deps

jingxu10 commented 1 year ago

Could you run https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py and share the outputs?

charitarthchugh commented 1 year ago

Sure! here is the output: https://pastebin.com/ZFcLesPB

jingxu10 commented 1 year ago

Please use oneAPI Base Toolkit 2023.0 with 1.13.10. Also, it seems you don't have Level Zero installed; please install it as well. The driver version should preferably be 540, as shown in the installation guide.

DPCPP runtime version: 2023.1.0 <====================
MKL version: 2023.1.0 <===========================
GPU models and configuration: 

Intel OpenCL ICD version: 23.05.25593.18-1.fc38
Level Zero version: N/A <==========================

dnf/yum install -y intel-opencl level-zero intel-level-zero-gpu
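
Spotting the `N/A` entries in a collect_env.py report can be automated. The sketch below assumes the report uses plain `name: value` lines, as in the excerpt above; `missing_components` is a hypothetical helper and the sample text is copied from that excerpt.

```python
def missing_components(report: str) -> list[str]:
    """List the "name: value" fields of a report whose value is N/A."""
    missing = []
    for line in report.splitlines():
        name, sep, value = line.partition(":")
        # Drop any trailing "<====" annotation before comparing the value.
        if sep and value.split("<")[0].strip() == "N/A":
            missing.append(name.strip())
    return missing


report = """\
DPCPP runtime version: 2023.1.0 <====================
MKL version: 2023.1.0 <===========================
Intel OpenCL ICD version: 23.05.25593.18-1.fc38
Level Zero version: N/A <==========================
"""
print(missing_components(report))
```

An empty result only means every component reported a version; it does not confirm that the versions match the ones the release expects.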

By the way, we will have a new release soon, so you can probably try the new version directly.

charitarthchugh commented 1 year ago

I am trying to install level-zero. There seem to be no repos for Fedora, so would I be able to install the RHEL one?

jingxu10 commented 1 year ago

Since Fedora is not in the list of our verified OSs, we unfortunately cannot provide support. Personally, though, I would recommend trying the RHEL one.

charitarthchugh commented 1 year ago

After installing the RHEL packages for level-zero, I am faced with the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 658, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 277, in __call__
    raise e
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 267, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "<eval_with_key>.2", line 5, in forward
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: could not create an engine

This is when calling model(data)

jingxu10 commented 1 year ago

Would you try compiling from source with the latest code, which works with basekit 2023.1 and the latest driver, 602? https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh

charitarthchugh commented 1 year ago

Will try in the coming days.

qzx1013 commented 1 year ago

@jingxu10 I have installed everything that is required, but I get this output from https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py:

Collecting environment information...
PyTorch version: 1.13.0a0+git6c9b55e
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.120+xpu
IPEX commit: https://github.com/intel/intel-extension-for-pytorch/commit/c2a37012e9eeb20c317cfd00583da23ca538d2b2
Build type: Release

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Is XPU available: False
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:

Intel OpenCL ICD version: 23.05.25593.18-60122.04
Level Zero version: 1.3.25593.18-60122.04

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 4950.1948
CPU min MHz: 2200.0000
BogoMIPS: 7399.63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 6 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.120+xpu
[pip3] numpy==1.21.5
[pip3] torch==1.13.0a0+git6c9b55e
[pip3] torchvision==0.14.1a0+5e8e2f1
[conda] intel-extension-for-pytorch 1.13.120+xpu pypi_0 pypi
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 1.13.0a0+git6c9b55e pypi_0 pypi
[conda] torchvision 0.14.1a0+5e8e2f1 pypi_0 pypi

fredlarochelle commented 1 year ago

@qzx1013 Not from Intel, but I might be able to assist you.

Firstly, if your DPCPP runtime version and MKL version show as N/A, it implies that you haven't properly initialized your environment. You need to source both the DPCPP and MKL scripts as follows:

source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh

This needs to be done every time you intend to use IPEX; alternatively, you can append these lines to your .bashrc file. Another option, if you are working in a notebook, is to execute the following as your initial code block:

!source /opt/intel/oneapi/compiler/latest/env/vars.sh
!source /opt/intel/oneapi/mkl/latest/env/vars.sh
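
Whichever way the scripts are sourced, it is worth confirming from Python that the variables reached the process that will import IPEX. In a notebook in particular, each `!` command typically runs in its own subshell, so its exports may not persist in the kernel. A minimal sketch follows, assuming the compiler and MKL scripts export `CMPLR_ROOT` and `MKLROOT` respectively (verify these names against your oneAPI version); `unset_oneapi_vars` is a hypothetical helper.

```python
import os

# Assumed variable names: the compiler and MKL vars.sh scripts are expected to
# export these, but check your oneAPI installation to be sure.
EXPECTED_VARS = ("CMPLR_ROOT", "MKLROOT")


def unset_oneapi_vars(environ=os.environ) -> list[str]:
    """Return the expected oneAPI variables that are missing or empty."""
    return [name for name in EXPECTED_VARS if not environ.get(name)]


missing = unset_oneapi_vars()
if missing:
    print("not set, source the oneAPI vars.sh scripts first:", ", ".join(missing))
else:
    print("oneAPI compiler and MKL environment variables are set")
```

If the variables are missing inside a notebook, starting the notebook server from a shell that has already sourced the scripts is one way to pass them through.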

Secondly, and more significantly, your system doesn't appear to recognize any XPU. Even when I run the collect_env.py script on my system without first activating the environment, the script can still detect the A770.

If you try to run the following, can it detect your GPU?

sudo apt-get install clinfo
clinfo
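
The same check can be scripted. This is a rough sketch that assumes clinfo prints one `Device Type` line per device, which is how its default output usually looks; `gpu_device_lines` is a hypothetical helper.

```python
import shutil
import subprocess


def gpu_device_lines(clinfo_output: str) -> list[str]:
    """Return the "Device Type" lines of clinfo output that mention a GPU."""
    return [
        line.strip()
        for line in clinfo_output.splitlines()
        if line.strip().startswith("Device Type") and "GPU" in line
    ]


if shutil.which("clinfo"):
    # check=False: clinfo may exit non-zero when no platform is found.
    result = subprocess.run(["clinfo"], capture_output=True, text=True, check=False)
    found = gpu_device_lines(result.stdout)
    print(f"{len(found)} OpenCL GPU device(s) reported")
else:
    print("clinfo is not installed")
```

A count of zero points at the driver or runtime installation rather than at PyTorch or IPEX.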

Steve-Tech commented 1 year ago

Not sure if this is part of the same issue, but IPEX only works for me over SSH when there's a logged-in GNOME session; otherwise I get DPCPP Device count is zero!:


```
/home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()
terminate called after throwing an instance of 'c10::Error'
  what(): dpcppSetDevice: device_id is out of range
Exception raised from dpcppSetDevice at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x99 (0x7fb3da1bff69 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd5 (0x7fb3da188cdf in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: xpu::dpcpp::dpcppSetDevice(signed char) + 0x114 (0x7fb30fb9e224 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #3: xpu::dpcpp::set_device(signed char) + 0x20 (0x7fb30fb59bb0 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #4: xpu::dpcpp::impl::DPCPPGuardImpl::uncheckedSetDevice(c10::Device) const + 0xd (0x7fb30fb5d77d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #5: at::AtenIpexTypeXPU::resize_impl(c10::TensorImpl*, c10::ArrayRef, c10::optional >, bool) + 0xb4a (0x7fb30fb8f63a in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #6: at::AtenIpexTypeXPU::impl::empty_strided_dpcpp(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions const&) + 0xcb (0x7fb318c68c2b in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #7: at::AtenIpexTypeXPU::empty_strided(c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0xe3 (0x7fb318c711c3 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #8: + 0x1817a50 (0x7fb30fc17a50 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #9: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0xf8 (0x7fb3c84fdec8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x22064ed (0x7fb3c88064ed in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::empty_strided::call(c10::ArrayRef, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional) + 0x1a6 (0x7fb3c85440c6 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x1573730 (0x7fb3c7b73730 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::_to_copy(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0x112d (0x7fb3c7e7e73d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x237fe4d (0x7fb3c897fe4d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0xf8 (0x7fb3c82331c8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x22067e1 (0x7fb3c88067e1 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0xf8 (0x7fb3c82331c8 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: + 0x343e17d (0x7fb3c9a3e17d in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x343e610 (0x7fb3c9a3e610 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: at::_ops::_to_copy::call(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, c10::optional) + 0x1e5 (0x7fb3c82b5df5 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #21: at::native::to(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) + 0x104 (0x7fb3c7e78234 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x24f7c63 (0x7fb3c8af7c63 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) + 0x1fa (0x7fb3c8412a3a in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #24: + 0x3a3ae9 (0x7fb3d35a3ae9 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x3a3f84 (0x7fb3d35a3f84 in /home/stephen/miniconda3/envs/fastchat310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #26: python() [0x4fde02]
frame #30: python() [0x5080ee]
frame #32: python() [0x5080ee]
frame #34: python() [0x5080ee]
frame #36: python() [0x508246]
frame #38: python() [0x5080ee]
frame #46: python() [0x592592]
frame #48: python() [0x5999cd]
frame #49: python() [0x4fceb4]
frame #54: python() [0x5b56ff]
frame #57: + 0x23a90 (0x7fb3dd223a90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #58: __libc_start_main + 0x89 (0x7fb3dd223b49 in /lib/x86_64-linux-gnu/libc.so.6)
frame #59: python() [0x5854ee]
```

Also, it seems that PyTorch imports (or packages that import PyTorch) placed after the IPEX import can also trigger DPCPP Device count is zero!.
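
One possible explanation for the SSH behaviour, offered as an assumption rather than a confirmed diagnosis, is device-node permissions: on many Linux desktops the login session grants the active local user access to the nodes under `/dev/dri`, while a bare SSH session depends on group membership (often `render` or `video`). A minimal sketch for testing that from the SSH session follows; `inaccessible_render_nodes` is a hypothetical helper.

```python
import glob
import os


def inaccessible_render_nodes(dri_dir: str = "/dev/dri") -> list[str]:
    """Return render nodes under dri_dir that this process cannot read and write."""
    nodes = sorted(glob.glob(os.path.join(dri_dir, "renderD*")))
    return [node for node in nodes if not os.access(node, os.R_OK | os.W_OK)]


blocked = inaccessible_render_nodes()
if blocked:
    print("no read/write access to:", ", ".join(blocked))
else:
    print("all render nodes found (if any) are accessible")
```

If nodes are listed as blocked only when no desktop session is active, comparing the output of `id` and `ls -l /dev/dri` in both situations would show whether group membership is the difference.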

clinfo.txt

collect_env.txt

Edit: With IPEX 2.0.110+xpu I get Segmentation fault (core dumped) with no other information in the same scenarios.

collect_env_torch2.txt

mhoffma commented 1 year ago

I'm hitting a similar problem.

Here is the output of https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py:

tgt.info.txt

Here is the output of clinfo: clinfo.txt

Here is the captured output of python Intel_Extension_For_PyTorch_Hello_World.py > ~/hello_world.txt 2>&1:

hello_world.txt

I get the same warning:

/opt/intel/oneapi/intelpython/latest/envs/pytorch/lib/python3.9/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()

So I'm unsure whether things are working correctly for me. Please take a look, verify that things are as expected, and tell me whether I'm utilizing the AI HW correctly.

Thanks, Marc

jingxu10 commented 1 year ago

You can try explicitly exporting LD_PRELOAD, set to the path of libstdc++.so in your OS or to ${CONDA_PREFIX}/lib/libstdc++.so.
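
The same suggestion can be applied when launching the script from Python. This is a minimal sketch; `env_with_preload` is a hypothetical helper, and the `${CONDA_PREFIX}/lib/libstdc++.so` location is taken from the comment above and may differ on your system.

```python
import os
import sys


def env_with_preload(libstdcxx_path: str, environ=None) -> dict:
    """Copy the environment and put libstdc++ at the front of LD_PRELOAD."""
    env = dict(os.environ if environ is None else environ)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{libstdcxx_path}:{existing}" if existing else libstdcxx_path
    return env


conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
    candidate = os.path.join(conda_prefix, "lib", "libstdc++.so")
    env = env_with_preload(candidate)
    # To actually launch: subprocess.run([sys.executable, "train.py"], env=env)
    print("would run", sys.executable, "train.py with LD_PRELOAD =", env["LD_PRELOAD"])
else:
    print("CONDA_PREFIX is not set; point LD_PRELOAD at your system libstdc++.so instead")
```

LD_PRELOAD has to be set before the Python process starts, which is why the sketch builds an environment for a child process instead of changing `os.environ` in place.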

jingxu10 commented 1 year ago

Please verify if drivers are correctly installed. Your clinfo output doesn't show GPUs.

intel / intel-extension-for-pytorch

RuntimeError: Number of dpcpp devices should be greater than zero! #287