Open axel588 opened 1 year ago
I updated to PyTorch 1.13.1 instead of 1.13.0a0 (the documentation should be updated), and now I get this error:
Traceback (most recent call last):
File "/home/axel/ai_tr/cod/train.py", line 2, in <module>
import intel_extension_for_pytorch as ipex
File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/__init__.py", line 2, in <module>
from . import cpu
File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/__init__.py", line 2, in <module>
from . import runtime
File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>
from .multi_stream import MultiStreamModule, get_default_num_streams, \
File "/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in <module>
import intel_extension_for_pytorch._C as core
ImportError: /home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
Even stranger are the packages it pulled in:
(ai_tr) axel@Artishima:~/ai_tr/cod$ python -m pip install torch==1.13.1 -f https://developer.intel.com/ipex-whl-stable-xpu
Looking in links: https://developer.intel.com/ipex-whl-stable-xpu
DEPRECATION: The HTML index page being used (https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu.html) is not a proper HTML 5 document. This is in violation of PEP 503 which requires these pages to be well-formed HTML 5 documents. Please reach out to the owners of this index page, and ask them to update this index page to a valid HTML 5 document. pip 22.2 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/10825
Collecting torch==1.13.1
Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.5/887.5 MB 2.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 65.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 6.5 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11==11.7.99
Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 KB 46.9 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66
Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 18.9 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /home/axel/ai_tr/lib/python3.10/site-packages (from torch==1.13.1) (4.4.0)
Collecting wheel
Downloading wheel-0.38.4-py3-none-any.whl (36 kB)
Requirement already satisfied: setuptools in /home/axel/ai_tr/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1) (59.6.0)
Installing collected packages: wheel, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch
Attempting uninstall: torch
Found existing installation: torch 1.13.0a0+gitb1dde16
Uninstalling torch-1.13.0a0+gitb1dde16:
Successfully uninstalled torch-1.13.0a0+gitb1dde16
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 wheel-0.38.4
I reinstalled PyTorch from the wheel:
pip install torch-1.13.0a0+gitb1dde16-cp310-cp310-linux_x86_64.whl
Still the same error: number of dpcpp devices ..
@sanchitintel Would you help to check this issue?
Please install the torch 1.13.0 wheel provided on the page below: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v1.13.10%2Bxpu
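A possible explanation for the undefined symbol in the first traceback, offered as an assumption rather than something confirmed in this thread: the missing symbol carries the `B5cxx11` tag, which is how `[abi:cxx11]` is mangled, so the extension library may expect a PyTorch built with the C++11 ABI while the installed wheel was built without it. The helper name `has_cxx11_abi_tag` below is hypothetical; `torch.compiled_with_cxx11_abi()` is an existing PyTorch call.

```python
CXX11_ABI_TAG = "B5cxx11"  # Itanium mangling of the [abi:cxx11] attribute


def has_cxx11_abi_tag(mangled_symbol):
    """Return True if the mangled symbol name carries the C++11 ABI tag."""
    return CXX11_ABI_TAG in mangled_symbol


def torch_abi_report():
    """Describe which ABI the installed torch was built with, if torch imports."""
    try:
        import torch
    except (ImportError, OSError):
        return "torch could not be imported in this environment"
    return f"torch built with CXX11 ABI: {torch.compiled_with_cxx11_abi()}"


# Symbol copied from the ImportError reported above.
symbol = "_ZNK5torch8autograd4Node4nameB5cxx11Ev"
print("symbol expects CXX11 ABI:", has_cxx11_abi_tag(symbol))
print(torch_abi_report())
```

If the symbol expects the C++11 ABI and the installed torch reports `False`, reinstalling the torch wheel from the release page linked above would be the first thing to try.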
@axel588 could you tell us what else you are importing before the line m = model.to(device)? If you are importing matplotlib.pyplot before it, try removing that import.
Hi @axel588, can you please share your full script?
Please ensure that the lines import torch and import intel_extension_for_pytorch are present before any other module's imports in your Python script. We encountered a similar issue and have been using this workaround.
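The import-order workaround above can be checked mechanically. This is a minimal sketch using only the standard `ast` module; the helper names `imported_modules` and `imports_in_recommended_order` are hypothetical and not part of IPEX.

```python
import ast

# Order recommended in the comment above.
REQUIRED_FIRST = ["torch", "intel_extension_for_pytorch"]


def imported_modules(source):
    """Return top-level module names in the order the script imports them."""
    names = []
    for node in ast.parse(source).body:  # module-level statements only
        if isinstance(node, ast.Import):
            names.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module.split(".")[0])
    return names


def imports_in_recommended_order(source):
    """True if torch and intel_extension_for_pytorch are the first two imports."""
    return imported_modules(source)[:2] == REQUIRED_FIRST


good = (
    "import torch\n"
    "import intel_extension_for_pytorch as ipex\n"
    "import matplotlib.pyplot as plt\n"
)
bad = (
    "import matplotlib.pyplot as plt\n"
    "import torch\n"
    "import intel_extension_for_pytorch as ipex\n"
)
print(imports_in_recommended_order(good))  # True
print(imports_in_recommended_order(bad))   # False
```

To check a real script, pass its contents, for example `imports_in_recommended_order(open("train.py").read())`. Imports done indirectly by other packages are not detected by this sketch.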
I've seen the same error under WSL2 + Ubuntu 22.04 (Arc A770) with the GPU installation. Most recently, I tried the official releases listed above for Python 3.9.x and 3.10.x.
In addition, running on the CPU returns a warning about the number of dpcpp devices. It also just crashed my machine, perhaps because available memory was exhausted.
Check that you have resizable BAR enabled in the system BIOS
I am facing the same issue. I have checked whether ReBAR is enabled using lspci -v. The code I am using is the GPU starter code from the README, and I am using the recommended packages from the README as well.
System details:
OS: Fedora 38
CPU: i7-12700H
GPU: Arc A370M
The appropriate torchvision package (0.14.0) was installed with --no-check-deps.
Could you run https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py and share the outputs?
Sure! here is the output: https://pastebin.com/ZFcLesPB
Please use oneAPI Base Toolkit 2023.0 with 1.13.10. Also, it seems you don't have Level Zero installed; please install it as well. The driver version should preferably be 540, as shown in the installation guide.
DPCPP runtime version: 2023.1.0 <====================
MKL version: 2023.1.0 <===========================
GPU models and configuration:
Intel OpenCL ICD version: 23.05.25593.18-1.fc38
Level Zero version: N/A <==========================
dnf/yum install -y intel-opencl level-zero intel-level-zero-gpu
By the way, we will have a new release soon, so you may be able to try the new version directly.
I am trying to install level-zero. There seem to be no repos for Fedora, so could I install the RHEL one?
Since Fedora is not in the list of our verified OSs, we unfortunately cannot provide support. Personally, though, I would recommend trying the RHEL one.
After installing the RHEL packages for level-zero, I am faced with the following error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 658, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 277, in __call__
raise e
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/fx/graph_module.py", line 267, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "<eval_with_key>.2", line 5, in forward
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/cc/Dev/IdeaProjects/test/intel-pyt/venv/lib64/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: could not create an engine
This happens when calling model(data).
Could you try compiling from source with the latest code, which works with Base Toolkit 2023.1 and the latest driver (602)? https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh
Will try in the coming days.
@jingxu10 I have installed everything that is required, but I get this output from https://github.com/intel/intel-extension-for-pytorch/raw/master/scripts/collect_env.py:
Collecting environment information...
PyTorch version: 1.13.0a0+git6c9b55e
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.120+xpu
IPEX commit: https://github.com/intel/intel-extension-for-pytorch/commit/c2a37012e9eeb20c317cfd00583da23ca538d2b2
Build type: Release
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Is XPU available: False
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
Intel OpenCL ICD version: 23.05.25593.18-60122.04
Level Zero version: 1.3.25593.18-60122.04
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 4950.1948
CPU min MHz: 2200.0000
BogoMIPS: 7399.63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 6 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.120+xpu
[pip3] numpy==1.21.5
[pip3] torch==1.13.0a0+git6c9b55e
[pip3] torchvision==0.14.1a0+5e8e2f1
[conda] intel-extension-for-pytorch 1.13.120+xpu pypi_0 pypi
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 1.13.0a0+git6c9b55e pypi_0 pypi
[conda] torchvision 0.14.1a0+5e8e2f1 pypi_0 pypi
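A report like the one above can be scanned for the fields that usually matter for XPU detection. This is a minimal sketch, assuming the report has one "key: value" entry per line; the function name `missing_components` and the choice of keys are hypothetical and not part of collect_env.py.

```python
# Fields that, in this thread, were pointed out when they showed N/A or False.
KEYS_TO_CHECK = (
    "Is XPU available",
    "DPCPP runtime version",
    "MKL version",
    "Level Zero version",
)


def missing_components(report, keys=KEYS_TO_CHECK):
    """Return the checked keys whose value is 'N/A' or 'False'."""
    missing = []
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in keys and value.strip() in ("N/A", "False"):
            missing.append(key.strip())
    return missing


# Excerpt taken from the report above.
sample_report = """\
Is XPU available: False
DPCPP runtime version: N/A
MKL version: N/A
Level Zero version: 1.3.25593.18-60122.04
"""
print(missing_components(sample_report))
```

An empty list does not prove the setup works; it only means none of the checked fields is reported as missing.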
@qzx1013 Not from Intel, but I might be able to assist you.
Firstly, if your DPCPP runtime version and MKL version are showing as N/A, it implies that you haven't properly initialized your environment. You need to source both the DPCPP and MKL environment scripts as follows:
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
This needs to be done every time you intend to use IPEX; alternatively, you can append these lines to your .bashrc file. Another option, if you are working in a notebook, is to execute the following as your initial code block:
!source /opt/intel/oneapi/compiler/latest/env/vars.sh
!source /opt/intel/oneapi/mkl/latest/env/vars.sh
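One caveat: each `!` command in a notebook typically runs in its own subshell, so variables set by `source` may not carry over to the Python kernel. A possible alternative is to source the script in a child shell and copy the resulting environment back. This is a sketch, assuming `bash` and an `env` that supports `-0` are available; the helper name `env_after_source` is hypothetical. Variables that the dynamic loader reads only at process start (such as LD_LIBRARY_PATH) may still require starting the notebook from a shell where the scripts were already sourced.

```python
import os
import subprocess

ONEAPI_SCRIPTS = [
    "/opt/intel/oneapi/compiler/latest/env/vars.sh",
    "/opt/intel/oneapi/mkl/latest/env/vars.sh",
]


def env_after_source(script_path, shell="bash"):
    """Return the environment a child shell has after sourcing script_path."""
    # The script's own stdout is discarded; 'env -0' then prints NUL-separated
    # entries so that values containing newlines are parsed correctly.
    command = 'source "$1" >/dev/null && env -0'
    result = subprocess.run(
        [shell, "-c", command, shell, script_path],
        check=True,
        stdout=subprocess.PIPE,
    )
    env = {}
    for entry in result.stdout.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode()] = value.decode(errors="replace")
    return env


# Apply the oneAPI variables to the current process, if the scripts exist.
for script in ONEAPI_SCRIPTS:
    if os.path.exists(script):
        os.environ.update(env_after_source(script))
```

Run this before importing torch and intel_extension_for_pytorch so that anything read from `os.environ` at import time sees the updated values.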
Secondly, and more significantly, your system doesn't appear to recognize any XPU. Even when I run the collect_env.py script on my system without first activating the environment, the script can still detect the A770.
If you run the following, can it detect your GPU?
sudo apt-get install clinfo
clinfo
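The clinfo check can also be scripted. This is a sketch under the assumption that clinfo prints "Device Name" before "Device Type" for each device, as its default human-readable output usually does; the helper name `gpu_names` is hypothetical.

```python
import shutil
import subprocess


def gpu_names(clinfo_text):
    """Return the names of devices that clinfo reports with a GPU device type."""
    names = []
    current_name = None
    for line in clinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Device Name"):
            current_name = stripped[len("Device Name"):].strip()
        elif stripped.startswith("Device Type") and "GPU" in stripped:
            names.append(current_name or "<unnamed device>")
    return names


# Illustrative sample text, not output captured from anyone in this thread.
sample_output = """\
  Device Name                                     Example GPU
  Device Type                                     GPU
  Device Name                                     Example CPU
  Device Type                                     CPU
"""
print(gpu_names(sample_output))

# If clinfo is installed, inspect the real output as well.
if shutil.which("clinfo"):
    completed = subprocess.run(["clinfo"], capture_output=True, text=True)
    print(gpu_names(completed.stdout))
```

An empty list from the real output would match the "clinfo output doesn't show GPUs" situation and points at the driver installation rather than at IPEX.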
Not sure if this is part of the same issue, but IPEX only works for me over SSH when there's a logged-in GNOME session; otherwise I get DPCPP Device count is zero!
Also, it seems that PyTorch imports (or packages that import PyTorch) after the IPEX import can also trigger DPCPP Device count is zero!
Edit: With IPEX 2.0.110+xpu I get Segmentation fault (core dumped) with no other information in the same scenarios.
I'm hitting a similar problem.
Here is the output of https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py
Here is the output of clinfo: clinfo.txt
Here is the captured output of python Intel_Extension_For_PyTorch_Hello_World.py > ~/hello_world.txt 2>&1
I get the same warning:
/opt/intel/oneapi/intelpython/latest/envs/pytorch/lib/python3.9/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
_C._initExtension()
So I'm totally unsure whether things are working correctly for me. Please take a look, verify whether this is expected, and tell me if I'm utilizing the AI HW correctly.
Thanks, Marc
You can try explicitly exporting LD_PRELOAD, set to the libstdc++.so path in your OS or to ${CONDA_PREFIX}/lib/libstdc++.so.
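Since LD_PRELOAD is read when a process starts, it has to be set before Python launches. One way to sketch this is a small launcher that builds the environment and starts the script in a child process. The helper names `find_conda_libstdcxx` and `env_with_preload` are hypothetical, and the script name train.py is taken from the original report.

```python
import os
import subprocess
import sys


def find_conda_libstdcxx(env=None):
    """Return ${CONDA_PREFIX}/lib/libstdc++.so if it exists, else None."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return None
    candidate = os.path.join(prefix, "lib", "libstdc++.so")
    return candidate if os.path.exists(candidate) else None


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


if __name__ == "__main__":
    lib = find_conda_libstdcxx()
    if lib is None:
        print("No libstdc++.so found under CONDA_PREFIX; nothing to preload")
    elif os.path.exists("train.py"):
        # Launch the training script with the library preloaded.
        subprocess.run([sys.executable, "train.py"], env=env_with_preload(lib))
```

Outside conda, pass the system libstdc++.so path to `env_with_preload` directly; its location depends on the distribution.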
Please verify that the drivers are correctly installed. Your clinfo output doesn't show GPUs.
Hello, I used the GPU configuration and oneAPI is installed correctly. I am in a Python virtual environment, ai_tr. I have this issue with PyTorch; the two imports are at the top of the file:
import torch
import intel_extension_for_pytorch as ipex
I ran: source ${ONEAPI_ROOT}/setvars.s with output :
With --force :
But this error keeps appearing whenever I try to run my training Python file:
xpu
/home/axel/ai_tr/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
_C._initExtension()
/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py:985: UserWarning: dpcppSetDevice: device_id is out of range (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159.)
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
Traceback (most recent call last):
File "/home/axel/ai_tr/cod/train.py", line 190, in <module>
m = model.to(device)
File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
return self._apply(convert)
File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
param_applied = fn(param)
File "/home/axel/ai_tr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Number of dpcpp devices should be greater than zero!
Everything related to MKL is installed correctly, and the paths are set correctly and working. I am on Ubuntu 22.04 using torch 1.13.1 on WSL 2 on Windows 11, with the Intel drivers installed on Windows 11, on an Arc A770 with an i9 13900K.
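While the root cause is being tracked down, the model.to(device) call can be guarded so the script does not abort when no XPU is visible. This is a minimal sketch: the helper name `select_device` is hypothetical, and it relies on `torch.xpu.is_available()` and `torch.xpu.device_count()`, which IPEX exposes once intel_extension_for_pytorch has been imported. A stand-in object is used here so the example runs without a GPU.

```python
from types import SimpleNamespace


def select_device(torch_module):
    """Return 'xpu' only if the given torch module reports a usable XPU."""
    xpu = getattr(torch_module, "xpu", None)
    if xpu is not None and xpu.is_available() and xpu.device_count() > 0:
        return "xpu"
    return "cpu"


# Stand-ins that mimic torch with and without a visible XPU device.
fake_without_xpu = SimpleNamespace(
    xpu=SimpleNamespace(is_available=lambda: False, device_count=lambda: 0)
)
fake_with_xpu = SimpleNamespace(
    xpu=SimpleNamespace(is_available=lambda: True, device_count=lambda: 1)
)
print(select_device(fake_without_xpu))  # cpu
print(select_device(fake_with_xpu))     # xpu

# Real usage, assuming both packages import cleanly:
#   import torch
#   import intel_extension_for_pytorch as ipex
#   device = select_device(torch)
#   m = model.to(device)
```

This only avoids the crash; falling back to "cpu" does not fix the device detection itself.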
The error is triggered here: