Open JannisFengler opened 1 year ago
Hi @JannisFengler, I was not able to reproduce the error with the latest Docker image.
Could you please try pulling the following image: docker pull intel/intel-extension-for-pytorch:xpu-flex-2.0.110-xpu
More info on how to run the image can be found here: https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.0.110%2Bxpu
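Once the container is running, a quick check that PyTorch can see the GPU may help separate driver problems from training-script problems. The following is a minimal sketch, not an official diagnostic: the helper name `xpu_report` is hypothetical, and it assumes the `torch.xpu` interface (`is_available`, `device_count`, `get_device_name`) that becomes usable after importing `intel_extension_for_pytorch`. If either package is missing, it reports that instead of raising.

```python
def xpu_report():
    """Return a dict describing whether PyTorch can see an XPU device.

    Hypothetical helper for illustration; it never raises, so it can be
    run on hosts or containers where the stack is only partly installed.
    """
    report = {"torch": None, "ipex": None, "xpu_available": False, "devices": []}

    try:
        import torch
    except Exception as exc:  # torch missing or failing to load
        report["error"] = f"torch import failed: {exc}"
        return report
    report["torch"] = str(torch.__version__)

    try:
        # Importing IPEX registers the xpu backend for the +xpu builds.
        import intel_extension_for_pytorch as ipex
        report["ipex"] = str(ipex.__version__)
    except Exception as exc:  # IPEX missing, or its native libraries fail to load
        report["error"] = f"intel_extension_for_pytorch import failed: {exc}"

    xpu = getattr(torch, "xpu", None)
    try:
        if xpu is not None and xpu.is_available():
            report["xpu_available"] = True
            report["devices"] = [
                xpu.get_device_name(i) for i in range(xpu.device_count())
            ]
    except Exception as exc:  # device query itself failed
        report["error"] = f"xpu query failed: {exc}"

    return report


if __name__ == "__main__":
    for key, value in xpu_report().items():
        print(f"{key}: {value}")
```

If `xpu_available` is false inside the container, the device is probably not being passed through (or the driver stack does not match), and the training script is unlikely to be the cause.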
# host
Icon name: computer-desktop
Operating System: Ubuntu 22.04.3 LTS
Kernel: Linux 6.5.0-14-generic
Driver: 775
intel-extension-for-pytorch: 2.1.10+xpu
oneAPI: l_BaseKit_p_2024.0.0.49564.sh
GPU: Arc 750
# container 1
Docker container: intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04
Driver: 775
intel-extension-for-pytorch: 2.1.10+xpu
GPU: Arc 750 & 770
# container 2
Docker container: intel/oneapi-basekit:2024.0.0-devel-ubuntu22.04
Driver: 732
intel-extension-for-pytorch: 2.1.10+xpu
GPU: Arc 750
# container 3
Docker container: intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp
Driver: 682
intel-extension-for-pytorch: 2.1.10+xpu
oneAPI: l_BaseKit_p_2024.0.0.49564.sh
GPU: Arc 750
So a workaround could be to install the driver packages at the versions listed for driver release 682, or, for convenience, to use the Docker image intel/dlstreamer:2023.0.0-ubuntu22-gpu682-dpcpp.
BTW, I have also tested the Docker image suggested above, intel/intel-extension-for-pytorch:xpu-flex-2.0.110-xpu, and it also worked (with driver 647).
# host
Icon name: computer-desktop
Operating System: Ubuntu 22.04.3 LTS
Kernel: Linux 6.5.0-14-generic
Driver: 775 (not runnable, use docker image below instead)
Docker container: intel/intel-extension-for-pytorch:xpu-flex-2.0.110-xpu
intel-extension-for-pytorch: 2.0.110+xpu
Driver (in container): 647
GPU: Arc 750
Describe the bug
I am encountering a consistent segmentation fault during the training of machine learning models using the Intel ARC 770 GPU. The fault appears after a certain number of training steps and is noticeably more frequent when working with larger models or datasets.
Error
Epoch 62, Loss: 0.9934, Time: 0.08 seconds
Epoch 63, Loss: 0.9946, Time: 0.08 seconds
Traceback (most recent call last):
  File "data/dummy_intel.py", line 97, in
    avg_loss = train_one_epoch(train_loader, model, criterion, optimizer, device)
  File "data/dummy_intel.py", line 75, in train_one_epoch
    avg_loss += loss.item()
RuntimeError: Native API failed. Native API returns: -1 (CL_DEVICE_NOT_FOUND) -1 (CL_DEVICE_NOT_FOUND)
Segmentation fault
Versions
Intel XPU Docker image 2d2a3356c190 in WSL Windows 11
(But I got the same error in native Ubuntu)
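Since the outcome in this thread seems to depend on the exact driver and package combination, it may help to attach a precise version listing to reports like this one. Below is a minimal sketch using only the Python standard library; the function name `collect_versions` is hypothetical, and the default package names are assumptions about how the wheels are registered. It does not detect the GPU driver version, which still has to be read from the system's package manager.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "intel_extension_for_pytorch")):
    """Collect OS, kernel, Python, and package versions for a bug report."""
    info = {
        "os": platform.platform(),
        "kernel": platform.release(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running this both on the host and inside each container gives directly comparable listings.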