Segmentation fault on WSL2

ryotatomioka commented 1 year ago

I followed the instructions on this page (Enable PyTorch with DirectML on WSL 2) and got a segmentation fault.

$  python
Python 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_directml
>>> dml = torch_directml.device()
>>> dml
device(type='privateuseone', index=0)
>>> tensor1 = torch.tensor([1]).to(dml)
Segmentation fault

Device details:

Surface Laptop 4
Windows 11 Enterprise (version: 10.0.22621 Build 226221)
CPU: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz 3.00 GHz
Memory: 32.0 GB (31.8 GB usable)
Graphics: Intel Iris Xe Graphics (driver version 27.20.100.9268)

WSL details:

> wsl --version
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.963

Python environment details: Python 3.10 torch==1.13.1 torch-directml==0.1.13.dev221216

pip list:

Package                  Version
------------------------ ----------------
absl-py                  1.3.0
aiohttp                  3.8.3
aiosignal                1.2.0
async-timeout            4.0.2
attrs                    22.1.0
blinker                  1.4
Bottleneck               1.3.5
brotlipy                 0.7.0
cachetools               4.2.2
certifi                  2022.12.7
cffi                     1.15.1
charset-normalizer       2.0.4
click                    8.0.4
contourpy                1.0.5
cryptography             38.0.1
cycler                   0.11.0
flit_core                3.6.0
fonttools                4.25.0
frozenlist               1.3.3
google-auth              2.6.0
google-auth-oauthlib     0.4.4
grpcio                   1.42.0
idna                     3.4
kiwisolver               1.4.4
Markdown                 3.4.1
MarkupSafe               2.1.1
matplotlib               3.6.2
mkl-fft                  1.3.1
mkl-random               1.2.2
mkl-service              2.4.0
multidict                6.0.2
munkres                  1.1.4
numexpr                  2.8.4
numpy                    1.23.5
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
oauthlib                 3.2.1
opencv-python            4.7.0.68
packaging                22.0
pandas                   1.5.2
Pillow                   9.3.0
pip                      22.3.1
ply                      3.11
protobuf                 3.20.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.21
PyJWT                    2.4.0
pyOpenSSL                22.0.0
pyparsing                3.0.9
PyQt5-sip                12.11.0
PySocks                  1.7.1
python-dateutil          2.8.2
pytz                     2022.7
PyYAML                   6.0
requests                 2.28.1
requests-oauthlib        1.3.0
rsa                      4.7.2
setuptools               65.5.0
sip                      6.6.2
six                      1.16.0
tensorboard              2.10.0
tensorboard-data-server  0.6.1
tensorboard-plugin-wit   1.8.1
toml                     0.10.2
torch                    1.13.1
torch-directml           0.1.13.dev221216
torchvision              0.14.1
tornado                  6.2
tqdm                     4.64.1
typing_extensions        4.4.0
urllib3                  1.26.13
Werkzeug                 2.2.2
wget                     3.2
wheel                    0.37.1
yarl                     1.8.1

Looong01 commented 1 year ago

I do not have this problem:

Python 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_directml
>>> dml = torch_directml.device()
>>> tensor1 = torch.tensor([1]).to(dml)
>>> print(tensor1)
tensor([1], device='privateuseone:0')

My GPU is NVIDIA's. Maybe you can try upgrading torch-directml to version 0.1.13.1.dev230119 and updating your GPU driver.

valleyUp commented 1 year ago

I noticed you have the nvidia-XXX packages installed. Maybe you can try "conda install pytorch cpuonly -c pytorch" first, and then "pip install torchvision" or "conda install torchvision -c pytorch".
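
To see which builds are actually installed before reinstalling anything, a small check like the one below may help. It is only a sketch: the helper names `installed_versions` and `cuda_wheels` are made up for illustration, and it assumes the nvidia-* wheels were pulled in as dependencies of a CUDA build of torch, which the pip list above suggests but does not prove.

```python
from importlib import metadata

# Packages relevant to this issue (names as they appear in the pip list above).
WATCHED = ("torch", "torch-directml", "torchvision")


def installed_versions(names=WATCHED):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_wheels():
    """List installed distributions whose name starts with 'nvidia-'."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or ""
        if name.lower().startswith("nvidia-"):
            found.append(f"{name}=={dist.version}")
    return sorted(found)


print(installed_versions())
print(cuda_wheels())
```

If `cuda_wheels()` returns a non-empty list, the environment probably has a CUDA build of torch rather than a CPU-only one.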

zhangxiang1993 commented 1 year ago

Hi @ryotatomioka, Thanks for reporting this. Can you provide us your driver version?

Looong01 commented 1 year ago

> Hi @ryotatomioka, Thanks for reporting this. Can you provide us your driver version?

What do you mean by "driver version"? Do you mean the graphics card's driver version?

My GPU is an AMD RX 6700 XT.

The driver version is 22.5.1.

versions:

The driver's version: 21.50.21.11-220428a-382767C-AMD-Software-Adrenalin-Edition
AMD Windows driver version: 30.0.15021.11005
D3D driver version: 9.14.10.01520

zhangxiang1993 commented 1 year ago

Hi @Looong01, thanks for clarifying my question. Yes, I was asking about the GPU driver version.

I missed the info (Graphics: Intel Iris Xe Graphics, driver version 27.20.100.9268) in the original post.

Please try updating the GPU driver. Most likely we don't support this GPU, because pytorch-directml builds on DirectML, which has its own requirements for Intel GPUs.

You can verify that by running:

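import torch
import torch_directml

# Print the results so this also works when run as a script, not only in the REPL.
print(torch_directml.is_available())   # False: GPU or driver not supported
print(torch_directml.device_count())   # 0: no usable DirectML device found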

If torch_directml.is_available() returns False or torch_directml.device_count() returns 0, then either the GPU or the GPU driver is not supported.
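
A slightly fuller version of the same check, as a sketch: it also lists the adapter names, which helps when more than one device is reported. The function name `describe_directml` is made up for illustration, and the code assumes `torch_directml.device_name(index)` is present in the installed build.

```python
import importlib


def describe_directml():
    """Summarise what torch_directml reports, without touching any tensors."""
    try:
        importlib.import_module("torch")
        torch_directml = importlib.import_module("torch_directml")
    except ImportError as exc:
        # torch or torch-directml is not installed in this environment.
        return {"installed": False, "error": str(exc)}

    count = torch_directml.device_count()
    return {
        "installed": True,
        "available": torch_directml.is_available(),
        "device_count": count,
        # One name per adapter index, in the order DirectML enumerates them.
        "device_names": [torch_directml.device_name(i) for i in range(count)],
    }


print(describe_directml())
```

Posting that dictionary together with the driver version should make it easier to tell which adapter is being used.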

tylertitsworth commented 1 year ago

I'm seeing the same error here with a Radeon RX 6700 XT.

$ neofetch
            .-/+oossssoo+/-.               xxxx@xxxxxxxx
        `:+ssssssssssssssssss+:`           ---------------
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.1 LTS on Windows 10 x86_64
    .ossssssssssssssssssdMMMNysssso.       Kernel: 5.15.90.1-microsoft-standard-WSL2
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Uptime: 1 hour, 37 mins
  +ssssssssshmydMMMMMMMNddddyssssssss+     Packages: 1250 (dpkg)
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Shell: bash 5.1.16
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Adwaita [GTK3]
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Icons: Adwaita [GTK3]
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Terminal: Windows Terminal
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: 12th Gen Intel i9-12900K (24) @ 3.187GHz
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   GPU: ea92:00:00.0 Microsoft Corporation Device 008e
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Memory: 2751MiB / 15887MiB
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.

❯ wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.1702

$ python
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_directml
>>> torch_directml.is_available()
True
>>> torch_directml.device_count()
2

$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
filelock==3.9.0
idna==3.4
Jinja2==3.1.2
MarkupSafe==2.1.2
mpmath==1.2.1
networkx==3.0
numpy==1.24.1
Pillow==9.3.0
requests==2.28.1
sympy==1.11.1
torch==2.0.0+cpu
torch-directml==0.2.0.dev230426
torchaudio==2.0.0+cpu
torchvision==0.15.1
typing_extensions==4.4.0
urllib3==1.26.13

$ python
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_directml
>>> dml = torch_directml.device()
>>> tensor1 = torch.tensor([1]).to(dml)
Segmentation fault
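
Since device_count() reports 2 here, it may be worth checking whether every device index crashes or only one of them. The following is a rough sketch of how that could be tested: each probe runs in a child process so a segfault does not take down the calling interpreter. `probe_device` and `classify` are hypothetical helper names, and the sketch assumes `torch_directml.device()` accepts a device index.

```python
import signal
import subprocess
import sys

# Code run in the child process: the same repro as above, for one device index.
PROBE = """
import torch, torch_directml
dml = torch_directml.device({index})
print(torch.tensor([1]).to(dml))
"""


def probe_device(index, timeout=120):
    """Run the repro for one device index in a separate Python process."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE.format(index=index)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, (result.stdout or result.stderr).strip()


def classify(returncode):
    """Turn a child return code into a short description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX a negative return code means the child died from that signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"
```

Calling `classify(probe_device(i)[0])` for each `i` in `range(torch_directml.device_count())` would then show which indices crash. A result of "killed by SIGSEGV" for only one index would point at that specific adapter rather than at the package as a whole.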

I also can't get an example to work:

# I have already installed the requirements.txt file and ran dataset.py
$ python PyTorch/1.8/resnet50/train.py
/home/xxxx/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Traceback (most recent call last):
  File "/home/xxxx/DirectML/PyTorch/1.8/resnet50/train.py", line 35, in <module>
    main()
  File "/home/xxxx/DirectML/PyTorch/1.8/resnet50/train.py", line 30, in main
    train(args.path, args.batch_size, args.epochs, args.learning_rate,
  File "/home/xxxx/DirectML/PyTorch/1.8/classification/train_classification.py", line 111, in main
    model = get_model(model_str, device)
  File "/home/xxxx/DirectML/PyTorch/1.8/classification/test_classification.py", line 76, in get_model
    model = models.resnet50(num_classes=10).to(device)
  File "/home/xxxx/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1126, in to
    device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, **kwargs)
RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: dml
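
My guess at what is going on in this traceback: the PyTorch/1.8 sample passes the string "dml" as the device, and the torch build I have installed does not accept that string. With the torch-directml package the device object comes from torch_directml.device() instead (it shows up as privateuseone). A small workaround sketch, where `resolve_device` is a helper name I made up and is not part of the samples:

```python
def resolve_device(name="dml"):
    """Return something that can be passed to .to() for the given device name."""
    if name == "dml":
        # Assumption: torch-directml is installed; its device is a
        # 'privateuseone' torch.device rather than the string "dml".
        import torch_directml

        return torch_directml.device()

    import torch

    return torch.device(name)


# Hypothetical use in the sample's get_model():
# model = models.resnet50(num_classes=10).to(resolve_device(device))
```

I have not verified that the rest of the 1.8 sample runs after this change.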

Is there a stable, recommended version of torch-directml, and if so, is it pinned in a requirements.txt file somewhere?

microsoft / DirectML

Segmentation fault on WSL2 #378