microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

nvidia-smi segmentation fault in wsl2 but not in Windows #11277

Open · themizzi opened this issue 4 months ago

themizzi commented 4 months ago

Windows Version

10.0.22631.3235

WSL Version

2.1.4.0

Are you using WSL 1 or WSL 2?

WSL 2

Kernel Version

5.15.146.1-2

Distro Version

Ubuntu 22.04

Other Software

GeForce GTX 1650 Ti with GeForce Game Ready Driver version 551.76

Repro Steps

Run nvidia-smi in Windows and get the following:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.76                 Driver Version: 551.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650 Ti   WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   63C    P8              3W /   50W |     163MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3268    C+G   ...ekyb3d8bbwe\WsaClient\WsaClient.exe      N/A      |
|    0   N/A  N/A     18112    C+G   ...ience\NVIDIA GeForce Experience.exe      N/A      |
+-----------------------------------------------------------------------------------------+

Run nvidia-smi in WSL2 Ubuntu and get the following:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 551.76       CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
[1]    2058 segmentation fault  nvidia-smi

Expected Behavior

I expect nvidia-smi to complete without a segmentation fault and produce the full output in WSL 2, as it does on Windows.

Actual Behavior

I get a segmentation fault in WSL2 as described above.

Diagnostic Logs

No response

kziemski commented 1 month ago

I'm wondering if it's a question of NVIDIA/Microsoft maintaining support for the current Ubuntu version, and whether Ubuntu 24.04 in WSL works with NVIDIA driver versions past 538, because that would mean support for CUDA past 12.2. I need to get past 12.2 in order to match compatibility with JAX (which needs CUDA >= 12.3) and other libraries.

If it's the case that WSL2 / Ubuntu 24.04 / driver 55x works, then I can think about transitioning from Ubuntu 22.04 to 24.04.
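
For reference, a quick way to check whether JAX actually sees the GPU under a given driver (a sketch, assuming a CUDA-enabled jax install):

import jax

# Lists CudaDevice entries if the GPU is visible; falls back to CPU devices otherwise.
print(jax.devices())
# Forces an actual kernel launch on the default device.
print(jax.numpy.arange(8) * 2)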

jsvetrifork commented 1 month ago

Hi, I'm also having this issue under WSL2. I have tried with Ubuntu 22.04 and 24.04 and the result is the same. I've tried downgrading my driver to 538, but my GPU (RTX 4050) is not compatible with it, as that driver is too old. Any ideas how I can make this work? I really need to be able to use the GPU inside WSL.

mjjank commented 1 month ago

Same here with WSL2 Ubuntu 22.04:

Fri Jun 21 13:11:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
Segmentation fault

Windows 11:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99                 Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 2000 Ada Gene...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   43C    P3             11W /   43W |       0MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
gnaaromat commented 4 weeks ago

@mjjank this is already "solved" by downgrading the driver version; see above.

@jsvetrifork 537, not 538 - the latter is the first version to break on WSL2. If your GPU doesn't support the older drivers, you're probably out of luck and would have to run your workloads in Docker on Windows instead. That's how I run higher driver versions these days (so no WSL for this). Granted, if this is not a laptop, you'd be hard pressed to run anything but small university sample code on it: Windows takes ~20% of the VRAM of any GPU that has a monitor attached.
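
A sketch of that Docker route, assuming Docker Desktop with GPU support enabled (the CUDA image tag is just an example):

docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If that prints the usual table, the GPU is reachable from the container runtime.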

mjjank commented 4 weeks ago

@jsvetrifork I can confirm that 538 does not solve the problem. What is interesting is that, even though nvidia-smi segfaults, I can still use the GPU from Python:

In [21]: from numba import cuda

In [22]: device=cuda.select_device(0)

In [23]: device.name
Out[23]: b'NVIDIA RTX 2000 Ada Generation Laptop GPU'

stadlerb commented 3 weeks ago

The segmentation fault appears to happen in NVML (libnvidia-ml.so.1):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 552.55         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff66809a1 in ?? () from /usr/lib/wsl/drivers/nvltwi.inf_amd64_53dae1bddc8c687f/libnvidia-ml.so.1
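
For reference, a backtrace like this one can be captured by running nvidia-smi under gdb inside the distro (assuming gdb is installed):

gdb -batch -ex run -ex bt --args nvidia-smi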

As a smoke test, I tried generating some text using an HF Transformers language model. It didn't crash and the output appeared fine.
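
A sketch of that kind of smoke test (the model name is just an example):

from transformers import pipeline

# device=0 places the model on the first CUDA GPU.
generator = pipeline("text-generation", model="gpt2", device=0)
print(generator("Testing the GPU under WSL2:", max_new_tokens=20)[0]["generated_text"])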

As long as you don't need NVML, you might be fine. From the stack trace above, it's not clear whether the problem is on the nvidia-smi side or the libnvidia-ml.so.1 side.

The Windows version of nvidia-smi can be run as nvidia-smi.exe from within WSL. You can only see the total GPU utilization, though; the per-process breakdown is not displayed:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.55                 Driver Version: 552.55         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T500                  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   59C    P0             N/A / ERR!  |    1563MiB /   4096MiB |     93%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Has anybody tried some other NVML-based monitoring tool that worked under the older driver versions?
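
For anyone who wants to try, here is a minimal NVML probe using the official Python bindings (pip install nvidia-ml-py); since it goes through the same libnvidia-ml.so.1, it may well hit the same crash:

import pynvml

# Initialize NVML and query the first GPU; any of these calls may segfault
# if the underlying libnvidia-ml.so.1 is the broken component.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("name:", pynvml.nvmlDeviceGetName(handle))
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"memory: {mem.used / 2**20:.0f} MiB / {mem.total / 2**20:.0f} MiB")
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"gpu util: {util.gpu}%")
pynvml.nvmlShutdown()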

stadlerb commented 3 weeks ago

Other sub-commands like nvidia-smi dmon work. The command nvidia-smi pmon doesn't crash, but it only repeats the following line, even when there are active processes:

    0          -     -      -      -      -      -      -      -      -      -    -

The command nvidia-smi -q -d MEMORY,UTILIZATION,ECC,TEMPERATURE,POWER,CLOCK,COMPUTE,PIDS,PERFORMANCE,SUPPORTED_CLOCKS,PAGE_RETIREMENT,ACCOUNTING,ENCODER_STATS,SUPPORTED_GPU_TARGET_TEMP,VOLTAGE,FBC_STATS,ROW_REMAPPER,RESET_STATUS,GSP_FIRMWARE_VERSION works without crashing.

Without the -d option, nvidia-smi -q crashes after the line Product Architecture : Turing, at the position where nvidia-smi.exe -q outputs the line Display Mode : Disabled.

kziemski commented 3 days ago

Are there any WSL people even in this thread? I'd love to be able to update my GPU drivers at some point. Is this NVIDIA's problem or Microsoft's problem?

elsaco commented 3 days ago

In my experience, it's hit or miss depending on the NVIDIA driver version. With wsl-2.3.11 and nvidia-555.99 it works:

elsaco@eleven:~$ wslinfo --wsl-version
2.3.11
elsaco@eleven:~$ nvidia-smi
Fri Jul 19 13:41:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0  On |                  N/A |
|  0%   43C    P8              3W /  220W |     576MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

However, there were times when nvidia-smi would segfault. It only takes one update for it to fail again!

AlexTo commented 2 days ago

@kziemski I've been using WSL2 with 54x and 55x versions. I can run PyTorch with CUDA, NVIDIA Container Toolkit, etc. inside my WSL2 Ubuntu without any issues. My code can utilize CUDA normally. I think if you don't care about nvidia-smi giving errors, you can try upgrading the drivers to see if it has any impact on your code.
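
A minimal PyTorch sanity check along those lines (a sketch, assuming a CUDA-enabled torch build):

import torch

# Exercises the CUDA runtime directly rather than NVML/nvidia-smi.
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    x = torch.ones(1024, 1024, device="cuda")
    # A real kernel launch; every entry of x @ x is 1024.0.
    print("matmul checksum:", (x @ x).sum().item())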

kziemski commented 2 days ago

@AlexTo I think this is a bit confusing, because I thought this issue was tied to, and a common issue with, the GPU device not being found within WSL2 and Docker via WSL2. The last version that worked was 537.58; afterwards, running nbody for instance causes "nvidia device not found". I've been waiting for a version past 537.58 that will work in Ubuntu 22.04.
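
For reference, the nbody check can be run through the CUDA container sample, roughly (image tag as in NVIDIA's container samples):

docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

On the driver versions that were broken for me, this is where "nvidia device not found" would show up.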

AlexTo commented 2 days ago

@kziemski I thought so too, as the first thing I did after installing WSL2 was to run nvidia-smi to check for the presence of the GPU. Turns out only nvidia-smi gives errors; everything else seems to work normally.

kziemski commented 2 days ago

@AlexTo As of some 55x.xx version it still wasn't working, but as of 560.70 it does work. I will stick with this driver for a while, but I think that being in 56x.xx territory rather than 537.xx now gives me access to something; I can't remember what at the moment.

"Device not found" definitely happened with the last 55x driver I tried when running an nbody sample, so today's a good day.

AlexTo commented 2 days ago

@kziemski Interesting. So far all versions (538, 54x, 555, 559) work for me, but I'm on an RTX/Quadro, not the GeForce series.

kziemski commented 2 days ago

@AlexTo That might be the difference. Given the way WSLg, the WSL libs, and Docker Desktop all interact, I'm honestly surprised it works at all.