themizzi opened this issue 4 months ago
I'm wondering if it's a question of maintaining support for the current Ubuntu version by NVIDIA/MSFT, and whether Ubuntu 24.04 in WSL works with NVIDIA driver versions past 538, because that would mean support for CUDA past 12.2. I need to get past 12.2 to match the CUDA >= 12.3 compatibility needed by JAX and other libraries.
If it's the case that WSL2 / Ubuntu 24.04 / driver 55x works, then I can think about transitioning to Ubuntu 24.04 from 22.04.
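For what it's worth, the check I'd run after a driver bump to see whether JAX actually picks up the GPU looks roughly like this (a minimal sketch, assuming a CUDA-enabled jax install; the install extras and exact output are assumptions, not something tested on 24.04):
# Quick probe: does JAX see the WSL2 GPU and can it run a kernel?
# Assumes something like `pip install "jax[cuda12]"`; adjust to your setup.
import jax
import jax.numpy as jnp

print(jax.__version__)
print(jax.devices())             # expect a CUDA device here, not just CPU

x = jnp.ones((1024, 1024))
y = (x @ x).block_until_ready()  # force execution on the device
print(y.shape, y.dtype)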
Hi, I'm also having this issue under WSL2. I have tried with Ubuntu 22.04 and 24.04 and the result is the same. I've tried downgrading my driver to 538, but my GPU (RTX 4050) is not compatible with it, as that driver is too old. Any ideas how I can make this work? I really need to be able to use the GPU inside WSL.
Same here, WSL2 Ubuntu 22.04:
Fri Jun 21 13:11:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
Segmentation fault
Windows 11:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 2000 Ada Gene... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 43C P3 11W / 43W | 0MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
@mjjank this is already "solved" by downgrading the driver version. See above.
@jsvetrifork 537, not 538 - the latter is the first version to break on WSL2. If your GPU doesn't support older drivers you're probably out of luck and would have to run stuff in Docker on Windows. That's how I am running higher driver versions these days (so no WSL for this). Granted, if this is not a laptop you'd be hard pressed to run anything but small uni sample code on it. Windows takes ~20% of the VRAM of any GPU that has a monitor attached...
@jsvetrifork I can confirm that 538 does not solve the problem. What is interesting is that, even though nvidia-smi segfaults, I can still use the GPU in Python:
In [21]: from numba import cuda
In [22]: device=cuda.select_device(0)
In [23]: device.name
Out[23]: b'NVIDIA RTX 2000 Ada Generation Laptop GPU'
In [24]:
The segmentation fault appears to happen in NVML (libnvidia-ml.so.1):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 552.55 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff66809a1 in ?? () from /usr/lib/wsl/drivers/nvltwi.inf_amd64_53dae1bddc8c687f/libnvidia-ml.so.1
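To narrow it down, NVML can be poked directly from Python instead of going through the nvidia-smi binary. A rough sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed; if this also dies, the problem is on the libnvidia-ml.so.1 side rather than in nvidia-smi itself:
# Probe NVML directly, bypassing the nvidia-smi binary.
# Assumes: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
try:
    print("driver:", pynvml.nvmlSystemGetDriverVersion())
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("name:", pynvml.nvmlDeviceGetName(handle))
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print("mem used/total:", mem.used, "/", mem.total)
finally:
    pynvml.nvmlShutdown()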
As a smoke test, I tried generating some text using an HF Transformers language model. It didn't crash and the output appeared fine.
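Roughly this kind of thing, sketched from memory (the gpt2 checkpoint here is just a stand-in, not necessarily the model I used):
# Tiny GPU generation smoke test with HF Transformers.
# Assumes: pip install transformers torch  (with a CUDA-enabled torch build)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=0)  # device=0 -> first CUDA GPU
out = generator("nvidia-smi segfaults under WSL2, but", max_new_tokens=20)
print(out[0]["generated_text"])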
As long as you don't need NVML, you might be fine. From the stack trace above, it's not clear whether the problem is on the nvidia-smi or the libnvidia-ml.so.1 side.
The Windows version of nvidia-smi can be run as nvidia-smi.exe from WSL. You can only see the total GPU utilization, though; the per-process breakdown is not displayed (see the query sketch after the output):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.55 Driver Version: 552.55 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T500 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 59C P0 N/A / ERR! | 1563MiB / 4096MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
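For the query sketch mentioned above: if you want to scrape that total utilization from inside WSL scripts, something along these lines should work (assuming nvidia-smi.exe is reachable through the Windows PATH passthrough, and reading only the first GPU's line):
# Query total GPU utilization/memory from inside WSL by shelling out to the
# Windows nvidia-smi. Assumes nvidia-smi.exe is on the PATH via Windows interop.
import subprocess

result = subprocess.run(
    ["nvidia-smi.exe",
     "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
first_gpu = result.stdout.strip().splitlines()[0]
util, used, total = [v.strip() for v in first_gpu.split(",")]
print(f"GPU util {util}%  memory {used}/{total} MiB")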
Has anybody tried some other NVML-based monitoring tool that worked under the older driver versions?
Other sub-commands like nvidia-smi dmon work.
The command nvidia-smi pmon doesn't crash, but it only repeats the following line, even when there are active processes:
0 - - - - - - - - - - -
The command nvidia-smi -q -d MEMORY,UTILIZATION,ECC,TEMPERATURE,POWER,CLOCK,COMPUTE,PIDS,PERFORMANCE,SUPPORTED_CLOCKS,PAGE_RETIREMENT,ACCOUNTING,ENCODER_STATS,SUPPORTED_GPU_TARGET_TEMP,VOLTAGE,FBC_STATS,ROW_REMAPPER,RESET_STATUS,GSP_FIRMWARE_VERSION works without crashing.
Without the -d, nvidia-smi -q crashes after Product Architecture : Turing, at the position where nvidia-smi.exe -q outputs the line Display Mode : Disabled.
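Since pmon shows nothing useful, one more NVML angle worth trying is asking for the process list directly. A sketch using pynvml (nvidia-ml-py); it may well hit the same segfault path, which would itself be informative:
# Ask NVML directly for per-process GPU memory, as an alternative to
# `nvidia-smi pmon`. Assumes: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    for p in procs:
        # usedGpuMemory can be None/unsupported on some platforms
        print("pid", p.pid, "used bytes", p.usedGpuMemory)
finally:
    pynvml.nvmlShutdown()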
Are there any WSL people even in this group? I'd love to be able to update my GPU drivers at some point. Is this NVIDIA's problem!? Microsoft's problem!?
In my experience, it's a hit-or-miss issue depending on the NVIDIA driver version. With wsl-2.3.11 and nvidia-555.99 it works:
elsaco@eleven:~$ wslinfo --wsl-version
2.3.11
elsaco@eleven:~$ nvidia-smi
Fri Jul 19 13:41:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... On | 00000000:01:00.0 On | N/A |
| 0% 43C P8 3W / 220W | 576MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
However, there were times when nvidia-smi would segfault. It only takes one update for it to fail again!
@kziemski I've been using WSL2 with 54x and 55x versions. I can run PyTorch with CUDA, the NVIDIA Container Toolkit, etc. inside my WSL2 Ubuntu without any issues. My code can utilize CUDA normally. I think if you don't care about nvidia-smi giving errors, you can try to upgrade the drivers and see if it has any impact on your code.
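If anyone wants the same sanity check, this is roughly what "CUDA works even though nvidia-smi errors" looks like on my side (a minimal sketch, assuming a CUDA-enabled torch build inside the distro):
# Confirm CUDA works from PyTorch even when nvidia-smi segfaults.
import torch

print(torch.cuda.is_available())       # expect True
print(torch.cuda.get_device_name(0))   # e.g. the GPU model name
a = torch.randn(2048, 2048, device="cuda")
b = a @ a                              # run a real kernel on the GPU
torch.cuda.synchronize()
print(b.norm().item())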
@AlexTo I think this is a bit confusing, because I thought this issue was tied to (and a common case of) the GPU device not being found within WSL2 and within Docker via WSL2. The last version that worked was 537.58; after that, running nbody for instance causes an NVIDIA device not found error. I've been waiting for a version past 537.58 that will work in Ubuntu 22.04.
@kziemski I thought so too, as the first thing I did after installing WSL2 was to run nvidia-smi to check for the presence of the GPU. It turns out only nvidia-smi gives errors; everything else seems to work normally.
@AlexTo Alex, as of some 55x.xx version it still wasn't working, but as of 560.70 it does work. I'll stick with this driver for a while, but I think now that I'm in 56x.xx territory and not 537.xx it allows access to something (I can't remember what at the moment).
The device-not-found error definitely happened with the last 55x I tried when running an nbody sample, so today's a good day.
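In case it helps anyone else on 560.70, here is a rough Python stand-in for the nbody "does the device actually compute" check, assuming numba with CUDA support is installed (the kernel itself is just an illustrative example):
# Minimal "does the GPU actually compute" check with numba, as a stand-in
# for the CUDA nbody sample.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

data = np.zeros(1024, dtype=np.float32)
d_data = cuda.to_device(data)
add_one[(1024 + 255) // 256, 256](d_data)   # launch: blocks, threads per block
print(d_data.copy_to_host()[:4])            # expect [1. 1. 1. 1.]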
@kziemski Interesting; so far, all versions (538, 54x, 555, 559) work for me, but I'm on an RTX/Quadro, not the GeForce series.
@AlexTo That might be the difference. Given the way WSLg, the WSL libs, and Docker Desktop all function, I'm honestly surprised it works at all.
Windows Version
10.0.22631.3235
WSL Version
2.1.4.0
Are you using WSL 1 or WSL 2?
WSL 2
Kernel Version
5.15.146.1-2
Distro Version
Ubuntu 22.04
Other Software
GeForce GTX 1650 Ti with GeForce Game Ready Driver version 551.76
Repro Steps
Run nvidia-smi in Windows and get the following:
Run nvidia-smi in WSL2 Ubuntu and get the following:
Expected Behavior
I am expecting no segmentation fault and successful output in WSL 2.
Actual Behavior
I get a segmentation fault in WSL2 as described above.
Diagnostic Logs
No response