microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

CUDA Version return ERROR! after nvidia-smi command #11589

Open Spirit4471 opened 4 months ago

Spirit4471 commented 4 months ago

Windows Version

Microsoft Windows [Version 10.0.22631.3593]

WSL Version

WSL version: 2.1.5.0

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.146.1

Distro Version

Ubuntu 22.04

Other Software

Visual Studio Code

Repro Steps

CUDA 11.7 PyTorch 1.13 cuDNN

Expected Behavior

20240515215342

Actual Behavior

nvidia-smi output:

Wed May 15 21:49:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.46                 Driver Version: 546.80       CUDA Version: ERR!     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    On  | 00000000:01:00.0  On |                  N/A |
| N/A   54C    P8             16W /   80W |    816MiB /  6144MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

The code worked only right after I first configured the WSL2 + Ubuntu + CUDA + Python + PyTorch development environment. After I rebooted the computer, my code could no longer find an available CUDA device or GPU.

Diagnostic Logs

No response

github-actions[bot] commented 4 months ago

Logs are required for review from WSL team

If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs

Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**:

```
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
```

The script will output the path of the log file once done.

Once completed please upload the output files to this Github issue.

[Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method)

If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

Spirit4471 commented 4 months ago

WslLogs-2024-05-15_22-38-10.zip

github-actions[bot] commented 4 months ago

The log file doesn't contain any WSL traces. Please make sure that you reproduced the issue while the log collection was running.

Diagnostic information

```
.wslconfig found
Detected appx version: 2.1.5.0
Found no WSL traces in the logs
```

Spirit4471 commented 4 months ago

WslLogs-2024-05-15_22-45-15.zip

github-actions[bot] commented 4 months ago

Diagnostic information

```
.wslconfig found
Detected appx version: 2.1.5.0
```

OneBlue commented 4 months ago

@Spirit4471: the nvidia-smi output seems to be correct, what exactly is the issue here ?

Spirit4471 commented 4 months ago

The nvidia-smi output is not correct: you can see that the CUDA version is ERR!, and print(f"CUDA is available: {torch.cuda.is_available()}") returns False.
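For anyone reproducing this, a slightly fuller version of that check might look like the sketch below. It is only an illustration: the helper name `cuda_report` is made up for this example, and it assumes nothing beyond standard PyTorch attributes (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`).

```python
def cuda_report():
    """Collect what PyTorch can see of CUDA without raising if torch is missing.

    Hypothetical helper for illustration; not part of PyTorch or WSL.
    """
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # False here, together with "ERR!" in nvidia-smi, points at the driver
        # layer rather than at the PyTorch install.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` shows a version but `cuda_available` is False, the PyTorch build itself is probably fine and the driver is the more likely suspect.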

onomatopellan commented 4 months ago

> NVIDIA-SMI 545.46 Driver Version: 546.80 CUDA Version: ERR!

$ nvidia-smi
Sat May 18 02:31:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01              Driver Version: 552.12         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         On  |   00000000:01:00.0  On |                  N/A |
| 30%   36C    P8             N/A /   30W |    1026MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       105      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

You need to update the nvidia GPU Windows drivers.

Spirit4471 commented 3 months ago

But when I run nvidia-smi in the Windows terminal, the output is correct. When I run nvidia-smi in WSL2, it shows ERR! for the CUDA version.
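To compare the two environments without eyeballing the tables, the header line can be parsed. This is a rough sketch that assumes the `CUDA Version:` label appears as in the outputs pasted in this thread; the function names are invented for the example.

```python
import re


def cuda_version_field(smi_output):
    """Return the token after 'CUDA Version:' in nvidia-smi text, or None."""
    match = re.search(r"CUDA Version:\s*(\S+)", smi_output)
    return match.group(1) if match else None


def cuda_version_ok(smi_output):
    """True only when the field looks like a real version such as '12.4'."""
    value = cuda_version_field(smi_output)
    return value is not None and re.fullmatch(r"\d+\.\d+", value) is not None


if __name__ == "__main__":
    broken = "| NVIDIA-SMI 545.46 Driver Version: 546.80 CUDA Version: ERR! |"
    working = "| NVIDIA-SMI 550.73.01 Driver Version: 552.12 CUDA Version: 12.4 |"
    print(cuda_version_field(broken), cuda_version_ok(broken))
    print(cuda_version_field(working), cuda_version_ok(working))
```

Feeding it the captured output from Windows and from WSL2 makes the difference explicit.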

onomatopellan commented 3 months ago

The WSL2 version of nvidia-smi depends on the Windows driver version. The nvidia-smi ELF64 binary inside WSL2 updates automatically after installing the latest Windows drivers for your GPU.

In fact, in WSL2 the folder where nvidia-smi resides, /usr/lib/wsl/lib, is just a mount that points to the Windows DriverStore folder.

That's why even NVIDIA themselves recommend:

> Install the Windows 11 nvidia display driver. This is the only driver you need to install. Do not install any Linux display driver in WSL.
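A quick way to see which files are in play is to inspect that folder from inside the distro. The sketch below is illustrative only: `wsl_gpu_files` is a made-up helper, and the `libcuda.so*` pattern is an assumption about the library names found in that mount.

```python
import shutil
from pathlib import Path

# Path quoted in the comment above; on WSL2 it is a mount provided by Windows.
WSL_LIB_DIR = Path("/usr/lib/wsl/lib")


def wsl_gpu_files(lib_dir=WSL_LIB_DIR):
    """Report whether the WSL driver mount exists and what it exposes."""
    lib_dir = Path(lib_dir)
    exists = lib_dir.is_dir()
    return {
        "lib_dir_exists": exists,
        # Assumed naming pattern for the CUDA driver library in the mount.
        "libcuda": sorted(p.name for p in lib_dir.glob("libcuda.so*")) if exists else [],
        # Which nvidia-smi the shell would actually run; a path outside the
        # mount would suggest a Linux driver package was installed in the distro.
        "nvidia_smi_on_path": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    for key, value in wsl_gpu_files().items():
        print(f"{key}: {value}")
```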

Spirit4471 commented 3 months ago

I know. I didn't install the CUDA toolkit in WSL2; WSL2 is using the driver on Windows. When I first set up the development environment, everything worked perfectly. After I restarted the computer, the development environment seems to have broken, and the nvidia-smi command shows ERR! for the CUDA version.

onomatopellan commented 3 months ago

I would try to install another distro. If nvidia-smi works there without error then the problem could be some Ubuntu 22.04 package update.

ab77c commented 1 month ago

I had a similar problem with Podman running on WSL2. In my case the problem was fixed by regenerating the CDI spec after a recent driver upgrade. According to the related NVIDIA Container Toolkit doc page, this needs to be done after each GPU driver update on the host machine:

> If you change the device or CUDA driver configuration, you must generate a new CDI specification. A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
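For reference, regenerating the spec is done with the `nvidia-ctk cdi generate` command from the NVIDIA Container Toolkit. The wrapper below is a sketch under assumptions: the helper names are invented, the output path `/etc/cdi/nvidia.yaml` is the location used in NVIDIA's documentation and may differ on your setup, and writing there typically requires root.

```python
import shutil
import subprocess

# Assumed default location for the generated spec; adjust for your system.
DEFAULT_CDI_SPEC = "/etc/cdi/nvidia.yaml"


def cdi_generate_command(output_path=DEFAULT_CDI_SPEC):
    """Build the argument list for regenerating the CDI specification."""
    return ["nvidia-ctk", "cdi", "generate", f"--output={output_path}"]


def regenerate_cdi_spec(output_path=DEFAULT_CDI_SPEC):
    """Run the command if nvidia-ctk is installed; return its exit code or None."""
    if shutil.which("nvidia-ctk") is None:
        return None  # NVIDIA Container Toolkit not found on PATH
    return subprocess.run(cdi_generate_command(output_path), check=False).returncode


if __name__ == "__main__":
    print(" ".join(cdi_generate_command()))
```

Running this after every Windows GPU driver update would match the advice quoted above.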