microsoft / WSL


Cannot use GPU support in Windows 10 preview build 21354 #6773

Closed: craigloewen-msft closed this issue 3 years ago

craigloewen-msft commented 3 years ago

Notes from the dev team

Hey everyone, we're filing this issue on ourselves to track the fact that GPU Compute support is unavailable in preview build 21354. We have already identified the fix, and it should be out in the next Insiders build. Thanks for your patience here (again 😅) as we resolve this issue!

Environment

Windows build number: 21354
Your Distribution version: All distros
Whether the issue is on WSL 2 and/or WSL 1: WSL 2

Steps to reproduce

Try to use any Linux application that leverages the GPU in WSL on this Windows build; it will not work.

Expected behavior

I should be able to use Linux applications that leverage the GPU.

Actual behavior

Any access to the vGPU in WSL will fail.
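As a concrete check (a minimal sketch; this is the same CUDA sample container used later in this thread), any GPU compute workload fails on this build:

# Inside a WSL 2 distro on build 21354, GPU compute is unavailable, so for example
# this CUDA n-body sample container will not find a usable device:
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark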

patfla commented 3 years ago

Installed 21354 yesterday. Any idea when the next version, with the GPU fix, will drop?

craigloewen-msft commented 3 years ago

This fix should be available in the next Windows Insiders preview build.

achernev commented 3 years ago

@craigloewen-msft I appreciate you guys are working hard and I shouldn't expect rock-hard stability in this line of Windows development, but since it is the only way of using CUDA under WSL in any meaningful way, is there a chance that a team member runs a

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

before a release to ensure that it works? I'm assuming it is an oversight and not some fundamental incompatibility every time. This is the third time this has happened in the past few months, and judging by the number of people who comment on and star these issues, I know I am not alone. It doesn't help that my computer restarts itself overnight, leaving me with a non-functioning system (I know I can turn that off).

Anyway, thanks for all the work you do and I look forward to the fix for this.

thearperson commented 3 years ago

> @craigloewen-msft I appreciate you guys are working hard and I shouldn't expect rock-hard stability in this line of Windows development, but since it is the only way of using CUDA under WSL in any meaningful way, is there a chance that a team member runs a
>
> docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
>
> before a release to ensure that it works? I'm assuming it is an oversight and not some fundamental incompatibility every time. This is the third time this has happened in the past few months, and judging by the number of people who comment on and star these issues, I know I am not alone. It doesn't help that my computer restarts itself overnight, leaving me with a non-functioning system (I know I can turn that off).
>
> Anyway, thanks for all the work you do and I look forward to the fix for this.

+1

Actually, it would be nice to have some way of getting WSL or CUDA on Windows without living on the bleeding edge. Would consider paying for our use case :)

ecly commented 3 years ago

Indeed, is there any information on when CUDA in WSL will no longer require an Insider build?

patfla commented 3 years ago

> no longer require an Insider build

Exactly one of my questions. My other question is: when will the next version, with the GPU fix, drop? I figure that by requiring Insider for WSL CUDA, MS "acquires developer buy-in," so to speak.

My last Linux box died a while ago - I should really get around to replacing it. Had high hopes for WSL[2].

xrstokes commented 3 years ago

Two days I spent trying to get this working, because I specifically moved to the Dev channel for this feature. I'd pay for Windows 10 Pro for Workstations to get it. Why don't you enable DDA in Windows 10 Pro for Workstations while this gets ironed out? I don't think I'd like to live with Server 2019 as my daily driver just to pass a second GPU through to a Linux VM, but I might have to. Roughly how long until the next build? Thanks
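For reference, DDA on a Server SKU is driven by the Hyper-V PowerShell cmdlets. A rough sketch (not a tested recipe; the location path and VM name below are placeholders you would look up yourself):

# PowerShell (elevated) on the Windows host. Find the GPU's PCI location path first, e.g. with:
#   Get-PnpDeviceProperty -InstanceId "<gpu-instance-id>" -KeyName DEVPKEY_Device_LocationPaths
$loc = "PCIROOT(0)#PCI(0100)#PCI(0000)"                        # placeholder location path
Dismount-VMHostAssignableDevice -Force -LocationPath $loc      # detach the GPU from the host
Add-VMAssignableDevice -LocationPath $loc -VMName "linux-vm"   # hand it to the Linux VM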

astroboylrx commented 3 years ago

Besides the GPU issue, I found that I suddenly cannot ssh into WSL2 from Windows using my usual ssh myself@localhost -p2222 command after upgrading to build 21354.1. The error in verbose mode looks like

debug3: finish_connect - ERROR: async io completed with error: 10061, io:0000025310BDCFE0

Simply restarting the sshd service won't fix it. Somehow I have to change the port number in my /etc/ssh/sshd_config file and restart the sshd service to make this work again. Surprisingly, changing the port number to something else (e.g., 2225) works; changing it back to 2222 afterward also works fine. Do you happen to know if this is a related issue and whether it will be fixed in the next build? Many thanks!
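For anyone hitting the same thing, the workaround described above boils down to something like this (a sketch; it assumes sshd inside the distro was listening on port 2222 and that any other free port will do):

# Inside the WSL2 distro: move sshd to a different, unused port and restart it
sudo sed -i 's/^Port 2222/Port 2225/' /etc/ssh/sshd_config
sudo service ssh restart
# Then connect from Windows with the new port:
#   ssh myself@localhost -p2225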

craigloewen-msft commented 3 years ago

> @craigloewen-msft I appreciate you guys are working hard and I shouldn't expect rock-hard stability in this line of Windows development, but since it is the only way of using CUDA under WSL in any meaningful way, is there a chance that a team member runs a
>
> docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
>
> before a release to ensure that it works? I'm assuming it is an oversight and not some fundamental incompatibility every time. This is the third time this has happened in the past few months, and judging by the number of people who comment on and star these issues, I know I am not alone. It doesn't help that my computer restarts itself overnight, leaving me with a non-functioning system (I know I can turn that off).
>
> Anyway, thanks for all the work you do and I look forward to the fix for this.

Agreed! We actually do run GPU tests in WSL as part of our regular process to ensure that we are not breaking anything GPU related when we make changes to WSL. These last breakages have exposed some gaps in our testing, and we are working on fixing them, as we definitely don't want to break you either :). Additionally, GPU compute will be part of the next major Windows release.

0x1orz commented 3 years ago

@craigloewen-msft CUDA in WSL2 kernel 5.4.72 with WIP 21354.1 worked normally, but behaves abnormally with WSL2 kernel 5.4.91 after updating today. I have attempted to turn WSL2 off and then on again, ran Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux, and also failed to install the wsl_update.msi package from https://docs.microsoft.com/en-us/windows/wsl/install-win10. So then, how do I roll back to WSL2 kernel 5.4.72? :( Or completely uninstall WSL2?
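Before reinstalling the kernel package, a quick sanity check (a sketch; the first command runs inside the distro, the second from Windows) is to confirm which kernel is actually running and to fully stop the WSL2 VM:

# Inside the distro: show which WSL2 kernel version is currently running
uname -r
# From Windows (PowerShell or cmd): stop all WSL2 instances before re-installing
# the kernel package (the wsl_update.msi mentioned above)
wsl --shutdown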

AugustKRZhu commented 3 years ago

Hi @craigloewen-msft, waiting for the next release. After the automatic version upgrade the GPU cannot be used, so I'm waiting for the fix before proceeding with my DL project. Thank you.

xrstokes commented 3 years ago

https://archive.org/details/Windows-10-Build-21286 <- I'm desperate, so I'm reinstalling the last build.

jhirschibar commented 3 years ago

Does this mean it is not fixed? https://blogs.windows.com/windows-insider/2021/04/14/announcing-windows-10-insider-preview-build-21359/ There is a bullet point about vGPU in the known issues.

jhirschibar commented 3 years ago

> @craigloewen-msft I appreciate you guys are working hard and I shouldn't expect rock-hard stability in this line of Windows development, but since it is the only way of using CUDA under WSL in any meaningful way, is there a chance that a team member runs a
>
> docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
>
> before a release to ensure that it works? I'm assuming it is an oversight and not some fundamental incompatibility every time. This is the third time this has happened in the past few months, and judging by the number of people who comment on and star these issues, I know I am not alone. It doesn't help that my computer restarts itself overnight, leaving me with a non-functioning system (I know I can turn that off). Anyway, thanks for all the work you do and I look forward to the fix for this.

> Agreed! We actually do run GPU tests in WSL as part of our regular process to ensure that we are not breaking anything GPU related when we make changes to WSL. These last breakages have exposed some gaps in our testing, and we are working on fixing them, as we definitely don't want to break you either :). Additionally, GPU compute will be part of the next major Windows release.

By "major Windows release", do you mean 21H2?

craigloewen-msft commented 3 years ago

This issue has been fixed in the latest preview build 21359; please upgrade your systems and you'll have vGPU access again! I am working on getting the changelog there updated to reflect that vGPU is fixed. Thank you all for your patience!

doktor-ziel commented 3 years ago

I'm afraid it still doesn't work correctly. I have the 470.14_gameready_win10-dch_64bit_international driver installed, and I get the following error:

$ docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.2, please update your driver to a newer version, or use an earlier cuda container: unknown.
ERRO[0001] error waiting for container: context canceled
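One way to see which driver and CUDA versions the container runtime actually detects (a sketch; nvidia-container-cli ships with the NVIDIA Container Toolkit and is also used further down this thread) is:

# Reports the NVRM/CUDA versions the NVIDIA container runtime sees inside WSL2
nvidia-container-cli info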

doktor-ziel commented 3 years ago

I think it can be somehow related to this:

./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GTX 1060 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.3 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6144 MBytes (6442450944 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1342 MHz (1.34 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS

Is it possible that this version mismatch causes my error?

thearperson commented 3 years ago

Actually, I have a separate issue:

$ nvidia-smi                                                                                                 [15:09:13]
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Failed to properly shut down NVML: Driver Not Loaded

doktor-ziel commented 3 years ago

Actually, I have this issue too

doktor-ziel commented 3 years ago

I'm really fresh here, but I think this:

$ nvidia-container-cli info
NVRM version:   460.0
CUDA version:   11.0

Device Index:   0
Device Minor:   0
Model:          UNKNOWN
Brand:          UNKNOWN
GPU UUID:       GPU-00000000-0000-0000-0000-000000000000
Bus Location:   0
Architecture:   UNKNOWN

is strange

thearperson commented 3 years ago

Turns out there is a way to bypass the version check:

https://github.com/NVIDIA/nvidia-container-toolkit/issues/148

$ docker run --gpus all --env NVIDIA_DISABLE_REQUIRE=1 nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3090]
83968 bodies, total time for 10 iterations: 71.278 ms
= 989.179 billion interactions per second
= 19783.574 single-precision GFLOP/s at 20 flops per interaction

AugustKRZhu commented 3 years ago

> This issue has been fixed in the latest preview build 21359; please upgrade your systems and you'll have vGPU access again! I am working on getting the changelog there updated to reflect that vGPU is fixed. Thank you all for your patience!

Hi, thanks, that's great and timely.

xrstokes commented 3 years ago

Oh man, I bought a second-hand PC yesterday just to run the old build and get ML going again. All I had to do was wait 24 hours. Thanks for getting it going again though. It will be faster on my main machine, and it shows that there is commitment to the cause.

jhirschibar commented 3 years ago

> This issue has been fixed in the latest preview build 21359; please upgrade your systems and you'll have vGPU access again! I am working on getting the changelog there updated to reflect that vGPU is fixed. Thank you all for your patience!

> Hi, thanks, that's great and timely.

Agreed! Great to have it working again, thanks everyone!

Asheeshg commented 3 years ago

It worked for me too. However, I'm now getting the error "only 0 Devices available, 1 requested. Exiting." Any idea how this can be resolved?

$ sudo docker run --gpus all --env NVIDIA_DISABLE_REQUIRE=1 nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Error: only 0 Devices available, 1 requested. Exiting.

chro89 commented 3 years ago

The issue still persists for me. Build: 21359.1, NVIDIA driver: 470.14

nvidia-smi.exe works just fine, but the Linux command returns

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Failed to properly shut down NVML: Driver Not Loaded

kernfel commented 3 years ago

> nvidia-smi.exe works just fine, but the Linux command returns

Same here -- but CUDA is running without any issues. This might just be an issue with nvidia-smi at this point.

achernev commented 3 years ago

Could be an NVML thing. nvidia-smi worked with driver version 465.42 and stopped working once I upgraded to 470.14 just now. That's on build 21359.

patfla commented 3 years ago

Upgraded to 21359 - both nvidia-smi and deviceQuery failed with the same errors as before.

Which NVIDIA driver version do I have? 465.89. I checked whether that's the latest (https://www.nvidia.com/Download/index.aspx?lang=en-us); no, there's a 466.11.

Installed that. Didn't help. Same errors again for nvidia-smi and deviceQuery.

MyId@DESKTOP-930EG0A:~$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL

MyId@DESKTOP-930EG0A:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

MyId@DESKTOP-930EG0A:~$

For good measure, I rebooted. Didn't help.

As a practical concern, it looks to me like CUDA programming inside WSL2 is a lost cause.

I don't have a lot of cycles for WSL. I'm spending too much time testing out MS's latest code (and writing messages like this).

Mostly my concern now is how to leave the Windows Insider Program - except when I looked at that before, it didn't seem all that obvious. You can't go back. It seemed the message was: wait until the Insider and release versions are synced up, then exit. Or something to that effect.

patfla commented 3 years ago

Oh right - this is now a closed topic. That's great. Closed so far as MS is concerned, that is.

jhirschibar commented 3 years ago

> Could be an NVML thing. nvidia-smi worked with driver version 465.42 and stopped working once I upgraded to 470.14 just now. That's on build 21359.

Yeah, I think that's the case. Sometimes nvidia-smi requires a bit of a workaround. It was fixed with 465.42, but 470.14 probably still needs it.

achernev commented 3 years ago

> Oh right - this is now a closed topic. That's great. Closed so far as MS is concerned, that is.

Yes, Microsoft has no interest in your NVIDIA problems which are completely unrelated to the original issue under which you are commenting. The issue is closed because it was corrected in the latest build.

patfla commented 3 years ago

Thanks Anton - that's very useful.

craigloewen-msft commented 3 years ago

I've closed this issue out, as the original GPU-support breakage that it was tracking is now fixed. If you're still experiencing GPU-related problems, please open another issue. From the discussion above it seems like this is NVIDIA related; we're also taking a look internally at what could be causing it, but we do not have repros on the few machines we have tried it on.

As a kind reminder, this repository abides by the Microsoft Code of Conduct and we encourage users to be polite and respectful to others on this forum. Thanks all!

Asheeshg commented 3 years ago

Seems like this change was not migrated to build 21364, so please stay on build 21359 for now.

Marietto2008 commented 3 years ago

This bug is not fixed on Windows 21376 (co_release.210503-1432).

https://github.com/microsoft/WSL/issues/6925

mtrabado commented 3 years ago

I currently have version 21354; how can I update to version 21359?

Marietto2008 commented 1 year ago

Hello to everyone.

To give some context: I'm using "Microsoft Windows 11 [Version 10.0.22000.1219]". I installed Ubuntu 22.04 on top of it, then upgraded Ubuntu 22.04 to 22.10, and I upgraded the driver for my RTX 2080 Ti on the host OS. After this, I installed CUDA inside WSL2 / Ubuntu following this mini tutorial:

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_network

Get the latest feature updates to NVIDIA's proprietary compute stack.

These are the commands that I issued:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

At this point I did:

apt install nvidia-utils-510

because without this package it won't find the nvidia-smi utility. But this is the error that I got next:

# nvidia-smi

Failed to initialize NVML: GPU access blocked by the operating system
Failed to properly shut down NVML: GPU access blocked by the operating system

What should I do now? The driver installed on the host OS is version 31.0.15.2686.
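One hedged sanity check (a sketch, not an official diagnostic): in WSL2 the GPU is exposed through /dev/dxg, and the Windows driver package mounts its user-mode libraries (including nvidia-smi and libcuda) under /usr/lib/wsl/lib, so installing nvidia-utils inside the distro can shadow the copy that actually works:

# Inside the WSL2 distro: the vGPU paravirtualization device should exist
ls -l /dev/dxg
# The Windows driver package mounts its user-mode libraries here
ls /usr/lib/wsl/lib
# Run the Windows-provided nvidia-smi explicitly rather than the one from nvidia-utils-510
/usr/lib/wsl/lib/nvidia-smi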