microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

WSL2 & CUDA does not work [v20226] #6014

Closed noofaq closed 3 years ago

noofaq commented 3 years ago

Environment

Windows build number: 10.0.20226.0
Your Distribution version: 18.04 / 20.04
Whether the issue is on WSL 2 and/or WSL 1: Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020

Steps to reproduce

I followed the instructions at https://docs.nvidia.com/cuda/wsl-user-guide/index.html exactly. Tested on a previously working Ubuntu WSL image (IIRC the GPU last worked on 20206, then the whole of WSL2 stopped working). Also tested on newly created Ubuntu 18.04 and Ubuntu 20.04 images.

I have tested CUDA compatible NVIDIA drivers 455.41 & 460.20. I have tried removing all drivers etc. I have also tested using CUDA 10.2 & CUDA 11.0.

It was tested on two separate machines (one Intel + GTX1060, other Ryzen + RTX 2080Ti)

I tested the issue both directly in the distribution and in Docker containers running inside it.

Example (directly in Ubuntu):

piotr@DESKTOP-FS6J3NT:/usr/local/cuda/samples/4_Finance/BlackScholes$ ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_CallResult, OPT_SZ)"

Example in container:

piotr@DESKTOP-FS6J3NT:/mnt/c/Users/pppnn$ docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter python
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-10-01 14:18:07.538627: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-10-01 14:18:07.624188: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-10-01 14:18:32.359457: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-01 14:18:32.398949: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3200035000 Hz
2020-10-01 14:18:32.402692: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d06b70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-01 14:18:32.402748: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-01 14:18:32.409370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-01 14:18:32.877228: W tensorflow/compiler/xla/service/platform_util.cc:276] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2020-10-01 14:18:32.877370: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-10-01 14:18:32.879904: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:32.880192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:18:32.880277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:18:32.880340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:18:32.959947: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:18:32.973554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:18:33.111736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:18:33.127902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:18:33.128018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:18:33.128535: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:33.129170: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:33.129403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-01 14:18:33.131671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1513, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
>>>
>>>
>>>
>>>
>>> tf.config.list_physical_devices('GPU')
2020-10-01 14:18:55.610151: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.610510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:18:55.610579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:18:55.610623: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:18:55.610676: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:18:55.610719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:18:55.610762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:18:55.610805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:18:55.610846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:18:55.611251: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.611765: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.611999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
>>>
>>>
>>> tf.test.gpu_device_name()
2020-10-01 14:20:08.762060: W tensorflow/compiler/xla/service/platform_util.cc:276] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2020-10-01 14:20:08.762222: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-10-01 14:20:08.762863: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.763201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:20:08.763263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:20:08.763316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:20:08.763358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:20:08.763379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:20:08.763428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:20:08.763480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:20:08.763533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:20:08.763898: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.764536: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.764810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 112, in gpu_device_name
    for x in device_lib.list_local_devices():
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
>>>

Expected behavior

CUDA working inside WSL2

Actual behavior

All tests that use CUDA inside WSL Ubuntu fail with various CUDA errors, mostly reporting that no CUDA devices are available.
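
When comparing reports like the ones above, it can help to pull the numeric code and symbolic name out of a CUDA sample's failure line. The following is a minimal sketch, assuming the `code=N(name)` format shown in the BlackScholes output; `parse_cuda_sample_error` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re
from typing import Optional, Tuple

# Matches the failure line printed by the CUDA sample, for example:
# CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc(...)"
# The format is assumed from the output quoted in this thread.
_ERROR_RE = re.compile(
    r"CUDA error at (?P<file>\S+):(?P<line>\d+) "
    r"code=(?P<code>\d+)\((?P<name>\w+)\)"
)


def parse_cuda_sample_error(output: str) -> Optional[Tuple[str, int, int, str]]:
    """Return (source file, line, error code, error name), or None if no error line is found."""
    match = _ERROR_RE.search(output)
    if match is None:
        return None
    return (match["file"], int(match["line"]), int(match["code"]), match["name"])


sample = (
    "...allocating GPU memory for options.\n"
    "CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "
    '"cudaMalloc((void **)&d_CallResult, OPT_SZ)"\n'
)
print(parse_cuda_sample_error(sample))
```

A helper like this only extracts what the sample printed; it does not explain why the device is unavailable.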

kunc commented 3 years ago

I am having the same issue. Everything was working flawlessly this morning, but then I updated from 20221.1000 to 20226.1000 and it no longer works (I tried reinstalling the NVIDIA drivers, etc.). The error says that all CUDA devices are busy or unavailable.

Edit: After going back to version 20221, everything works again, which confirms that the new version caused the problem.

benhillis commented 3 years ago

Can you share the contents of c:\Windows\System32\lxss\lib?

dfreelan commented 3 years ago

Having same issue. Here's my C:\WINDOWS\System32\lxss\lib.

09/17/2020  01:24 PM       124,664 libcuda.so
09/17/2020  01:24 PM       124,664 libcuda.so.1
09/17/2020  01:24 PM       124,664 libcuda.so.1.1
09/17/2020  01:24 PM    40,980,456 libnvwgf2umx.so

CarbonPool commented 3 years ago

Oh, too bad, I also ran into this problem. I was so happy when WSL worked again in build 20226, but CUDA does not work, so I was left out in the cold. I tried the following solutions, but none of them worked for me.

  1. Reinstall the graphics card driver 460.20.

  2. Recompile cuda dependent environment library.

  3. Uninstall wsl2 and kernel program and reinstall.

benhillis commented 3 years ago

Interesting, you seem to be missing the libdxcore libraries.

dfreelan commented 3 years ago

I reverted my windows back to the previous version, then reinstalled the 20226 build, and now it looks like this:

09/17/2020  01:24 PM       124,664 libcuda.so
09/17/2020  01:24 PM       124,664 libcuda.so.1
09/17/2020  01:24 PM       124,664 libcuda.so.1.1
09/26/2020  03:32 PM       832,936 libd3d12.so
09/26/2020  03:32 PM     5,115,392 libd3d12core.so
09/26/2020  03:32 PM    25,074,040 libdirectml.so
09/26/2020  03:32 PM       878,768 libdxcore.so
09/17/2020  01:24 PM    40,980,456 libnvwgf2umx.so
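
The comparison benhillis makes between the two listings can be sketched as a small check. This assumes the eight file names in the second listing above are the complete expected set for this driver and build, which the thread suggests but does not state; `missing_wsl_gpu_libs` is a hypothetical helper.

```python
from typing import Iterable, List

# File names taken from the second directory listing in this thread.
# Treating them as the full expected set is an assumption.
EXPECTED_LIBS = (
    "libcuda.so",
    "libcuda.so.1",
    "libcuda.so.1.1",
    "libd3d12.so",
    "libd3d12core.so",
    "libdirectml.so",
    "libdxcore.so",
    "libnvwgf2umx.so",
)


def missing_wsl_gpu_libs(present: Iterable[str]) -> List[str]:
    """Return the expected library names absent from `present`, in EXPECTED_LIBS order."""
    present_set = set(present)
    return [name for name in EXPECTED_LIBS if name not in present_set]


# The first listing in this thread contained only the NVIDIA driver files.
first_listing = ["libcuda.so", "libcuda.so.1", "libcuda.so.1.1", "libnvwgf2umx.so"]
print(missing_wsl_gpu_libs(first_listing))
```

An empty result does not prove that CUDA works: several participants later report all eight files present and still get cudaErrorDevicesUnavailable.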

adamfarquhar commented 3 years ago

I am having the same problem: Windows 10 build 20226 and Nvidia driver 460.20. It is great to see that it is not just my install. I hope that this can be fixed soon.

And now I can also confirm that it will work if you roll back to the previous build 20221. You can download the (old) iso file from Microsoft and re-install without losing any data.

jin8495 commented 3 years ago

Same problem here, Nvidia driver 460.20 and build 20226.

CarbonPool commented 3 years ago

Can you share the contents of c:\Windows\System32\lxss\lib?

lib_list

geneing commented 3 years ago

I have the same problem with Nvidia driver 460.15, build 20226. It worked with the previous Insider build.

noofaq commented 3 years ago

Can you share the contents of c:\Windows\System32\lxss\lib?

image

I also looked in the previous Windows version's folder: image

mitch-at-orika commented 3 years ago

Same problem: Nvidia driver 460.20 and build 20226. The contents of my lxss\lib are: image

aticie commented 3 years ago

I have the same problem in 20226. My build also contains same 8 files in lxss\lib. But I get cudaErrorDevicesUnavailable.

Is there a way to roll back 20221? Using "Go back to previous version of Windows 10" sends me to 19041.508.

kunc commented 3 years ago

I have the same problem in 20226. My build also contains same 8 files in lxss\lib. But I get cudaErrorDevicesUnavailable.

Is there a way to roll back 20221? Using "Go back to previous version of Windows 10" sends me to 19041.508.

It worked for me. Are you sure you went to 20226 from 20221? I think it stores only the last version as a backup; the option is no longer available for me now that I have rolled back from 20226 to 20221.

adamfarquhar commented 3 years ago

I have the same problem in 20226. My build also contains same 8 files in lxss\lib. But I get cudaErrorDevicesUnavailable.

Is there a way to roll back 20221? Using "Go back to previous version of Windows 10" sends me to 19041.508.

Yes, you can install 20221 from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced

kivancguckiran commented 3 years ago

It seems that it is not possible to downgrade Windows without losing apps and files, which is not an option for me under these circumstances. Does anyone know another solution? Or do we wait for Microsoft to fix the problem?

I too have version 20226.

PRIMA-LAB-IPU commented 3 years ago

Same here.

$ nvidia-smi.exe
Fri Oct  2 23:54:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.15       Driver Version: 460.15       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 207... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P5    12W /  N/A |    176MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1752    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      2424    C+G   ...b3d8bbwe\WinStore.App.exe    N/A      |
|    0   N/A  N/A      3500    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A      5536    C+G   ...m Files\VcXsrv\vcxsrv.exe    N/A      |
|    0   N/A  N/A      8288    C+G   ...batNotificationClient.exe    N/A      |
|    0   N/A  N/A     10104    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     11152    C+G   ...qxf38zg5c\Skype\Skype.exe    N/A      |
|    0   N/A  N/A     11512    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     11548    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A     11832    C+G   ...3m\Quick Eye\QuickEye.exe    N/A      |
|    0   N/A  N/A     11996    C+G   ...8wekyb3d8bbwe\Cortana.exe    N/A      |
|    0   N/A  N/A     12856    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     13608    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     14484    C+G   ...re1.8.0_261\bin\javaw.exe    N/A      |
|    0   N/A  N/A     15152    C+G   ...qxf38zg5c\Skype\Skype.exe    N/A      |
|    0   N/A  N/A     15620    C+G   ...he8kybcnzg4\app\Slack.exe    N/A      |
|    0   N/A  N/A     16728    C+G   ...ropbox\Client\Dropbox.exe    N/A      |
|    0   N/A  N/A     18824    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     19316    C+G   ...arp.BrowserSubprocess.exe    N/A      |
|    0   N/A  N/A     22372    C+G   ...obeNotificationClient.exe    N/A      |
+-----------------------------------------------------------------------------+

ChengyuSheu commented 3 years ago

Thanks, @adamfarquhar. Rolling back to version 20201 resolved this issue. Although some settings were removed, my files remained.

lminer commented 3 years ago

Same problem.

Rollback to previous version fixes it. For people who want to do it without reinstalling, go to Recovery > restore previous version of windows

aisensiy commented 3 years ago

I had the "remote procedure call failed" error in the previous version, and I have this issue after upgrading. So... if I roll back, does that mean I will get the "remote procedure call failed" error back? 😿

sirisian commented 3 years ago

@kivancguckiran I just joined the insider build so I'm in the same boat. It would probably take like 4 hours, but you could probably revert windows to the previous version (non-insider) maybe then go specifically to 20221. I'm not going to try it and just wait though.

strarsis commented 3 years ago

+1, same issue here.

The kernel, driver and other versions are above the required minimum, so CUDA in WSL 2 should work. However, when running the NVIDIA samples built with make, they always fail to run:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

cat /usr/local/cuda/version.txt
CUDA Version 11.0.228

bbongcol commented 3 years ago

I have the same problem in 20226.

Cuda device query is ok.

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2060"
  CUDA Driver Version / Runtime Version          11.2 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 6144 MBytes (6442450944 bytes)
  (30) Multiprocessors, ( 64) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1200 MHz (1.20 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

But CUDA programs do not work.

[./BlackScholes] - Starting...
GPU Device 0: "GeForce RTX 2060" with compute capability 7.5

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_CallResult, OPT_SZ)"

The strace log is attached: BlackScholes_cuda_error_log.zip

liamhan0905 commented 3 years ago

I also have tensorflow-gpu on WSL2. But I'm getting the error message as shown below.

RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

Following this link resolved the issue for me! It seems like my issue was also the Windows10 Insider Previews... smh. Simply following "Roll Back Soon After Enabling Insider Previews" section solved it for me (current version: 10.0.20221 Build 20221) and now I can train my model again using tensorflow-gpu. Thank you everyone for the help!

onomatopellan commented 3 years ago

  • Windows 10 Version 2004 (Build 19041.546)

@strarsis In your case you need to use a Windows Insider build from the Dev Channel (build >=20150). CUDA in WSL2 won't work in build 19041.
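
The build requirement can be expressed as a simple check. This is a sketch under the assumptions stated in this thread: the minimum of 20150 comes from the comment above, and the sets of working and broken builds are only what participants here reported, not an official compatibility list. `build_number` and `classify_build` are hypothetical helpers.

```python
MIN_CUDA_WSL_BUILD = 20150        # minimum Dev Channel build, per the comment above
REPORTED_BROKEN = {20226, 20231}  # builds several participants reported as failing
REPORTED_WORKING = {20221}        # build reported working after a rollback


def build_number(version: str) -> int:
    """Extract the build from a version string such as '10.0.20226.0'."""
    parts = version.split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected version string: {version!r}")
    return int(parts[2])


def classify_build(version: str) -> str:
    """Classify a Windows version string using only what this thread reports."""
    build = build_number(version)
    if build < MIN_CUDA_WSL_BUILD:
        return "unsupported: below the minimum build for CUDA in WSL 2"
    if build in REPORTED_BROKEN:
        return "reported broken in this thread"
    if build in REPORTED_WORKING:
        return "reported working in this thread"
    return "unknown: not reported in this thread"


for v in ("10.0.19041.546", "10.0.20221.1000", "10.0.20226.0"):
    print(v, "->", classify_build(v))
```

The labels are deliberately cautious: at least one participant reports no problem on 20226, so a "reported broken" result depends on configuration.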

strarsis commented 3 years ago

@onomatopellan: How long do I have to wait to get this support in stable Windows 10?

onomatopellan commented 3 years ago

@strarsis This is expected for 21H1 aka April 2021.

strarsis commented 3 years ago

@onomatopellan: To use this now, do I have to register for Windows Insider and download the ISO, or can I use the Windows updater? Are there any downsides to using a Windows Insider version, such as performance or stability?

Meeka33 commented 3 years ago

This stopped working for me as well. winver 2004 20226 with CUDA. It previously was working until yesterday on previous builds. When will this be fixed? Too many recurring bugs, ready to dump windows

strarsis commented 3 years ago

So even with the latest Windows Insider build CUDA in WSL 2 would currently fail, right?

jenatali commented 3 years ago

Can you report whether ldconfig -p includes the CUDA and libdxcore binaries? E.g.:

ldconfig -p | grep dxcore
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so

ldconfig -p | grep cuda
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1

mcecchi commented 3 years ago

+1 Same problem

onomatopellan commented 3 years ago

@onomatopellan: To use this now, I have to register for Windows Insider, download the ISO - or can I use the Windows updater?

@strarsis The easy way is to update to the Dev Channel via Windows Update. You can try new things, and if something is broken you have 10 days to roll back to your present build. The hard way, which is the one I'm using, is dual booting to a vhdx.

Any downsides to using Windows Insider version like performance or stability?

The Dev channel is living on the edge. Every 1~2 weeks there is a new build that brings new things but can break something. This is why it is not recommended to use Dev Insider builds on your daily PC, but options like dual booting and rollback are actually useful and make it feasible.

So even with the latest Windows Insider build CUDA in WSL 2 would currently fail, right?

It depends on your configuration. For example, I don't have this issue in the latest 20226 build.

Boristype000 commented 3 years ago

I have the same problem in 20226. My build also contains same 8 files in lxss\lib. But I get cudaErrorDevicesUnavailable. Is there a way to roll back 20221? Using "Go back to previous version of Windows 10" sends me to 19041.508.

Yes, you can install 20221 from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced

Could only find version 20201 on this page.

noofaq commented 3 years ago

Can you report whether ldconfig -p includes the CUDA and libdxcore binaries? E.g.:

ldconfig -p | grep dxcore
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so

ldconfig -p | grep cuda
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1

@jenatali

$ ldconfig -p | grep cuda
        libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1
$ ldconfig -p | grep dxcore
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so

bbongcol commented 3 years ago

Can you report whether ldconfig -p includes the CUDA and libdxcore binaries? E.g.:

ldconfig -p | grep dxcore
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so

ldconfig -p | grep cuda
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1

$ ldconfig -p | grep cuda
        libicudata.so.60 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.60
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1

$ ldconfig -p | grep dxcore
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so
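
The ldconfig check requested above can be automated roughly as follows. This is a minimal sketch that parses `ldconfig -p` style text and assumes the two library names and the /usr/lib/wsl/lib location shown in the replies; `parse_ldconfig` and `check_wsl_gpu_libs` are hypothetical helpers.

```python
from typing import Dict


def parse_ldconfig(output: str) -> Dict[str, str]:
    """Map library name to resolved path for each `name (flags) => path` line."""
    libs: Dict[str, str] = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # skip the header line and anything that is not a cache entry
        left, path = line.split("=>", 1)
        name = left.split("(", 1)[0].strip()
        # Keep the first entry for a name; ldconfig may list a name more than once.
        libs.setdefault(name, path.strip())
    return libs


def check_wsl_gpu_libs(output: str) -> Dict[str, bool]:
    """Report whether libcuda.so.1 and libdxcore.so resolve under /usr/lib/wsl/lib."""
    libs = parse_ldconfig(output)
    return {
        name: libs.get(name, "").startswith("/usr/lib/wsl/lib/")
        for name in ("libcuda.so.1", "libdxcore.so")
    }


# Lines copied from the replies above. On a real system the text could come from
# subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
sample = """\
        libicudata.so.66 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.66
        libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1
        libdxcore.so (libc6,x86-64) => /usr/lib/wsl/lib/libdxcore.so
"""
print(check_wsl_gpu_libs(sample))
```

As the replies show, both libraries were registered on affected systems, so passing this check did not rule out the bug.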

benhillis commented 3 years ago

We have identified the issue and are working on a fix.

tadam98 commented 3 years ago

I had the same problem. As advised, I went to Settings > Recovery and went back to the previous version, which is 20221, and it is working. (Before I found this wonderful post, I had completely removed all NVIDIA software and reinstalled CUDA and cuDNN, and ... nothing.)

So thank you very much for the workaround.

laiviet commented 3 years ago

I have the same problem with build 20226, Ubuntu 18.04, CUDA toolkit 11-0.

laiviet commented 3 years ago

I had the same problem. Simply, as advised, I went to settings/recovery and went back to previous version which is 20221 and it is working. (until I got to this wonderful post, I have completely removed all of NVIDIA software, reinstalled cuda and cudnn and ... nothing.

So thank you very much for the workaround.

I joined the Insider Program and jumped directly to build 20226. How can I downgrade to a specific build that I have never installed before?

Thanks

tadam98 commented 3 years ago

Search for "here" earlier in this thread, select your OS, download the ISO from Microsoft, and run it.

tadam98 commented 3 years ago

By the way, Nvidia's "default" install is CUDA 11.1 and cuDNN 8.0, but most software requires earlier versions, which work perfectly well.

For example, there is no mxnet for cuda 11.x.

Here is how to install a previous version. Start here (do not fall into the trap of following the Ubuntu installation; work through the cryptic WSL installation instead): https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#wsl-x86_64-rpm

Note: if for some reason your apt-get update/upgrade/dist-upgrade causes CUDA 11.x to be installed over your previous version, simply run the three dpkg -i commands for cuda, cudnn and cudnn-dev, and your version will be back online. Usually nothing else is needed.

adamfarquhar commented 3 years ago

@kivancguckiran I just joined the insider build so I'm in the same boat. It would probably take like 4 hours, but you could probably revert windows to the previous version (non-insider) maybe then go specifically to 20221. I'm not going to try it and just wait though.

I went through the process to back off to the previous version. While I didn't time it, it took more like 15 minutes than 4 hours. The details will, of course, vary with your setup, time to create a backup, and so on.

lminer commented 3 years ago

Is this fixed in 20231?

cktlco commented 3 years ago

I confirm the same issue exists in 20231.1000

tadam98 commented 3 years ago

I paused automatic updates for 7 days. NVIDIA just released driver 456.71, but I am keeping 460.20.

gpict commented 3 years ago

Can confirm that this is currently still happening in the most recent version -- 20231

tadam98 commented 3 years ago

It is working with 20221 but when testing the nvidia-docker-2 I get this error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0084] error waiting for container: context canceled

Can anyone advise how to get the NVIDIA container to run? I followed these links:
https://medium.com/@dalgibbard/docker-with-gpu-support-in-wsl2-ebbc94251cf5
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

blackliner commented 3 years ago

Maybe you went back too far, please double check your winver.

tadam98 commented 3 years ago

image