microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

WSL2 & CUDA does not work [v20226] #6014

Closed noofaq closed 3 years ago

noofaq commented 3 years ago

Environment

Windows build number: 10.0.20226.0
Your Distribution version: 18.04 / 20.04
Whether the issue is on WSL 2 and/or WSL 1: WSL 2 (Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020)

Steps to reproduce

I exactly followed the instructions available at https://docs.nvidia.com/cuda/wsl-user-guide/index.html. Tested on a previously working Ubuntu WSL image (IIRC the GPU last worked on 20206, then the whole of WSL2 stopped working). Also tested on newly created Ubuntu 18.04 and Ubuntu 20.04 images.

I have tested the CUDA-compatible NVIDIA drivers 455.41 & 460.20. I have tried removing all drivers, etc. I have also tested with CUDA 10.2 & CUDA 11.0.

It was tested on two separate machines (one Intel + GTX 1060, the other Ryzen + RTX 2080 Ti).

The issue was tested both directly in the OS and inside Docker containers.
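A quick way to narrow failures like the ones below is to probe the driver API directly, bypassing the CUDA runtime and TensorFlow entirely. This is a hypothetical diagnostic sketch, not part of the original report: the function names are mine, and the error table covers only a small subset of the driver's real error codes from cuda.h.

```python
import ctypes

# Small subset of CUDA driver error codes (values from cuda.h);
# anything outside this table is reported with its raw number.
CU_ERRORS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}

def describe(code):
    """Map a cuInit() return code to a readable name."""
    return CU_ERRORS.get(code, "unrecognized error %d" % code)

def probe():
    """Call cuInit(0) directly via libcuda and report the result."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        # On WSL2 this usually means GPU passthrough is not exposing the driver.
        return "libcuda.so.1 not found (no GPU passthrough?)"
    return describe(libcuda.cuInit(0))

if __name__ == "__main__":
    print(probe())
```

If `cuInit` itself fails here, the problem is below Docker and TensorFlow, which matches the symptoms reported in this issue.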

Example (directly in Ubuntu):

piotr@DESKTOP-FS6J3NT:/usr/local/cuda/samples/4_Finance/BlackScholes$ ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_CallResult, OPT_SZ)"

Example in container:

piotr@DESKTOP-FS6J3NT:/mnt/c/Users/pppnn$ docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter python
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-10-01 14:18:07.538627: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-10-01 14:18:07.624188: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-10-01 14:18:32.359457: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-01 14:18:32.398949: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3200035000 Hz
2020-10-01 14:18:32.402692: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d06b70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-01 14:18:32.402748: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-01 14:18:32.409370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-01 14:18:32.877228: W tensorflow/compiler/xla/service/platform_util.cc:276] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2020-10-01 14:18:32.877370: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-10-01 14:18:32.879904: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:32.880192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:18:32.880277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:18:32.880340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:18:32.959947: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:18:32.973554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:18:33.111736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:18:33.127902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:18:33.128018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:18:33.128535: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:33.129170: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:33.129403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-01 14:18:33.131671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 1513, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
>>>
>>>
>>>
>>>
>>> tf.config.list_physical_devices('GPU')
2020-10-01 14:18:55.610151: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.610510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:18:55.610579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:18:55.610623: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:18:55.610676: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:18:55.610719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:18:55.610762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:18:55.610805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:18:55.610846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:18:55.611251: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.611765: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:18:55.611999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
>>>
>>>
>>> tf.test.gpu_device_name()
2020-10-01 14:20:08.762060: W tensorflow/compiler/xla/service/platform_util.cc:276] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2020-10-01 14:20:08.762222: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-10-01 14:20:08.762863: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.763201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-01 14:20:08.763263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-01 14:20:08.763316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-01 14:20:08.763358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-01 14:20:08.763379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-01 14:20:08.763428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-01 14:20:08.763480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-01 14:20:08.763533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-01 14:20:08.763898: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.764536: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:1d:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-01 14:20:08.764810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/test_util.py", line 112, in gpu_device_name
    for x in device_lib.list_local_devices():
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
>>>

Expected behavior

CUDA working inside WSL2

Actual behavior

All tests that use CUDA inside WSL Ubuntu result in various CUDA errors, mostly reporting that no CUDA devices are available.

blackliner commented 3 years ago

Did you reinstall nvidia-docker2 after the rollback?

sudo apt-get update
sudo apt-get install -y --reinstall nvidia-docker2

Oh, and did you sudo service docker start ?

tadam98 commented 3 years ago

It was the first-time install of nvidia-docker2 anyway. Here is what I did:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
ubuntu18.04
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
OK
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Do not skip the experimental
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
# restart the docker desktop (WSL2 it is on the PC)
# actually - rebooted the PC.

Here is what I did now:

mickey@MICKEY-2080TI:~$ sudo apt-get update -y
Hit:1 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Ign:2 http://ppa.launchpad.net/videolan/stable-daily/ubuntu bionic InRelease
Hit:3 http://dl.google.com/linux/chrome/deb stable InRelease
Err:4 http://ppa.launchpad.net/videolan/stable-daily/ubuntu bionic Release
  404  Not Found [IP: 91.189.95.83 80]
Hit:5 http://packages.microsoft.com/repos/vscode stable InRelease
Err:6 http://debian.sourcegear.com/ubuntu bionic InRelease
  403  Forbidden [IP: 52.216.186.90 80]
Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:9 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:10 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Get:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:13 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:14 https://download.docker.com/linux/ubuntu bionic InRelease
Hit:15 https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/amd64  InRelease
Hit:16 https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/amd64  InRelease
Hit:17 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease
Hit:18 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease
Hit:19 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease
Hit:20 https://packages.lunarg.com/vulkan bionic InRelease
Reading package lists... Done
E: The repository 'http://ppa.launchpad.net/videolan/stable-daily/ubuntu bionic Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Failed to fetch http://debian.sourcegear.com/ubuntu/dists/bionic/InRelease  403  Forbidden [IP: 52.216.186.90 80]
E: The repository 'http://debian.sourcegear.com/ubuntu bionic InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/libnvidia-container-experimental.list:1 and /etc/apt/sources.list.d/nvidia-container-runtime.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/libnvidia-container-experimental.list:1 and /etc/apt/sources.list.d/nvidia-container-runtime.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/libnvidia-container-experimental.list:1 and /etc/apt/sources.list.d/nvidia-container-runtime.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/libnvidia-container-experimental.list:1 and /etc/apt/sources.list.d/nvidia-container-runtime.list:1
mickey@MICKEY-2080TI:~$ sudo apt-get install -y --reinstall nvidia-docker2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  cuda-11-1 cuda-command-line-tools-11-1 cuda-compiler-11-1 cuda-cudart-11-1 cuda-cudart-dev-11-1 cuda-cuobjdump-11-1
  cuda-cupti-11-1 cuda-cupti-dev-11-1 cuda-demo-suite-11-1 cuda-documentation-11-1 cuda-driver-dev-11-1 cuda-gdb-11-1
  cuda-libraries-11-1 cuda-libraries-dev-11-1 cuda-memcheck-11-1 cuda-nsight-11-1 cuda-nsight-compute-11-1
  cuda-nsight-systems-11-1 cuda-nvcc-11-1 cuda-nvdisasm-11-1 cuda-nvml-dev-11-1 cuda-nvprof-11-1 cuda-nvprune-11-1
  cuda-nvrtc-11-1 cuda-nvrtc-dev-11-1 cuda-nvtx-11-1 cuda-nvvp-11-1 cuda-runtime-11-1 cuda-samples-11-1
  cuda-sanitizer-11-1 cuda-toolkit-11-1 cuda-tools-11-1 cuda-visual-tools-11-1 golang-docker-credential-helpers
  libcublas-11-1 libcublas-dev-11-1 libcufft-11-1 libcufft-dev-11-1 libcurand-11-1 libcurand-dev-11-1 libcusolver-11-1
  libcusolver-dev-11-1 libcusparse-11-1 libcusparse-dev-11-1 libnpp-11-1 libnpp-dev-11-1 libnvjpeg-11-1
  libnvjpeg-dev-11-1 nsight-compute-2020.2.0 nsight-systems-2020.3.4 python-backports.ssl-match-hostname
  python-cached-property python-certifi python-chardet python-docker python-dockerpty python-dockerpycreds
  python-docopt python-funcsigs python-functools32 python-jsonschema python-mock python-openssl python-pbr
  python-requests python-texttable python-urllib3 python-websocket python-yaml
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 8 not upgraded.
Need to get 0 B/5912 B of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 205537 files and directories currently installed.)
Preparing to unpack .../nvidia-docker2_2.5.0-1_all.deb ...
Unpacking nvidia-docker2 (2.5.0-1) over (2.5.0-1) ...
Setting up nvidia-docker2 (2.5.0-1) ...
mickey@MICKEY-2080TI:~$ sudo service docker start
 * Starting Docker: docker
mickey@MICKEY-2080TI:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

mickey@MICKEY-2080TI:~$ docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
mickey@MICKEY-2080TI:~$
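The "could not select device driver" error above typically means the Docker daemon does not know about the NVIDIA runtime that nvidia-docker2 registers in /etc/docker/daemon.json. A hedged sketch of a check for that (the helper function is my own; the expected file shape is what nvidia-docker2 normally installs):

```python
import json

def has_nvidia_runtime(daemon_json_text):
    """Return True if a Docker daemon.json registers the 'nvidia' runtime,
    as nvidia-docker2 does when it writes /etc/docker/daemon.json."""
    try:
        cfg = json.loads(daemon_json_text)
    except ValueError:
        return False  # missing or malformed config: runtime not registered
    return "nvidia" in cfg.get("runtimes", {})

if __name__ == "__main__":
    # On a real system you would read /etc/docker/daemon.json instead.
    sample = ('{"runtimes": {"nvidia": '
              '{"path": "nvidia-container-runtime", "runtimeArgs": []}}}')
    print(has_nvidia_runtime(sample))
```

If the check fails, reinstalling nvidia-docker2 and restarting the docker service (as suggested earlier in the thread) should restore the entry.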
tadam98 commented 3 years ago

And checking the GPU under TensorFlow works fine (see the end):

(blurmvp3.7g) mickey@MICKEY-2080TI:~$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/mickey/miniconda3/envs/blurmvp3.7g/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
>>> tf.__version__
'1.14.0'
>>> tf.test.is_gpu_available()
2020-10-08 17:05:37.772456: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-08 17:05:38.022317: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3493440000 Hz
2020-10-08 17:05:38.034860: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56364a3d9840 executing computations on platform Host. Devices:
2020-10-08 17:05:38.034907: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-10-08 17:05:38.064028: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-10-08 17:05:38.560420: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:38.560712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.65 pciBusID: 0000:09:00.0
2020-10-08 17:05:38.577694: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-10-08 17:05:39.146697: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-10-08 17:05:39.302286: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-10-08 17:05:39.355665: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-10-08 17:05:40.012276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-10-08 17:05:40.299079: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-10-08 17:05:41.381062: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-10-08 17:05:41.381777: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:41.382683: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:41.382892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-10-08 17:05:41.393734: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-10-08 17:05:41.741039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-08 17:05:41.741080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-10-08 17:05:41.741116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-10-08 17:05:41.742250: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:41.742524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1409] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2020-10-08 17:05:41.743173: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:41.743881: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2020-10-08 17:05:41.744115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 9630 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2020-10-08 17:05:41.758000: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56364e3c7ac0 executing computations on platform CUDA. Devices:
2020-10-08 17:05:41.758038: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
True

tadam98 commented 3 years ago

I executed a very heavy ML process in this environment and the GPU works perfectly.

tadam98 commented 3 years ago

I executed a very heavy GPU/ML workload in this environment and it works perfectly. My only problem is that Docker complains:

mickey@MICKEY-2080TI:~$ docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
tadam98 commented 3 years ago

I did, but it does not always work:

$ sudo service docker stop
$ sudo service docker start
$ docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 7.5 is undefined.  Default to use 64 Cores/SM
GPU Device 0: "GeForce RTX 2080 Ti" with compute capability 7.5

> Compute 7.5 CUDA device: [GeForce RTX 2080 Ti]
69632 bodies, total time for 10 iterations: 112.230 ms
= 432.026 billion interactions per second
= 8640.519 single-precision GFLOP/s at 20 flops per interaction

This solved it - works every time:

$ sudo service docker stop
$ sudo service docker start
$ sudo mkdir /sys/fs/cgroup/systemd
$ sudo mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
$ docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
$ docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter
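The mkdir/mount steps above fail if run a second time, since the cgroup hierarchy is then already mounted. A hedged sketch of an idempotent guard (the helper functions are my own; the path and mount options are exactly the ones from the commands above):

```python
import os

CGROUP_DIR = "/sys/fs/cgroup/systemd"

def mount_cmd(path=CGROUP_DIR):
    """Build the exact mount command from the workaround above."""
    return ["mount", "-t", "cgroup", "-o", "none,name=systemd", "cgroup", path]

def needs_mount(path=CGROUP_DIR):
    """Only mount when the systemd cgroup hierarchy isn't already there."""
    return not os.path.ismount(path)

if __name__ == "__main__":
    if needs_mount():
        # Print rather than execute: mounting requires root.
        print("would run: sudo " + " ".join(mount_cmd()))
    else:
        print(CGROUP_DIR + " already mounted")
```

Wrapping the workaround this way makes it safe to run from a shell profile on every WSL start.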


FluorineDog commented 3 years ago

Seems this issue has been tagged "fixinbound" for quite a while, but it is not even mentioned in the 20231 release notes (not even in the known issues section).

Could more attention be paid to this issue? This is the main blocker for me as a CUDA programmer.

lefnire commented 3 years ago

@FluorineDog sweet summer child, getting an MS response at all is an anomaly, and 3 days old is milliseconds (been waiting on https://github.com/microsoft/WSL/issues/4150 for 1yr). I anticipate staying in this rolled back build for 3-6mo. See this for disabling auto update/restart (though not sure if Dev Channel will auto update if you've rolled back). Anyway, definitely do the rollback soon, and don't twiddle your thumbs on a fix.

onomatopellan commented 3 years ago

@FluorineDog "fixinbound" means the fix will appear in an upcoming insider build in 2-4 weeks. Once you see the "fixedininsiderbuilds" tag, it means the bug is fixed in the latest insider build.

mcecchi commented 3 years ago

I hope so, @onomatopellan !

FluorineDog commented 3 years ago

@FluorineDog sweet summer child, getting an MS response at all is an anomaly, and 3 days old is milliseconds (been waiting on #4150 for 1yr). I anticipate staying in this rolled back build for 3-6mo. See this for disabling auto update/restart (though not sure if Dev Channel will auto update if you've rolled back). Anyway, definitely do the rollback soon, and don't twiddle your thumbs on a fix.

Builds from the recent past trigger another fatal bug, while older ones don't even deliver the needed functionality. That's why I'm checking this thread constantly. Once I can run my first CUDA program I'll gladly freeze my dev channel, but sadly not now.

michelemoretti commented 3 years ago

I have the same problem on 20226. My build also contains the same 8 files in lxss\lib, but I get cudaErrorDevicesUnavailable. Is there a way to roll back to 20221? Using "Go back to previous version of Windows 10" sends me to 19041.508.

Yes, you can install 20221 from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced

I'm trying to downgrade but can't find a way, in the provided link version 20221 is not in the multiselect at the bottom. any tips on how to downgrade?

onomatopellan commented 3 years ago

@michelemoretti They updated the latest official ISO to build 20231 only. You can still generate an x64 ISO of whichever build you like with sites like https://uup.rg-adguard.net or https://uupdump.ml/

tudor commented 3 years ago

+1, happens to me on 20231 as well.

zhangshengsheng commented 3 years ago

Same problem

basarane commented 3 years ago

> @michelemoretti They updated the latest official ISO to build 20231 only. You can still generate an x64 ISO of whichever build you like with sites like https://uup.rg-adguard.net or https://uupdump.ml/

I've installed both 20226 and 20231. Later I realized that CUDA failed on WSL2. I cannot revert to 20221, only 20226. Is it safe to install 20221 on top of 20226 from an ISO downloaded from these sites?

tadam98 commented 3 years ago

It used to be here: https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced But now 20221 is not there any more. I "paused" updates for 7 days, hoping that in the meantime Microsoft will fix the NVIDIA problem.

This is the link I used. Unfortunately I did not keep the image. https://software-download.microsoft.com/db/Windows10_InsiderPreview_Client_x64_en-us_20201.iso?t=316defb4-045b-4f87-82cb-e2e201cdca3a&e=1602073124&h=698ebdb68b3a19ab77b28256c9a826b2

manishkm commented 3 years ago

Updated to preview build 20231; it seems this issue is still not solved.

onomatopellan commented 3 years ago

@tadam98 That link won't work anymore.

@basarane I found a better place to download official insider build 20201 ISO in https://tb.32767.ga/get.php?id=1727 Just make sure you are logged on Windows Insider site before pressing Confirm button.

tadam98 commented 3 years ago

20221.1000 is the version you want. I have it installed and it's working well. It does suffer from the reported problem that WSL2 loses internet once in a while (and a reboot is needed), but nvidia-docker2 works well on it.

onomatopellan commented 3 years ago

Build 20201 should be a good build stop too. CUDA in WSL2 works since build 20145.
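
For anyone who wants to script this check before updating, here is a minimal sketch. It assumes `cmd.exe` is reachable from WSL's PATH (the default), and the 20145-20221 "known good" range is taken from reports in this thread, not from any official statement:

```python
import re
import subprocess

# Builds reported in this thread: CUDA in WSL2 has worked since 20145,
# and 20221 is the last build before the code=46 regression.
KNOWN_GOOD = (20145, 20221)

def parse_build(ver_output):
    """Extract the Windows build number from `cmd.exe /c ver` output,
    e.g. 'Microsoft Windows [Version 10.0.20221.1000]' -> 20221."""
    match = re.search(r"10\.0\.(\d+)", ver_output)
    if not match:
        raise ValueError("no build number found in: " + ver_output)
    return int(match.group(1))

def build_supports_cuda(build):
    """True if the build falls inside the range reported working here."""
    return KNOWN_GOOD[0] <= build <= KNOWN_GOOD[1]

if __name__ == "__main__":
    try:
        # From inside WSL, ask Windows for its version string.
        out = subprocess.run(["cmd.exe", "/c", "ver"],
                             capture_output=True, text=True).stdout
        build = parse_build(out)
        verdict = "expected to work" if build_supports_cuda(build) else "known regression"
        print(build, verdict)
    except (FileNotFoundError, ValueError):
        print("not running under WSL, or cmd.exe not on PATH")
```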

tadam98 commented 3 years ago

That's before my time :) All insider versions are here: https://tb.32767.ga/products.php?prod=win10ip

onomatopellan commented 3 years ago

@tadam98 That's server version only. Official client Insider ISOs are 20201 and 20231. See flight hub.

MoncefYabi commented 3 years ago

Same issue for me on Build 20231, WSL2 and GTX 1650:

root@LAPTOP:/usr/local/cuda/samples/4_Finance/BlackScholes# ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_CallResult, OPT_SZ)"

No NVIDIA GPU is listed after issuing the lspci command on Ubuntu 18.04 (only generic Microsoft 3D controller entries):

978b:00:00.0 3D controller: Microsoft Corporation Device 008e
ab50:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
abff:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
ae20:00:00.0 3D controller: Microsoft Corporation Device 008e
bf2d:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)

The TensorFlow function list_local_devices returns only CPU devices:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13820027289368611552
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11825873724132199309
physical_device_desc: "device: XLA_CPU device"
]
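
Note that under WSL2 the paravirtualized GPU never shows up under its NVIDIA name in lspci; it appears only as a generic "Microsoft Corporation" 3D controller, so its presence alone doesn't prove CUDA works. A small sketch (a hypothetical helper, parsing output like the above) to spot those entries:

```python
def find_vgpu_entries(lspci_output):
    """Return the lspci lines that look like WSL2's paravirtualized GPU.

    WSL2 exposes the GPU as a '3D controller: Microsoft Corporation'
    device rather than an NVIDIA device, so this only confirms the vGPU
    is plumbed through, not that CUDA can actually allocate on it.
    """
    return [line for line in lspci_output.splitlines()
            if "3D controller" in line and "Microsoft Corporation" in line]

# The lspci output quoted above:
sample = """978b:00:00.0 3D controller: Microsoft Corporation Device 008e
ab50:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
abff:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
ae20:00:00.0 3D controller: Microsoft Corporation Device 008e
bf2d:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)"""

print(len(find_vgpu_entries(sample)))  # -> 2
```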

Lokustae commented 3 years ago

I've been trying different solutions for three days, but there seems to be none apart from downgrading... I get the same error as the above posts with both cuda-toolkit-11-0 and cuda-toolkit-11-1 on my small NVIDIA GeForce MX150. The error constantly thrown (for the ./BlackScholes example given in the CUDA on WSL guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html) is: CUDA error at BlackScholes.cu:116 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_CallResult, OPT_SZ)". I bookmarked this page and am following updates daily!

OkuyanBoga commented 3 years ago

Same issue here. If you are on build 20231, I don't suggest downgrading, because it introduces a new error with user accounts. If you want to try, consider making a backup first.

Agrover112 commented 3 years ago

Same issue here on 20231; this needs to be fixed soon!

mitchellvitez commented 3 years ago

Piling on. When I run this Python code on build 20231 (WSL2, Ubuntu 20.04, RTX 2080 Super, NVIDIA driver 460.20) I get the "all CUDA-capable devices are busy or unavailable" error.

import torch
torch.rand(500,500,500).cuda()

Going back to build 20201 fixed this issue.

Agrover112 commented 3 years ago

@mitchellvitez I really don't think going back to 20201 is a safe move, considering the previous bugs in 20201 as well.

Agrover112 commented 3 years ago

I hope this is fixed soon enough

ccs96307 commented 3 years ago

I downgraded to 20221, and the WSL 2 distro I installed showed the error message "remote procedure call failed".

Agrover112 commented 3 years ago

@ccs96307 What can we say... phew. Solving one problem leads to another.

jpapon commented 3 years ago

Same problem on 20231. The CUDA samples error out, for example matrixMul:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
CUDA error at matrixMul.cu:130 code=46(cudaErrorDevicesUnavailable) "cudaMallocHost(&h_A, mem_size_A)"

tommywu052 commented 3 years ago

> Build 20201 should be a good build stop too. CUDA in WSL2 works since build 20145.

How can you work on build 20201, since 20201 gives the "remote procedure call failed" error when you start up WSL2?

tadam98 commented 3 years ago

20221 works well for me. WSL2 loses network once in a while, but I can live with it.

I paused the updates for 7 days.

Kab1r commented 3 years ago

Build 20231 doesn't work for me

onomatopellan commented 3 years ago

> How can you work on build 20201, since 20201 gives the "remote procedure call failed" error when you start up WSL2?

@tommywu052 Wasn't 20211 the build that had that issue (#5907)? Anyway, it's hard to name a build where everything works for everyone. In my case I never had that issue, nor this thread's issue.

tadam98 commented 3 years ago

I did not encounter #5907, maybe because I use Ubuntu 18.04.

tadam98 commented 3 years ago

I am successfully working with 20221.1000.

sciafri commented 3 years ago

I'm also having this problem on 20231 with WSL Ubuntu 20.04. To save anyone else the time: I tried installing Ubuntu 18.04 as a separate distribution, installed CUDA, rebuilt the samples from source, and attempted to run BlackScholes. It makes no difference; same error.

Since I just moved to the Dev Channel today for CUDA support, I don't have the luxury of rolling back. I've also been dealing with a CUDA problem on my Linux install, so I was hoping this would be worth the effort (guess not). Hope this is fixed soon... it seems installing CUDA is always an absolute crap experience.

wanfuse123 commented 3 years ago

I have the same bleeping problem running cuda

docker run --runtime=nvidia --rm -ti -v "${PWD}:/app" nricklin/ubuntu-gpu-test
modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.128-microsoft-standard/modules.dep.bin'
test.cu(29) : cudaSafeCall() Runtime API error : no CUDA-capable device is detected.
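
As a quick way to separate "GPU plumbing missing" from the code=46 allocation failure above, here is a hedged sanity-check sketch. The paths are taken from NVIDIA's CUDA-on-WSL guide for builds of this era (/dev/dxg for the paravirtualized device node, /usr/lib/wsl/lib for the Windows-mounted driver libraries) and may differ on other setups; note that in this thread /dev/dxg typically exists and allocations still fail, so this only rules out missing plumbing:

```python
import os

# Paths the WSL2 GPU stack used around these builds (an assumption based
# on the NVIDIA CUDA-on-WSL guide; adjust if your setup differs):
#   /dev/dxg         - paravirtualized GPU device node
#   /usr/lib/wsl/lib - driver libraries mounted from the Windows side
EXPECTED = ["/dev/dxg", "/usr/lib/wsl/lib"]

def missing_gpu_paths(present_paths, expected=EXPECTED):
    """Return which of the expected GPU passthrough paths are absent."""
    present = set(present_paths)
    return [p for p in expected if p not in present]

if __name__ == "__main__":
    present = [p for p in EXPECTED if os.path.exists(p)]
    missing = missing_gpu_paths(present)
    if missing:
        print("GPU passthrough looks broken; missing:", ", ".join(missing))
    else:
        print("GPU plumbing is present; the code=46 failure is higher up the stack.")
```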

wanfuse123 commented 3 years ago

I wonder if the module problem is causing the "no CUDA-capable device" error.

From what I understand you can't load kernel modules in the container; does MS provide the needed headers/modules in some other fashion? It seems to have been broken for me for at least a week (I didn't notice).

Can't roll back to 20221.1000 to test, unfortunately. Hope this is fixed soon...

blackliner commented 3 years ago

20231.1005 is available now; will it fix this problem? https://blogs.windows.com/windows-insider/2020/10/07/announcing-windows-10-insider-preview-build-20231/

According to the page it doesn't seem to fix anything.

kuseo commented 3 years ago

> 20231.1005 is available now; will it fix this problem? https://blogs.windows.com/windows-insider/2020/10/07/announcing-windows-10-insider-preview-build-20231/
>
> According to the page it doesn't seem to fix anything.

No, 20231.1005 still doesn't work.

OkuyanBoga commented 3 years ago

They mentioned this problem in NVIDIA's "CUDA on WSL" guide. Currently the only suggested solution is reverting to build 20221. Is there any safe and easy way, or at least a guide, to do it?

jamespacileo commented 3 years ago

Ok, so what is the safest way to revert to build 20221?

It's not available via the official methods: reverting to the previous build or through advanced options in the insiders panel.

Do we need to install through a third party ISO? If so which ones are safe?

Also, have we had any official comment from a WSL maintainer?

Thanks 👍

theothings commented 3 years ago

Is there any workaround since the build 20221 image is no longer available here?

Or any ideas when this will likely be fixed? 👍

Agrover112 commented 3 years ago

@theothings @jamespacileo You can find an (unofficial) link to the 20221 ISO in the comments of this thread, since 20221 no longer appears in the Windows Insider downloads menu (it only offers later versions):

https://forums.developer.nvidia.com/t/code-46-error-device-unreachable/156739

kuseo commented 3 years ago

> Is there any workaround since the build 20221 image is no longer available here?
>
> Or any ideas when this will likely be fixed? 👍

You can download and build an ISO of a previous Windows build, including 20221, at https://uupdump.ml

wanfuse123 commented 3 years ago

Won't installing 20221 via a boot-and-reinstall destroy the Ubuntu installations on WSL and all the NVIDIA and Docker configuration? I know you can reinstall and preserve applications, but are those included?