microsoft / vscode-remote-release

Visual Studio Code Remote Development: Open any folder in WSL, in a Docker container, or on a remote machine using SSH and take advantage of VS Code's full feature set.
https://aka.ms/vscode-remote

hostRequirements: gpu: optional is broken on Windows 11 and 10 #9385

Open sarphiv opened 7 months ago

sarphiv commented 7 months ago

Does this issue occur when you try this locally?: Yes
Does this issue occur when you try this locally and all extensions are disabled?: Yes

This issue is a continuation of #9220, which appears to have regressed recently. Read the previous issue for more context.

Steps to Reproduce:

  1. Set up Docker to support CUDA containers according to NVIDIA's official instructions
  2. Create a devcontainer.json with "hostRequirements": { "gpu": "optional" } (a minimal sketch follows this list)
  3. Open a devcontainer that is supposed to support CUDA with the above config
  4. Check for CUDA support in PyTorch, or by running nvidia-smi
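
For reference, a minimal devcontainer.json sketch for step 2 (the image below is only an example):

```jsonc
{
    "name": "gpu-optional-repro",
    "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
    // Should grant GPU access when a GPU is available, and degrade gracefully otherwise
    "hostRequirements": {
        "gpu": "optional"
    }
}
```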

On Fedora 38 (Linux) the above works: the container has access to the GPU. On Windows 11 + WSL2 it does not. Troubleshooting steps are described in #9220.

Adding "runArgs": [ "--gpus", "all" ] to devcontainer.json makes Windows 11 + WSL2 work. However, using the runArgs trick breaks the devcontainer for machines without GPUs (confirmed on Windows 11, macOS, and Linux Fedora).

As a temporary workaround, we are therefore currently maintaining two files: .devcontainer/gpu/devcontainer.json and .devcontainer/cpu/devcontainer.json.
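
The layout is roughly the following; the Dev Containers extension then prompts which of the two configurations to open:

```
.devcontainer/
├── cpu/devcontainer.json   # no "runArgs"; works on GPU-less machines
└── gpu/devcontainer.json   # adds "runArgs": [ "--gpus", "all" ]
```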

chrmarti commented 7 months ago

What do you get for running `docker info -f '{{.Runtimes.nvidia}}'` on the command line?
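
(If the nvidia runtime is registered with the Docker daemon, that prints the runtime's configuration; otherwise it prints `<no value>`. Roughly, with the exact output format varying by Docker version:)

```console
$ docker info -f '{{.Runtimes.nvidia}}'
# registered:     prints a value containing the nvidia-container-runtime path
# not registered: prints <no value>
```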

sarphiv commented 7 months ago

> What do you get for running `docker info -f '{{.Runtimes.nvidia}}'` on the command line?

@chrmarti The team member who experienced the issues on Windows 11 + WSL2 is currently on leave.

However, I found a Windows 10 machine with a GPU that has never had anything Docker- or NVIDIA-container-related installed on it. I installed Docker Desktop with WSL2 support, and oddly enough GPU passthrough appears to be supported by default, so I did nothing further.

Anyway, I ran your command and it gave:

```console
> docker info -f '{{.Runtimes.nvidia}}'
'<no value>'
```

I guess your suspicion from the previous issue was correct.

To ensure that this machine was also affected by the bug, I created a folder with the following contents. Note that I just took some existing files and started deleting things, so there are probably some unrelated lines in the following:

`.devcontainer/devcontainer.json`

```jsonc
{
    "name": "Dockerfile devcontainer gpu",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile"
    },
    "workspaceFolder": "/workspace",
    "workspaceMount": "source=.,target=/workspace,type=bind",
    "hostRequirements": {
        "gpu": "optional"
    },
    "runArgs": [
        "--shm-size=4gb",
        "--gpus=all"
    ]
}
```
`.devcontainer/Dockerfile`
```dockerfile
# Setup environment basics
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

# Install packages
RUN apt update -y \
    && apt install -y sudo \
    && apt clean

# Set up user
ARG USERNAME=user
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME

USER $USERNAME

# Set up working directory
WORKDIR /workspace

# Set up environment variables
ENV PYTHONUNBUFFERED=True
```

Then I rebuilt and reopened the folder in a devcontainer via VS Code, and ran the following command to confirm I had access to a GPU (I also separately ensured PyTorch had access to CUDA acceleration):

```console
> nvidia-smi
```

Everything worked perfectly. Afterwards, I commented out the "runArgs" key from the devcontainer.json file and repeated the above. This time nvidia-smi did not work and PyTorch had no CUDA acceleration.
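
For reference, the PyTorch check was a one-liner along these lines; it printed True with "runArgs" present and False after commenting it out:

```console
> python -c "import torch; print(torch.cuda.is_available())"
True
```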

chrmarti commented 7 months ago

Great, what do you get for `docker info -f '{{json .}}'` on that machine? Thanks.

sarphiv commented 7 months ago

I'm assuming you meant `docker info -f json`, because the other command fails. Here's the output.json. Sadly, I don't see any GPU or NVIDIA references.

I also checked `docker info -f '{{.Runtimes.nvidia}}'` on Linux Fedora. Its output contains the string "nvidia-container-runtime", so I guess that's why it works on Linux. I then checked `docker info -f json` on Linux too, and it does contain the nvidia runtime, so I guess Windows is being weird.

chrmarti commented 7 months ago

We could add a machine-scoped setting to tell us whether a GPU is present, absent, or (the default, as today) should be auto-detected. That would give users a good out-of-the-box experience where the detection works, others could use the setting, and we could gradually improve the detection where possible.
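
Sketching the idea in a machine-scoped settings.json (the setting name and values below are hypothetical, nothing is implemented yet):

```jsonc
{
    // Hypothetical setting; name and values are not final
    "dev.containers.gpuAvailability": "detect" // or "all" / "none"
}
```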

sidecus commented 5 months ago

I am running into the same issue on my Windows machine. nvidia-smi -L correctly returns the GPU info. docker info doesn't return anything related to the GPU.

Shall we use nvidia-smi to detect the NVIDIA GPU instead?

sarphiv commented 5 months ago

> I am running into the same issue on my Windows machine. nvidia-smi -L correctly returns the GPU info. docker info doesn't return anything related to the GPU.
>
> Shall we use nvidia-smi to detect the NVIDIA GPU instead?

If we only used nvidia-smi, detection could falsely report support on Linux, where you may have the NVIDIA drivers (nvidia-smi works) but not the NVIDIA Container Runtime (no GPU inside containers).
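
A more robust detection would probably need to check both; a sketch, not what the extension actually runs:

```bash
# Sketch of a two-step check: driver present AND container runtime registered.
# Note: this would still miss Docker Desktop on Windows/WSL2, where passthrough
# works even though no nvidia runtime is registered with the daemon.
if nvidia-smi -L > /dev/null 2>&1 \
   && docker info -f '{{.Runtimes.nvidia}}' 2>/dev/null | grep -qv '<no value>'; then
    echo "GPU and container runtime available"
else
    echo "no usable GPU for containers"
fi
```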

sangotaro commented 3 months ago

@chrmarti I am using an Ubuntu 22.04 machine with an NVIDIA GPU (non-WSL), but `hostRequirements: gpu: optional` is not working. The output of `docker info -f '{{.Runtimes.nvidia}}'` is `<no value>`, indicating that I am experiencing the same issue described here. The output of docker info is as follows:

docker-info.json
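
(For reference, on a non-WSL Linux host the nvidia runtime is normally registered with the Docker daemon by the NVIDIA Container Toolkit; assuming the toolkit is installed, roughly:)

```console
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
$ docker info -f '{{.Runtimes.nvidia}}'   # should no longer print <no value>
```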

pascal456 commented 1 month ago

Stumbled upon this again in the last few days, after having had a working solution from #9220 in January.

I'm now working on a Windows workstation and cannot get a Dev Container running via WSL with GPU support.

What about the interim solution of a machine-specific configuration, which @chrmarti mentioned above?

chrmarti commented 1 month ago

I agree that whether an SSH server machine can use its GPU in a Docker container should be a setting on that machine. It doesn't belong on the local machine.

One difficulty with the machine setting is that when connecting through an SSH server (or Tunnel), we can't access its machine settings through VS Code's API, because that API only knows the local settings and the dev container settings (which it calls "machine settings"). We can, however, check for and read the machine settings.json from the extension itself. /cc @sandy081
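
(For reference, the machine-scoped settings file on a Remote-SSH host lives under the server's data folder, e.g. for the stable VS Code Server:)

```console
$ cat ~/.vscode-server/data/Machine/settings.json
```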

RaphaelMelanconAtBentley commented 6 days ago

Here is my hacky fix for docker compose in the meantime :) https://github.com/microsoft/vscode-remote-release/issues/10124#issuecomment-2304669818