jseba opened 1 year ago
The `--nvproxy-docker` flag's semantics are a bit confusing... TL;DR: it is sometimes needed in non-Docker environments too. Here is a brief summary of what's going on.
The NVIDIA GPU Container Stack, which is mostly packaged in nvidia-container-toolkit, is composed of the following components that are relevant to us:

- `nvidia-runtime`
- `nvidia-hook`
- `nvidia-cli`

NVIDIA requires using `nvidia-runtime` instead of runc (or runsc, or whatever) when running CUDA containers. This OCI-compatible runtime acts as a shim and modifies the container spec based on various settings. In legacy mode, it adds `nvidia-hook` as a prestart hook. In csv mode, devices and mounts are injected into the container spec directly. In auto mode, the runtime employs heuristics to determine which of those modes to use.
After modifying the spec, `nvidia-runtime` invokes the lower-level container runtime (like runc or runsc). The lower-level runtime to use can be configured using the `runtimes` option in `/etc/nvidia-container-runtime/config.toml`. It defaults to runc, naturally.
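For reference, a minimal sketch of what that configuration looks like (an excerpt; exact defaults may vary across toolkit versions):

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-runtime]
# "auto", "legacy", or "csv"
mode = "auto"

# Candidate lower-level runtimes, tried in order.
runtimes = ["runc"]
```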
When `nvidia-hook` is invoked, it calls `nvidia-cli`, which configures the container filesystem with the devices and libraries needed by the GPU application.
Now about Docker... Note that Docker also inserts `nvidia-hook` when using `docker run --gpus`. So Docker can be used without `nvidia-runtime`.
Here's the problem: in gVisor, we unconditionally skip the `nvidia-hook` (irrespective of whether it was set by Docker or `nvidia-runtime`). When `--nvproxy-docker` is enabled, we unconditionally emulate the `nvidia-hook` (irrespective of whether it was present in the OCI spec in the first place).
So even when you are not using Docker, you will still want the `--nvproxy-docker` flag if you want to go the `nvidia-hook` route. The other route is the csv mode mentioned above: use `nvidia-runtime` (instead of runsc directly) and configure it to call runsc (as described here). Don't specify `--nvproxy-docker` for runsc in that case.
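Concretely, the csv route would be configured along these lines (a sketch; key names may differ between toolkit versions):

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-runtime]
mode = "csv"
# Have nvidia-runtime invoke runsc instead of runc. Do NOT pass
# --nvproxy-docker to runsc in this setup.
runtimes = ["runsc"]
```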
I am looking to fix this issue soon by deprecating the `--nvproxy-docker` flag and having `--nvproxy` work in all environments.
@ayushr2 to piggyback on your comment: I am trying to run gVisor + NVIDIA in Kubernetes (using containerd). I've somewhat got it running, but I'm hitting a few issues that you might be able to help me with. I've configured gVisor to use just `nvproxy=true` and then mount the NVIDIA driver install location from the host node into the container. If I create a pod with a `nvidia/cuda` container using the gVisor runtime class and the above configuration, I am able to use `nvidia-smi` and access the GPU under gVisor. However, with any other image, I cannot interact with the GPU:
```
$ export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
$ export PATH=/usr/local/nvidia/bin:$PATH
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
I've also tried adding the `nvproxy-docker=true` config to gVisor per your comment, and though the library files/binaries seem to be loaded correctly, I still get the exact same error. Do you have any clue what I might be missing here? Many thanks!
Hi @PedroRibeiro95, hmm, I don't think `--nvproxy-docker` will help in this case. That flag makes runsc emulate the `nvidia-hook` prestart hook, and I don't think the container is provisioned with GPU access via the prestart hook in a k8s environment. Let's move this conversation back to #9368.
I'm poking at this problem as well! I'm aware that gVisor doesn't support the k8s-device-plugin currently, but I've been wondering if it's possible in the current state to just get access to all the GPUs on a host (similarly to how running with `nvidia-container-runtime` would).
I've been poking at https://gvisor.dev/docs/user_guide/gpu/ a bit, but it's unclear how I'd make `nvidia-container-runtime` pass the flag to `runsc`. The `runtimes` config for the NVIDIA runtime seems to just take executables, so `runsc` would be the natural configuration there. That would, however, also mean that the runtime is unaware of whatever settings I'd provide via containerd (where I could enable `nvproxy`).
@ayushr2 am I missing/confusing something here? Is this setup doomed to fail from the get-go?
> but it's unclear how I'd make nvidia-container-runtime provide the runsc flag. The runtimes config for the NVIDIA runtime seems to just take executables, so runsc would be the natural configuration there.
We hit the same issue with Podman, which similarly takes an executable path. You can see how we solve that problem in our Podman tests: https://github.com/google/gvisor/blob/74b82d9a30628b2d98eb9c6aa33f669a87a4fde4/test/podman/run.sh#L31C10-L35. We create a bash executable which `exec`s runsc with the flags we want.
RE: using `nvidia-container-runtime` in a k8s environment: I haven't tried this. IIUC, `nvidia-container-runtime` and `k8s-device-plugin` aim to do the same two things: 1) expose GPUs to the container, and 2) prepare the container filesystem with NVIDIA libraries. But their mechanisms for doing so are different. I haven't studied `k8s-device-plugin`, but https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu does the following:
- Adds the GPU devices to the `spec.Linux.Devices` list.
- Mounts host path `/home/kubernetes/bin/nvidia` to container path `/usr/local/nvidia`.

`nvidia-container-runtime`, on the other hand, directly populates the container rootfs directory. It creates bind mounts of the GPU devices into the container's `/dev/nvidia*` paths. It doesn't modify the OCI spec AFAICT.
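To illustrate the device-plugin route, the resulting OCI spec changes would look roughly like this `config.json` fragment (the device nodes shown are illustrative for a single GPU; NVIDIA character devices use major number 195):

```json
{
  "mounts": [
    {
      "source": "/home/kubernetes/bin/nvidia",
      "destination": "/usr/local/nvidia",
      "type": "bind",
      "options": ["rbind", "ro"]
    }
  ],
  "linux": {
    "devices": [
      { "path": "/dev/nvidia0", "type": "c", "major": 195, "minor": 0 },
      { "path": "/dev/nvidiactl", "type": "c", "major": 195, "minor": 255 }
    ]
  }
}
```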
🤦 I could've guessed that! Thanks for calling that out and adding it to the documentation. Unfortunately, things don't "just" work and I'm a little at a loss as to where to start poking, tbh.
I've added `runsc-gpu` (a script as proposed) to the nvidia-container-toolkit config, and I've overridden its mode from `auto` to `legacy`. The resulting container gets stuck in `ContainerCreating`. The only material issue I was able to find in the debug logs (see attached) are `dev gofer client not found in context` errors. One of the gofers also failed with `sock read failed, closing connection: EOF`.
Disclaimer: This is running against an A30 GPU, which is not officially supported. I was hoping that "not supported" means "not tested, but might work" 😅 . If that's not the case, I'd be happy to help to enable support for A30 GPUs as well.
Yeah, that is strange. If you look at the boot logs, the StartSubcontainer RPC has the device gofer configuration set (search for `IsDevIoFilePresent:true` in the logs). The gofer logs also have the following line:

`I0611 07:54:06.770170 1 gofer.go:329] Serving /dev mapped on FD 14 (ro: false)`

which confirms that the gofer is serving the device connection.
The error, like you pointed out, is `failed to create device files: dev gofer client not found in context`. It is coming from here: https://github.com/google/gvisor/blob/2c5c7869d9ad1d4f4c5d1f3510c1a57a3baecc02/runsc/boot/vfs.go#L1323-L1326. That means the dev gofer connection was not added to the kernel via `Kernel.AddDevGofer()`. However, from inspecting the code, it seems like the connection should be added when `IsDevIoFilePresent = true` over here: https://github.com/google/gvisor/blob/2c5c7869d9ad1d4f4c5d1f3510c1a57a3baecc02/runsc/boot/vfs.go#L783-L789, and `containerMounter.prepareMounts()` is called before `vfs.createDeviceFiles()`. Maybe add some debug logging in `Kernel.AddDevGofer()`, `Kernel.RemoveDevGofer()`, and `Kernel.GetDevGoferClient()` to see what's going on. Maybe we are messing up the container name?
Description
Opening this mostly to avoid spending too much time reverse engineering Docker, runc, nvidia-container-runtime, and gVisor behaviors. 😄

Since we don't use Docker to run our containers, figuring out how runsc and the nvidia-container-runtime hooks interact is a bit of a challenge. I've read through the nvproxy sandbox setup code, which got the basic `nvidia-smi` tool working with our runtime by brute-forcing the OCI config and the `--nvproxy` and `--nvproxy-docker` flags, but I've also managed to busy-loop the gofer at one point somehow (server load over 3000) while just trying to start one of our inferencing workloads.

Would there be any resources you would be willing to share about how this works under the hood? All the documentation I can find is mostly just "how to configure Docker/Kubernetes to expose GPUs", which doesn't go into those details.
Is this feature related to a specific bug?
No response
Do you have a specific solution in mind?
No response