google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Running nvproxy containers without Docker #9435

Open · jseba opened this issue 1 year ago

jseba commented 1 year ago

Description

Opening this mostly to avoid spending too much time reverse engineering Docker, runc, nvidia-container-runtime and gVisor behaviors. 😄

Since we don't use Docker to run our containers, figuring out how runsc and the nvidia-container-runtime hooks interact is a bit of a challenge. I've read through the nvproxy sandbox setup code, which got the basic nvidia-smi tool working with our runtime after brute-forcing the OCI config and the --nvproxy and --nvproxy-docker flags, but I've also managed to busy-loop the gofer at one point somehow (server load over 3000) while just trying to start one of our inferencing workloads.

Would there be any resources you would be willing to share about how this works under the hood? All the documentation I can find is mostly just "how to configure Docker/Kubernetes to expose GPUs", which doesn't go into those details.

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

ayushr2 commented 1 year ago

The --nvproxy-docker flag's semantics are a bit confusing... TL;DR: it is sometimes needed in non-Docker environments too. Here is a brief summary of what's going on.

The NVIDIA GPU container stack, which is mostly packaged as nvidia-container-toolkit, is composed of the following components that are relevant to us:

  1. NVIDIA requires using nvidia-runtime instead of runc (or runsc, or whatever) when running CUDA containers. This OCI-compatible runtime acts as a shim and modifies the container spec based on various settings: in legacy mode, it adds the nvidia-hook as a prestart hook; in csv mode, devices and mounts are injected into the container spec; in auto mode, the runtime uses heuristics to determine which of those modes to apply.

  2. After modifying the spec, nvidia-runtime invokes the lower-level container runtime (like runc or runsc). The lower-level runtime to use can be configured via the runtimes option in /etc/nvidia-container-runtime/config.toml (see the sketch just below this list); it naturally defaults to runc.

  3. When nvidia-hook is invoked, it calls nvidia-cli, which configures the container filesystem with the devices and libraries needed by the GPU application.
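
For reference, those options live together in that config file. A rough sketch of what to expect there (the exact contents vary by nvidia-container-toolkit version, so check your own file):

$ cat /etc/nvidia-container-runtime/config.toml
[nvidia-container-runtime]
  mode = "auto"          # "legacy", "csv", or "auto"
  runtimes = ["runc"]    # low-level runtime(s) that nvidia-runtime will exec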

Now, about Docker: note that Docker also inserts the nvidia-hook when docker run --gpus is used, so Docker can be used without nvidia-runtime.

Here's the problem:
In gVisor, we unconditionally skip the nvidia-hook (irrespective of whether it was set by Docker or nvidia-runtime). When --nvproxy-docker is enabled, we unconditionally emulate the nvidia-hook (irrespective of whether it was present in the OCI spec in the first place).

So when you are not using Docker, you will still want to use the --nvproxy-docker flag if you want to go the nvidia-hook route. The other route is the CSV mode mentioned above: invoke nvidia-runtime (instead of runsc directly) and configure it to call runsc (as described here). Don't specify --nvproxy-docker for runsc in that case.
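
To make the two routes concrete, here is a rough sketch (the bundle path and container name are placeholders, and the config keys are the ones mentioned above; merge them into your existing config.toml rather than replacing it wholesale):

# Route 1: nvidia-hook emulation. Invoke runsc directly with both flags set.
$ runsc --nvproxy=true --nvproxy-docker=true run --bundle /path/to/bundle my-container

# Route 2: CSV mode. Configure nvidia-runtime to exec runsc and invoke
# nvidia-runtime instead. runsc still needs --nvproxy=true (e.g. via a wrapper
# script or your container manager's runtime options), but not --nvproxy-docker.
[nvidia-container-runtime]
  mode = "csv"
  runtimes = ["runsc"]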

I am looking to fix this issue soon by deprecating the --nvproxy-docker flag and having --nvproxy work in all environments.

PedroRibeiro95 commented 11 months ago

@ayushr2 to piggyback on your comment: I am trying to run gVisor + NVIDIA in Kubernetes (using containerd). I've somewhat got it running, but I'm hitting a few issues that you might be able to help me with. I've configured gVisor with just nvproxy=true and then mounted the NVIDIA driver install location from the host node into the container. If I create a pod with an nvidia/cuda image using the gVisor runtime class and the above configuration, I am able to run nvidia-smi and access the GPU under gVisor. However, with any other image, I cannot interact with the GPU:

$ export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
$ export PATH=/usr/local/nvidia/bin:$PATH
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I've also tried adding the nvproxy-docker=true config to gVisor per your comment, and though the library files/binaries seem to be loaded correctly, I still get the exact same error. Do you have any clue on what I might be missing here? Many thanks!
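
For reference, the containerd side of my setup looks roughly like this (a sketch assuming the ConfigPath/runsc.toml mechanism from the gVisor containerd docs; paths may differ on your cluster):

# /etc/containerd/runsc.toml, referenced from the runsc runtime handler's options
[runsc_config]
  nvproxy = "true"

$ sudo systemctl restart containerd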

ayushr2 commented 11 months ago

Hi @PedroRibeiro95, hmm, I don't think --nvproxy-docker will help in this case. That flag makes runsc emulate running the nvidia-hook as a prestart hook, and I don't think the container is provisioned with GPU access via a prestart hook in a k8s environment. Let's move this conversation back to #9368.

markusthoemmes commented 3 months ago

I'm poking at this problem as well! I'm aware that gVisor doesn't support the k8s-device-plugin currently, but I've been wondering if it's possible in the current state to just get access to all the GPUs on a host (similarly to how running with nvidia-container-runtime would).

I've been poking at https://gvisor.dev/docs/user_guide/gpu/ a bit, but it's unclear how I'd make nvidia-container-runtime provide the runsc flag. The runtimes config for the NVIDIA runtime seems to just take executables, so runsc would be the natural configuration there. That would, though, also mean that the runtime is unaware of whatever settings I'd provide via containerd (where I could enable nvproxy).

@ayushr2 am I missing/confusing something here? Is this setup doomed to fail from the get-go?

ayushr2 commented 3 months ago

but it's unclear how I'd make nvidia-container-runtime provide the runsc flag. The runtimes config for the NVIDIA runtime seems to just take executables, so runsc would be the natural configuration there.

We hit the same issue with Podman, which similarly takes an executable path. You can see how we solve that problem in our Podman tests: https://github.com/google/gvisor/blob/74b82d9a30628b2d98eb9c6aa33f669a87a4fde4/test/podman/run.sh#L31C10-L35

We create a bash executable which execs runsc with the flags we want.
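
A minimal sketch of such a wrapper (the install path and flags are just examples):

#!/bin/bash
# Tiny wrapper that nvidia-container-runtime can use as its "runtime" executable,
# since the runtimes option only accepts plain executable paths. It just execs
# runsc with the gVisor flags we want.
exec /usr/local/bin/runsc --nvproxy=true "$@"

Make it executable, install it somewhere stable (e.g. /usr/local/bin/runsc-gpu), and list that path in the runtimes option of /etc/nvidia-container-runtime/config.toml.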

RE: Using nvidia-container-runtime in a k8s environment: I haven't tried this. IIUC, nvidia-container-runtime and k8s-device-plugin are aiming to do the same thing: 1) expose GPUs to the container, and 2) prepare the container filesystem with NVIDIA libraries. But their mechanisms for doing these are different. I haven't studied k8s-device-plugin, but https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu does the following:

  1. Exposes the GPU devices via the spec.Linux.Devices list.
  2. Bind mounts the host directory /home/kubernetes/bin/nvidia to the container path /usr/local/nvidia.

nvidia-container-runtime, on the other hand, directly populates the container rootfs directory: it creates bind mounts of the GPU devices at the container's /dev/nvidia* paths. It doesn't modify the OCI spec, AFAICT.
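
If it's unclear which mechanism provisioned a given container, a quick check from inside it (the paths are the ones mentioned above):

$ ls -l /dev/nvidia*        # GPU device nodes (present with either mechanism)
$ ls /usr/local/nvidia      # driver files, if the device-plugin style bind mount was used
$ mount | grep -i nvidia    # bind mounts set up by nvidia-cli or the device plugin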

markusthoemmes commented 3 months ago

🤦 I could've guessed that! Thanks for calling that out and adding it to the documentation. Unfortunately, things don't "just" work and I'm a little at a loss as to where to start poking, tbh.

I've added runsc-gpu (a script as proposed) to the nvidia-container-toolkit config and overridden its mode from auto to legacy. The resulting container gets stuck in ContainerCreating. The only material issues I was able to find in the debug logs (attached) are dev gofer client not found in context errors. One of the gofers also failed with sock read failed, closing connection: EOF.

Disclaimer: this is running against an A30 GPU, which is not officially supported. I was hoping that "not supported" means "not tested, but might work" 😅. If that's not the case, I'd be happy to help enable support for A30 GPUs as well.

debug.zip

ayushr2 commented 3 months ago

Yeah, that is strange. If you look at the boot logs, the StartSubcontainer RPC has the device gofer configuration set (search for IsDevIoFilePresent:true in the logs). The gofer logs also have the following line:

I0611 07:54:06.770170       1 gofer.go:329] Serving /dev mapped on FD 14 (ro: false)

Which confirms that the gofer is serving the device connection.
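
For reference, the greps used here were roughly the following (run against the unpacked debug.zip; adjust the path to wherever you extracted it):

$ grep -R "IsDevIoFilePresent" ./debug/            # boot logs: device gofer config in StartSubcontainer
$ grep -R "Serving /dev" ./debug/                  # gofer logs: device connection being served
$ grep -R "dev gofer client not found" ./debug/    # boot logs: the failing path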

The error, like you pointed out, is failed to create device files: dev gofer client not found in context. It is coming from here: https://github.com/google/gvisor/blob/2c5c7869d9ad1d4f4c5d1f3510c1a57a3baecc02/runsc/boot/vfs.go#L1323-L1326

Which means that the dev gofer connection was not added to the kernel via Kernel.AddDevGofer(). However, from inspecting the code, it seems like the connection should be added if IsDevIoFilePresent = true over here: https://github.com/google/gvisor/blob/2c5c7869d9ad1d4f4c5d1f3510c1a57a3baecc02/runsc/boot/vfs.go#L783-L789

containerMounter.prepareMounts() is called before vfs.createDeviceFiles(). Maybe add some debugging logs in Kernel.AddDevGofer(), Kernel.RemoveDevGofer(), and Kernel.GetDevGoferClient() to see what's going on. Maybe we are messing up the container name?