siderolabs / extensions

Talos Linux System Extensions

[nvidia-container-toolkit] Allow customizing nvidia-container-runtime.toml #399

Open nevivurn opened 1 month ago

nevivurn commented 1 month ago

It would be useful to be able to customize nvidia-container-runtime.toml without having to build new boot assets.

We are using NVIDIA GPUs in our cluster, and we want to prevent users from gaining access to every GPU on a node simply by setting NVIDIA_VISIBLE_DEVICES=all. Instead, GPU access should require proper resource requests and quotas.

NVIDIA does provide a way to do this, as documented here, by setting

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true

in nvidia-container-runtime.toml.
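
As a rough illustration of what the edit to the file amounts to, here is a minimal Python sketch that sets the two keys in the text of a runtime config. It assumes both keys are top-level TOML keys, which therefore have to appear before the first `[table]` header, and that they may already be present in commented-out form. The function name `apply_settings` is hypothetical, and this says nothing about how such a change would be delivered on Talos.

```python
import re

# The two settings quoted above, written as raw TOML values.
DESIRED = {
    "accept-nvidia-visible-devices-envvar-when-unprivileged": "false",
    "accept-nvidia-visible-devices-as-volume-mounts": "true",
}


def apply_settings(toml_text: str, settings: dict[str, str] = DESIRED) -> str:
    """Return toml_text with each top-level key set to the given raw value.

    Hypothetical helper: a line-based sketch, not a full TOML parser.
    """
    lines = toml_text.splitlines()
    # Top-level keys must come before the first [table] header.
    first_table = next(
        (i for i, line in enumerate(lines) if line.lstrip().startswith("[")),
        len(lines),
    )
    head, tail = lines[:first_table], lines[first_table:]
    remaining = dict(settings)
    for i, line in enumerate(head):
        # Match "key = ...", optionally commented out with a leading '#'.
        m = re.match(r"^\s*#?\s*([A-Za-z0-9_-]+)\s*=", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            head[i] = f"{key} = {remaining.pop(key)}"
    # Keys that were not found are appended, still ahead of the first table.
    head.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(head + tail) + "\n"
```

Running the function twice on the same text gives the same result as running it once, so it is safe to re-apply.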

Currently, there does not seem to be a way to do this without building the extensions and boot assets.

frezbo commented 3 weeks ago

If this has no user-visible change, we could even make this the default.

nevivurn commented 3 weeks ago

This does have a user-visible change. If a user runs a container that sets NVIDIA_VISIBLE_DEVICES=all and does not specify requests or limits, they would previously have had access to every GPU on the node, whereas with the above config they would see none.
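
The difference can be sketched with a deliberately simplified Python model. This is not the toolkit's actual logic. It assumes that when the environment variable is honored, `all` exposes every GPU on the node, and that when it is not honored, an unprivileged container only sees the devices allocated to it through resource requests. The function `visible_gpus` and its parameters are hypothetical names.

```python
def visible_gpus(node_gpus, env_value, allocated, *, privileged,
                 accept_envvar_when_unprivileged):
    """Toy model of which GPUs a container ends up seeing.

    node_gpus: all GPU identifiers on the node
    env_value: value of NVIDIA_VISIBLE_DEVICES inside the container, or None
    allocated: GPUs granted via resource requests (assumed to be passed as
               volume mounts when the env var is not honored)
    """
    if privileged or accept_envvar_when_unprivileged:
        # The environment variable is honored.
        if env_value == "all":
            return list(node_gpus)
        if not env_value:
            return []
        wanted = env_value.split(",")
        return [gpu for gpu in node_gpus if gpu in wanted]
    # Env var ignored for unprivileged containers: only allocated devices.
    return [gpu for gpu in node_gpus if gpu in allocated]


node = ["gpu0", "gpu1"]

# Env var honored: "all" with no resource request exposes every GPU.
before = visible_gpus(node, "all", [], privileged=False,
                      accept_envvar_when_unprivileged=True)

# Env var not honored: the same container sees no GPUs.
after = visible_gpus(node, "all", [], privileged=False,
                     accept_envvar_when_unprivileged=False)
```

Under this model, `before` contains both GPUs and `after` is empty, which matches the change in behavior described above.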