Open achristianson opened 4 months ago
You probably need to open an issue with NVIDIA gpu-operator, as the problem seems to be in the driver detection phase. As Talos Linux has the drivers installed correctly, I'm not sure what else we can do.
The problem is they consider talos to be an unsupported platform, so they won't put resources into making gpu-operator work with talos.
Something we could do in talos is make sure the right files are in the right place such that gpu-operator just works (which may be a fairly trivial change in talos).
Otherwise, the current GPU support in talos is limited to more basic GPU workloads which do not need multi-GPU scheduling, topology optimization, mixed GPU features, time slicing, etc.
It runs in a container, so unless it tries to escape to the PID 1 user namespace, you can pre-create any files it might need in the init container (assuming /run is within the container, not a host mount).
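For illustration, a minimal sketch of that idea, assuming the validator only needs some marker file under /run/nvidia — the exact file and path below are hypothetical and only show the init-container pattern:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: precreate-example
spec:
  initContainers:
    - name: precreate-driver-files
      image: busybox:1.36
      # Hypothetical marker file; the real file(s) gpu-operator expects would
      # need to be confirmed from its validator code or logs.
      command: ["sh", "-c", "mkdir -p /run/nvidia/validations && touch /run/nvidia/validations/driver-ready"]
      volumeMounts:
        - name: run-nvidia
          mountPath: /run/nvidia
  containers:
    - name: workload
      image: busybox:1.36
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: run-nvidia
          mountPath: /run/nvidia
  volumes:
    - name: run-nvidia
      emptyDir: {}
```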
But I believe people are using all the advanced GPU workloads with Talos without any issues (I'm not sure what exactly the path is there).
I believe you're right that it can be done with enough init containers and custom talos configs. Given gpu-operator is one of the primary ways people enable GPU workloads in clusters with features beyond the drivers and toolkit, it would be nice if it just worked in talos.
This bug report/feature request should probably be renamed/reconsidered to be: "put the drivers in a standard location where they're expected to exist, especially by official nvidia k8s components."
We'll be looking into using pre-compiled drivers (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html) to solve the issue long-term. As the docs state, we'll still need to support the existing method, since some driver versions and other features are not supported by pre-compiled drivers.
I believe this should be addressed in the next release of the GPU operator. See https://github.com/NVIDIA/gpu-operator/pull/747 and a few other related merged PRs. They have added variables to define custom driver paths.
@achristianson Until they release the new version, look into https://github.com/NVIDIA/k8s-device-plugin. It should address the multi-GPU and time-slicing features you wanted while you wait.
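For example, time slicing with the standalone device plugin is driven by a small config file, roughly like this (a sketch following the plugin's documented format; the ConfigMap name, namespace, and replica count here are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable GPUs
```

The plugin's Helm chart can then be pointed at a ConfigMap like this; see the plugin's README for the exact chart values.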
@Hexoplon thank you. I think that k8s-device-plugin is exactly what I wanted. It is installed as part of gpu-operator, but it can run without it and gives me everything I need. In fact, this is even better because it doesn't have other unnecessary processes running.
This issue can be closed as far as I'm concerned, but I'll leave that up to the project maintainers depending on the approach they'd like to take with talos.
Bug Report
Preface
There is an existing issue #8402 regarding gpu-operator compatibility. That one was closed because talos is designed to be immutable and has proprietary nvidia drivers available.

We're running our cluster using those drivers, but there are some things we're still missing which gpu-operator provides, namely the ability to track the number of GPUs on a device and make them available for requests, e.g. a workload requests 2 GPUs. Also, with gpu-operator installed, workloads requesting multiple GPUs will be allocated GPUs which are more directly connected, e.g. via NVLink or a PCIe switch.

We believe the gpu-operator would work fine on talos if it could locate the talos-provided drivers. One could argue this bug report should be on the gpu-operator project; however, NVIDIA is quick to dismiss compatibility with any unsupported platforms, so instead we would need to find a way to make talos work with gpu-operator somehow.
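To make the multi-GPU request concrete, this is the kind of pod spec we mean (illustrative only; the image tag and names are not from our cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-example
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2   # request two GPUs, ideally ones linked via NVLink or a shared PCIe switch
```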
Description
The gpu-operator by NVIDIA is unable to find drivers installed using the instructions at https://www.talos.dev/v1.7/talos-guides/configuration/nvidia-gpu-proprietary/. Specifically, we're using the prebuilt talos extensions for the nvidia drivers and the nvidia container toolkit.
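The extensions are pulled in via an Image Factory schematic along these lines (a sketch; the exact extension names differ between Talos releases, so the guide above is authoritative):

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```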
We're able to run a test nvidia-smi workload:
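The test workload is essentially a pod that runs nvidia-smi under the nvidia runtime class from the Talos guide; a minimal sketch (image tag illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
```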
So we know nvidia drivers are installed and working. But when we install the NVIDIA gpu-operator, we get multiple pods stuck in the Init state.
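For reference, the operator install is roughly along these lines (a sketch, not our exact command; disabling the operator-managed driver and toolkit is the usual approach when both are already provided, as they are by the Talos extensions):

```sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```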
It seems that the main hangup is the nvidia-operator-validator. We can see this in the logs for its driver-validation container.
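Those logs can be pulled with something like this (pod name is illustrative):

```sh
kubectl -n gpu-operator logs nvidia-operator-validator-<pod-id> -c driver-validation
```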
Environment
- Kubernetes version: [kubectl version --short]
- Platform: libvirt/qemu