siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

NVIDIA GPU Operator cannot find talos nvidia drivers #9014

Open achristianson opened 1 month ago

achristianson commented 1 month ago

Bug Report

Preface

There is an existing issue #8402 regarding gpu-operator compatibility. That one was closed because talos is designed to be immutable and has proprietary nvidia drivers available.

We're running our cluster using those drivers, but there are still some things we're missing which gpu-operator provides, namely the ability to track the number of GPUs on a node and make them available for requests, e.g. a workload requesting 2 GPUs. Also, with gpu-operator installed, workloads requesting multiple GPUs will be allocated GPUs that are more directly connected, e.g. via NVLink or a PCIe switch.
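For concreteness, the kind of request we mean is a pod asking for more than one GPU via the nvidia.com/gpu extended resource. A minimal sketch (the pod name and image are just illustrative, and this assumes something on the node is advertising nvidia.com/gpu):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-test           # illustrative name
spec:
  runtimeClassName: nvidia       # runtime class from the talos nvidia guide
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2      # ask the scheduler for two GPUs
EOF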

We believe the gpu-operator would work fine on talos if it could locate the talos-provided drivers. One could argue this bug report belongs on the gpu-operator project; however, NVIDIA is quick to dismiss compatibility with unsupported platforms, so we would instead need to find a way to make talos work with gpu-operator somehow.

Description

The gpu-operator by NVIDIA is unable to find drivers installed using instructions at https://www.talos.dev/v1.7/talos-guides/configuration/nvidia-gpu-proprietary/.

Specifically, we're using the prebuilt talos extensions for the drivers and the NVIDIA container toolkit.
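(For context, that guide also has us register the nvidia RuntimeClass that the test workload below refers to; roughly:)

kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF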

We're able to run a test nvidia-smi workload:

$ kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  nvidia-smi
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nvidia-test" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nvidia-test" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nvidia-test" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nvidia-test" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
If you don't see a command prompt, try pressing enter.
warning: couldn't attach to pod/nvidia-test, falling back to streaming logs: Internal error occurred: Internal error occurred: error attaching to container: container is in CONTAINER_EXITED state
Mon Jul 15 23:07:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:07:00.0 Off |                  N/A |
|  0%   50C    P8              18W / 420W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
pod "nvidia-test" deleted

So we know the nvidia drivers are installed and working. But when we install the NVIDIA GPU Operator:

helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --wait

we get multiple pods stuck in init state:

$ kubectl get pod -n gpu-operator                                     
NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-sq9nj                                  0/1     Init:0/1   0          61m
gpu-operator-847b55dcb4-pzf55                                1/1     Running    0          66m
gpu-operator-node-feature-discovery-gc-6788b6ccf8-z8cxk      1/1     Running    0          66m
gpu-operator-node-feature-discovery-master-bc9c67575-l4khm   1/1     Running    0          66m
gpu-operator-node-feature-discovery-worker-8885l             1/1     Running    0          66m
gpu-operator-node-feature-discovery-worker-stvdp             1/1     Running    0          66m
nvidia-dcgm-exporter-qf4gp                                   0/1     Init:0/1   0          61m
nvidia-device-plugin-daemonset-862b2                         0/1     Init:0/1   0          61m
nvidia-operator-validator-4r5dh                              0/1     Init:0/4   0          62m

It seems that the main hangup is the nvidia-operator-validator. We can see this in the logs for its driver-validation container:

$ kubectl logs nvidia-operator-validator-4r5dh -c driver-validation -n gpu-operator | head
time="2024-07-15T22:07:08Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-07-15T22:07:08Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory

Environment

Client:
        Tag:         v1.7.5
        SHA:         47731624ee4e22831d70d87d5cfdaa90ddeacd42

        Built:       
        Go version:  go1.22.4
        OS/Arch:     linux/amd64
Server:
        NODE:        172.22.0.16
        Tag:         v1.7.5
        SHA:         47731624
        Built:       
        Go version:  go1.22.4
        OS/Arch:     linux/amd64
        Enabled:     
smira commented 1 month ago

You probably need to open an issue with NVIDIA gpu-operator, as the problem is in the driver detection phase?

As Talos Linux has the drivers installed correctly, I'm not sure what else we can do.

achristianson commented 1 month ago

> You probably need to open an issue with NVIDIA gpu-operator, as the problem is in the driver detection phase?

The problem is they consider talos to be an unsupported platform, so they won't put resources into making gpu-operator work with talos.

Something we could do in talos is make sure the right files are in the right place such that gpu-operator just works (which may be a fairly trivial change in talos).

Otherwise, the current GPU support in talos is limited to more basic GPU workloads that don't need multi-GPU scheduling, topology optimization, mixed GPU features, time slicing, etc.

smira commented 1 month ago

It runs in a container, so unless it tries to escape to the PID 1 user namespace, you can pre-create any files it might need in the init container (assuming /run is within the container, not a host mount).
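Something like the following might work as a stopgap. This is an untested sketch: it assumes the validator reads /run/nvidia/validations from a hostPath mount of the node's /run, that the gpu-operator namespace allows hostPath pods, and <gpu-node> is a placeholder for the GPU node's hostname:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pre-create-driver-ready          # hypothetical one-off pod
  namespace: gpu-operator
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: <gpu-node>   # placeholder, pin to the GPU node
  containers:
    - name: touch
      image: busybox
      command:
        - sh
        - -c
        - mkdir -p /host-run/nvidia/validations && touch /host-run/nvidia/validations/.driver-ctr-ready
      volumeMounts:
        - name: run
          mountPath: /host-run
  volumes:
    - name: run
      hostPath:
        path: /run                       # node's /run, where the validator looks for the file
EOF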

But I believe people are running all kinds of advanced GPU workloads with Talos without any issues (I'm not sure what exactly the path is there).

achristianson commented 1 month ago

I believe you're right that it can be done with enough init containers and custom talos configs. Given gpu-operator is one of the primary ways people enable GPU workloads in clusters with features beyond the drivers and toolkit, it would be nice if it just worked in talos.

This bug report/feature request should probably be renamed/reconsidered to be: "put the drivers in a standard location where they're expected to exist, especially by official nvidia k8s components."

frezbo commented 1 month ago

We'll be looking into using pre-compiled drivers (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html) to solve this long-term. As those docs state, we'll still need to support the existing method, since some driver versions and features are not supported by the precompiled drivers.
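(For reference, if I'm reading that page right, the operator's precompiled-driver mode is opted into with Helm values along these lines; the exact value names and supported driver branches are per the linked docs:)

helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.usePrecompiled=true \
    --set driver.version="535"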

Hexoplon commented 1 month ago

I believe this should be addressed in the next release of the GPU operator. See https://github.com/NVIDIA/gpu-operator/pull/747 and a few other related merged PRs; they have added variables to define custom driver paths.

Hexoplon commented 1 month ago

@achristianson until they release the new version, look into https://github.com/NVIDIA/k8s-device-plugin. It should address the multi-GPU and time-slicing features you wanted while you wait.
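(Roughly, per the plugin's own chart and the talos GPU guide; the release name and namespace here are just examples:)

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace \
    --set runtimeClassName=nvidia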

achristianson commented 1 month ago

@Hexoplon thank you. I think that k8s-device-plugin is exactly what I wanted. It is installed as part of gpu-operator, but it can run without it and gives me everything I need. In fact, this is even better because it doesn't have other unnecessary processes running.

This issue can be closed as far as I'm concerned, but I'll leave that up to the project maintainers depending on the approach they'd like to take with talos.