Bump nvidia driver for CUDA 12.1 support

kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management

https://kops.sigs.k8s.io/

Apache License 2.0


Bump nvidia driver for CUDA 12.1 support #16557

Open ddelange opened 4 months ago

ddelange commented 4 months ago

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

Bump the nvidia driver for CUDA 12.1 support.

The kOps source currently defaults to nvidia-headless-515-server.
The latest version currently available in the Ubuntu repo is 550.

We are currently running our 1.26 cluster configured with DriverPackage nvidia-driver-535 (with CUDA 12.0 support).

Note that we moved away from nvidia-headless-XXX-server because it does not install nvidia-smi and some other binaries on the host system, which caused issues for us with some CUDA Docker images that rely on the host having them available. The full driver package uses barely any additional disk space, so that was a quick and effective fix.
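To illustrate the kind of host-side check this implies, here is a minimal sketch that reports which expected binaries are missing from the node's PATH. The function name `missing_host_binaries` is hypothetical, and only `nvidia-smi` is named in this issue; any other binaries would have to be passed in by the caller.

```python
import shutil


def missing_host_binaries(names=("nvidia-smi",)):
    """Return the entries of `names` that cannot be found on PATH.

    Only nvidia-smi is mentioned in the issue; extend `names` as needed.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_host_binaries()
    if missing:
        print("missing on host:", ", ".join(missing))
    else:
        print("all expected NVIDIA binaries found")
```

Running this on a GPU node after switching driver packages would give a quick signal whether the headless package left out something the container images expect.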

When we tried bumping to nvidia-driver-550 on EC2, the nodes stopped registering in the cluster. I'm afraid the change was reverted before I could pull logs from the host system. I hope that your CI catches this.

2. Feel free to provide a design supporting your feature request.

ddelange commented 4 months ago

cc @hakman via blame:)

hakman commented 4 months ago

Unfortunately, our CI does not run on GPU nodes. We don't have credits to run such tests at the moment...

ddelange commented 4 months ago

just double-checked: nvidia-driver-535 already supports CUDA 12.1, which runs stable in our cluster
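For reference, a rough sketch of how such a check could be scripted. The branch-to-CUDA table below is an assumption based on my reading of NVIDIA's CUDA release notes (highest toolkit version each driver branch shipped alongside), and the helper names are hypothetical; please verify the numbers against NVIDIA's compatibility documentation before relying on them.

```python
import re

# ASSUMPTION: highest CUDA toolkit version per driver branch, taken from
# NVIDIA's release notes as I understand them. Verify before relying on it.
MAX_CUDA_BY_BRANCH = {
    515: (11, 7),
    525: (12, 0),
    530: (12, 1),
    535: (12, 2),
    550: (12, 4),
}


def driver_branch(package: str) -> int:
    """Extract the 3-digit branch from names like 'nvidia-driver-535'."""
    match = re.search(r"-(\d{3})(?:-|$)", package)
    if match is None:
        raise ValueError(f"no driver branch found in {package!r}")
    return int(match.group(1))


def supports_cuda(package: str, cuda: tuple) -> bool:
    """True if the package's branch is listed as supporting `cuda` (major, minor)."""
    branch = driver_branch(package)
    # Fall back to the newest known branch that is not newer than ours.
    known = [b for b in MAX_CUDA_BY_BRANCH if b <= branch]
    if not known:
        return False
    return MAX_CUDA_BY_BRANCH[max(known)] >= cuda


if __name__ == "__main__":
    for pkg in ("nvidia-headless-515-server", "nvidia-driver-535", "nvidia-driver-550"):
        print(pkg, supports_cuda(pkg, (12, 1)))
```

Under the assumed table, the 515 package fails the CUDA 12.1 check while 535 and 550 pass it, which matches what we see in our cluster.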

ddelange commented 4 months ago

Opened https://github.com/kubernetes/kops/pull/16560

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten