Nvidia resource name is hardcoded to nvidia.com/gpu

Champ-Goblem commented 3 months ago

Which component are you using?: Cluster Autoscaler

Component version: Any

What k8s version are you using (kubectl version)?:

kubectl version Output

$ kubectl version
Any

What environment is this in?: Any

What did you expect to happen?: Nodes report ready status when using a resource other than nvidia.com/gpu

What happened instead?:

The cluster autoscaler judges resource readiness for Nvidia cards based on the resource name nvidia.com/gpu. This causes issues with autoscaling when Nvidia cards are exposed under other resource names.

For example, MIG uses:

nvidia.com/mig-<slice_count>g.<memory_size>gb

and time-sliced instances can be set to use:

nvidia.com/gpu.shared or nvidia.com/mig-<slice_count>g.<memory_size>gb.shared

Because the Nvidia resource check is hardcoded to nvidia.com/gpu, nodes that expose only these other resource names are not reported as ready, and autoscaling starts to fail once the okTotalUnreadyCount value is reached.
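
To illustrate the kind of check that would cover the names above, here is a minimal Go sketch. It is not the autoscaler's actual code: the helpers isNvidiaGPUResource and hasAllocatableNvidiaGPU, and the use of a plain map for a node's allocatable resources, are assumptions made for illustration. It treats nvidia.com/gpu, any nvidia.com/mig-* name, and their .shared variants as Nvidia GPU resources.

```go
package main

import (
	"fmt"
	"strings"
)

// nvidiaResourcePrefix is the vendor domain shared by every resource
// name quoted in this issue.
const nvidiaResourcePrefix = "nvidia.com/"

// isNvidiaGPUResource is a hypothetical helper (not part of Cluster
// Autoscaler). It accepts nvidia.com/gpu, nvidia.com/mig-<profile>,
// and the ".shared" variants of both.
func isNvidiaGPUResource(name string) bool {
	if !strings.HasPrefix(name, nvidiaResourcePrefix) {
		return false
	}
	rest := strings.TrimPrefix(name, nvidiaResourcePrefix)
	// Time-sliced resources may carry a ".shared" suffix.
	rest = strings.TrimSuffix(rest, ".shared")
	return rest == "gpu" || strings.HasPrefix(rest, "mig-")
}

// hasAllocatableNvidiaGPU reports whether any Nvidia GPU resource in the
// (hypothetical) allocatable map has a non-zero quantity.
func hasAllocatableNvidiaGPU(allocatable map[string]int64) bool {
	for name, qty := range allocatable {
		if qty > 0 && isNvidiaGPUResource(name) {
			return true
		}
	}
	return false
}

func main() {
	// Example node exposing only a MIG resource; the profile 1g.5gb is
	// one illustrative instance of the naming pattern above.
	node := map[string]int64{"cpu": 8, "nvidia.com/mig-1g.5gb": 7}
	fmt.Println(hasAllocatableNvidiaGPU(node)) // prints "true"
}
```

A check that compares only against the literal nvidia.com/gpu would return false for this node. Matching on the name pattern is one option; making the resource name configurable would be another.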

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

voelzmo commented 3 months ago

/area cluster-autoscaler

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kubernetes / autoscaler

Nvidia resource name is hardcoded to nvidia.com/gpu #7050