kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Nvidia resource name is hardcoded to nvidia.com/gpu #7050

Open Champ-Goblem opened 3 months ago

Champ-Goblem commented 3 months ago

Which component are you using?: Cluster Autoscaler

Component version: Any

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Any

What environment is this in?: Any

What did you expect to happen?: Nodes should be reported as ready when they expose an Nvidia resource name other than nvidia.com/gpu.

What happened instead?:

The cluster autoscaler judges resource readiness for Nvidia cards based on the resource name nvidia.com/gpu. This causes issues with autoscaling when other resource names are used for Nvidia cards.

For example, MIG (Multi-Instance GPU) uses:

nvidia.com/mig-<slice_count>g.<memory_size>gb

and time-sliced instances can be set to use:

nvidia.com/gpu.shared or nvidia.com/mig-<slice_count>g.<memory_size>gb.shared

Because the Nvidia resource check is hardcoded to nvidia.com/gpu, autoscaling starts to fail once the okTotalUnreadyCount value is reached.
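
To illustrate the kind of check that would cover the resource names above, here is a minimal sketch in Go. It is not the cluster autoscaler's actual code: the helper names isNvidiaGPUResource and hasAllocatableNvidiaGPU are hypothetical, a plain map stands in for a node's allocatable resources, and the matching rules assume only the name patterns listed in this issue.

```go
package main

import "strings"

// nvidiaResourcePrefix is the vendor domain shared by the resource names
// quoted in this issue.
const nvidiaResourcePrefix = "nvidia.com/"

// isNvidiaGPUResource is a hypothetical helper (not part of the autoscaler).
// It reports whether name matches one of the patterns from this issue:
// nvidia.com/gpu, nvidia.com/gpu.shared, or nvidia.com/mig-... (including
// the .shared variants of MIG names).
func isNvidiaGPUResource(name string) bool {
	if !strings.HasPrefix(name, nvidiaResourcePrefix) {
		return false
	}
	rest := strings.TrimPrefix(name, nvidiaResourcePrefix)
	return rest == "gpu" || rest == "gpu.shared" || strings.HasPrefix(rest, "mig-")
}

// hasAllocatableNvidiaGPU is a hypothetical helper. The map is an assumed
// stand-in for a node's allocatable resources (resource name -> quantity).
// It returns true if any Nvidia GPU resource has a positive quantity.
func hasAllocatableNvidiaGPU(allocatable map[string]int64) bool {
	for name, qty := range allocatable {
		if qty > 0 && isNvidiaGPUResource(name) {
			return true
		}
	}
	return false
}
```

With a check along these lines, a node exposing only a MIG or time-sliced resource (for example nvidia.com/mig-1g.5gb, used here purely as a sample name) would be treated the same as one exposing nvidia.com/gpu. A configurable list of resource names would be an alternative to pattern matching.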

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

voelzmo commented 3 months ago

/area cluster-autoscaler

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale