Open Champ-Goblem opened 3 months ago
/area cluster-autoscaler
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After a period of inactivity, `lifecycle/stale` is applied
- After further inactivity once `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After further inactivity once `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
**Which component are you using?:** Cluster Autoscaler

**Component version:** Any

**What k8s version are you using (`kubectl version`)?:**

**What environment is this in?:** Any
**What did you expect to happen?:**

Nodes report ready status when using a resource other than `nvidia.com/gpu`.

**What happened instead?:**

The cluster autoscaler judges resource readiness for Nvidia cards based on the exact resource name `nvidia.com/gpu`, which causes autoscaling issues when Nvidia cards are exposed under other resource names. For example, MIG uses `nvidia.com/mig-<slice_count>g.<memory_size>gb`, and time-sliced instances can be set to use `nvidia.com/gpu.shared` or `nvidia.com/mig-<slice_count>g.<memory_size>gb.shared`. Because the Nvidia resource check is hardcoded to `nvidia.com/gpu`, nodes exposing these other names are treated as not ready, and autoscaling starts to fail once the `okTotalUnreadyCount` value is reached.

**How to reproduce it (as minimally and precisely as possible):**
**Anything else we need to know?:**