pythonking6 opened this issue 3 months ago
/area cluster-autoscaler
I think I am experiencing a similar issue with chart version 9.37.0, EKS 1.28.
The FAQ states:
> What types of pods can prevent CA from removing a node? ...
> - Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

This would imply jobs should be safe from eviction; however, I can see the cluster autoscaler evicting running jobs.
Did adding "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
successfully mitigate this for you?
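For reference, a minimal sketch of where that annotation has to live for cluster-autoscaler to honor it: the autoscaler evaluates the annotation on the Pod, so for a Job it belongs in `spec.template.metadata.annotations`, not in the Job's own metadata. The names and image below are placeholders; only the annotation key and the `nvidia.com/gpu` request follow what is discussed in this thread.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-gpu-job               # placeholder name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Read by cluster-autoscaler from the Pod, not from the Job object
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: worker                # placeholder
          image: example.com/worker:latest   # placeholder
          resources:
            limits:
              nvidia.com/gpu: 1
```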
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Which component are you using?:
cluster-autoscaler, running on AWS EKS
What version of the component are you using?: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0
```
[ec2-user@ip-xx-xx-xx-xx ~]$ helm list -n kube-system
NAME                             NAMESPACE    CHART                                   APP VERSION
cluster-autoscaler               kube-system  cluster-autoscaler-9.21.0               1.23.0
cluster-proportional-autoscaler  kube-system  cluster-proportional-autoscaler-1.0.1   1.8.6
```
Component version:
What k8s version are you using (`kubectl version`)?:
What environment is this in?:
EKS running in AWS. The deployed cluster is using Kubernetes version 1.28.
What did you expect to happen?:
I have 40 Kubernetes jobs that are scheduled simultaneously. If I manually scale up to 40 GPUs and disable downscaling, I have no issues: all 40 jobs run to completion. However, when I let the autoscaler scale up based on the `nvidia.com/gpu: 1` request in the job manifest, two things happen (see "What happened instead?" below). I expect the autoscaler to allocate 40 GPUs. I also expect the autoscaler to leave long-running pods untouched until they complete.

What happened instead?:
Some of the pods get a SIGTERM signal and terminate. This leads to the job's `backoffLimit` being reached (which I deliberately set to zero). Moreover, I have configured the autoscaler with a utilization threshold of zero via the flag `--scale-down-utilization-threshold=0` (see the Helm values sketch at the end of this report), so that even if a pod isn't using any of its GPU, the node under it should not be destroyed until the job is completed.

How to reproduce it (as minimally and precisely as possible):
Run 40 Kubernetes jobs in the same namespace and let the autoscaler scale up and down as it sees fit.

Anything else we need to know?:
If I freeze the number of GPUs at 40 and let the jobs run to completion, there are no issues. I have created a PodDisruptionBudget of 1000 in the namespace for the jobs with a specific label. Moreover, I have added the `"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"` annotation to the job manifest.
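For completeness, a minimal sketch of the PodDisruptionBudget described above, assuming the reported value of 1000 was set through `minAvailable`; the name, namespace, and `app: gpu-job` label are placeholders, not the actual ones used.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-jobs-pdb          # placeholder name
  namespace: gpu-jobs         # placeholder namespace
spec:
  minAvailable: 1000          # value reported above; maxUnavailable: 0 would be an alternative
  selector:
    matchLabels:
      app: gpu-job            # placeholder for the specific label on the jobs
```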
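The `--scale-down-utilization-threshold=0` flag mentioned above would normally be passed through the chart's `extraArgs` values map; a hedged sketch of the relevant Helm values, with the cluster name and region as placeholders:

```yaml
autoDiscovery:
  clusterName: my-eks-cluster   # placeholder
awsRegion: us-east-1            # placeholder
extraArgs:
  # Rendered by the chart as --scale-down-utilization-threshold=0
  scale-down-utilization-threshold: 0
```

This would be applied with something like `helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler -n kube-system -f values.yaml`, assuming the chart repository is registered under the `autoscaler` alias.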