kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Autoscaler removing kubernetes Job Pods leading to JobBackOff #7095

Open pythonking6 opened 3 months ago

pythonking6 commented 3 months ago

Which component are you using?:

cluster-autoscaler, running on AWS EKS

What version of the component are you using?: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

[ec2-user@ip-xx-xx-xx-xx ~]$ helm list -n kube-system
NAME                              NAMESPACE    CHART                                    APP VERSION
cluster-autoscaler                kube-system  cluster-autoscaler-9.21.0                1.23.0
cluster-proportional-autoscaler   kube-system  cluster-proportional-autoscaler-1.0.1    1.8.6

Component version:

What k8s version are you using (kubectl version)?:

$ kubectl version
v1.25.0

What environment is this in?:

EKS running in AWS. The deployed cluster is using Kubernetes version 1.28.

What did you expect to happen?:

I have 40 Kubernetes Jobs that are scheduled simultaneously. If I manually scale up to 40 GPUs and disable downscaling, I have no issues: all 40 Jobs run to completion. However, when I let the autoscaler scale up based on the nvidia.com/gpu: 1 request in the Job manifest, two things happen:

  1. The autoscaler scales up twice as many GPUs as needed (so 80 instead of 40).
  2. The autoscaler then realizes that’s too many GPUs and starts to scale down after the coolDownPeriod.

I expect the autoscaler to allocate 40 GPUs. I also expect the autoscaler to leave long-running pods untouched until they complete.

What happened instead?:

Some of the pods receive a SIGTERM and terminate. This leads to the Job's backoffLimit (which I deliberately set to zero) being exceeded. Moreover, I have configured the autoscaler with a utilization threshold of zero via the flag --scale-down-utilization-threshold=0, so that even if a pod isn't using any of its GPU, the node underneath it should not be removed until the Job is completed.
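For reference, that flag is passed through the chart's extraArgs; the following is only a minimal sketch of the relevant Helm values, with the cluster name and region as placeholders rather than the actual values file in use:

# Sketch of cluster-autoscaler Helm values (illustrative, not the exact file in use)
autoDiscovery:
  clusterName: my-eks-cluster          # placeholder
awsRegion: us-east-1                   # placeholder
extraArgs:
  scale-down-utilization-threshold: 0  # as described above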

How to reproduce it (as minimally and precisely as possible):

Run 40 Kubernetes Jobs in the same namespace and let the autoscaler scale up and down as it sees fit.

Anything else we need to know?:

If I freeze the number of GPUs at 40 and let the Jobs run to completion, there are no issues. I have created a PodDisruptionBudget of 1000 in the namespace for the Jobs with a specific label. Moreover, I have added the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation to the Job manifest.
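To make the setup concrete, here is a hedged sketch of what the Job and PodDisruptionBudget described above could look like; the names, namespace, image, and label are placeholders, and the PDB value of 1000 is assumed to be minAvailable since the report does not say which field was used:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01                # placeholder
  namespace: gpu-jobs             # placeholder
spec:
  backoffLimit: 0                 # deliberately zero, as described above
  template:
    metadata:
      labels:
        app: gpu-job              # placeholder label, matched by the PDB below
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest   # placeholder
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU per Job, as described above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-jobs-pdb              # placeholder
  namespace: gpu-jobs             # placeholder
spec:
  minAvailable: 1000              # "budget of 1000" from the report; the field used is an assumption
  selector:
    matchLabels:
      app: gpu-job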

adrianmoisey commented 3 months ago

/area cluster-autoscaler

davejab commented 3 months ago

I think I am experiencing a similar issue with chart version 9.37.0 on EKS 1.28.

The FAQ states:

What types of pods can prevent CA from removing a node? ...

  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

This would imply Job pods should be safe from eviction; however, I can see the cluster autoscaler evicting pods of running Jobs.

Did adding "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" successfully mitigate this for you?

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale