kubernetes / autoscaler

Autoscaling components for Kubernetes

Cluster autoscaler deleting nodes containing pods with `safe-to-evict: false` annotation #7244

Open blueprismo opened 2 months ago

blueprismo commented 2 months ago

Which component are you using?: Cluster autoscaler

What version of the component are you using?: v1.27.1

Component version: v1.27.1

What k8s version are you using (kubectl version)?:

kubectl version output:

```sh
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.13", GitCommit:"96b450c75ae3c48037f651b4777646dcca855ed0", GitTreeState:"clean", BuildDate:"2024-04-16T15:03:38Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.16-eks-2f46c53", GitCommit:"c1665482a8b066c35d81db51f8d8cc92aa598040", GitTreeState:"clean", BuildDate:"2024-07-25T04:23:25Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}
```

What environment is this in?: EKS - AWS

What did you expect to happen?: The cluster autoscaler sees the pod annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` and respects it, waiting for the pod to complete before removing the node it is running on.

What happened instead?: The scale-down did NOT respect the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation and deleted the node, killing my very important running pod.

How to reproduce it (as minimally and precisely as possible): Run your autoscaler at v1.27 with these values:

```sh
./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=tagstagstags
      --logtostderr=true
      --stderrthreshold=info
      --v=4
```

ASG configuration: desired capacity 2, minimum 1.
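The `--node-group-auto-discovery` value above is redacted; for reference, it normally takes the documented ASG-tag form, with the cluster name below being a placeholder:

```sh
# Hypothetical example of the documented auto-discovery tag format (replace <cluster-name>)
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```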

Spawn your very important pod that shouldn't be killed (give it any metadata.name; the one below is just a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: very-important-pod   # placeholder name; any pod name works
  namespace: awx
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - image: 'quay.io/ansible/awx-ee:23.1.0'
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        limits:
          memory: 2Gi
          cpu: 2
        requests:
          memory: 500Mi
          cpu: 500m
  tolerations:
  - key: nodegroup-type
    operator: "Equal"
    value: on-demand
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
```

Afterwards, add a resource-consuming deployment like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exhaust-resources
  namespace: awx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: exhaust-resources
  template:
    metadata:
      labels:
        app: exhaust-resources
    spec:
      tolerations:
      - key: nodegroup-type
        operator: "Equal"
        value: on-demand
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
      containers:
      - name: exhaust-resources
        image: busybox
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        command: ["sh", "-c", "while true; do echo 'Running...'; sleep 30; done;"]
```

This will trigger a scale-up of the node group to fit the new pods. When the scale-down happens, cross your fingers that the initial pod is not killed along the way; its safe-to-evict annotation is not respected at all.
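To watch the repro unfold, a couple of generic commands help (assuming the autoscaler runs as a Deployment named `cluster-autoscaler` in `kube-system`, which may differ per install):

```sh
# Watch nodes appear during scale-up and disappear during scale-down
kubectl get nodes -w

# Follow the cluster-autoscaler logs to see its scale-down decisions
kubectl -n kube-system logs -f deployment/cluster-autoscaler
```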

Anything else we need to know?:

I have a couple of hypotheses. One is that instance scale-in protection on the ASG is disabled by default, and the ASG's own scale-in decisions may take precedence over whatever the cluster autoscaler wants.
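If that is the case, it should be visible on the ASG itself; for example, with standard AWS CLI calls (the ASG name is a placeholder):

```sh
# Check whether newly launched instances are protected from scale-in (placeholder ASG name)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-node-group-asg \
  --query 'AutoScalingGroups[].NewInstancesProtectedFromScaleIn'

# Enable scale-in protection for newly launched instances
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-node-group-asg \
  --new-instances-protected-from-scale-in
```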

Another one is that the annotation needs to be set at the controller (Deployment/ReplicaSet) level, because my very important workload runs as a bare pod, with no ReplicaSet or Deployment on top of it.
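For workloads that do have a controller, the annotation has to sit on the pod template so the pods inherit it; a minimal sketch with placeholder names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: very-important-deployment   # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: very-important
  template:
    metadata:
      labels:
        app: very-important
      annotations:
        # The cluster autoscaler reads this off the pod when deciding whether a node can be drained
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
```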

adrianmoisey commented 2 months ago

/area cluster-autoscaler

erdincmemsource commented 2 weeks ago

Does anyone know the latest version of cluster-autoscaler that doesn't have this bug?

blueprismo commented 2 weeks ago

> Does anyone know the latest version of cluster-autoscaler that doesn't have this bug?

I don't know, but I managed to work around it by setting a PodDisruptionBudget with an absurdly high minAvailable.
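Roughly, that workaround looks like the sketch below; it assumes the important pod carries a label such as `app: awx-worker`, which is a placeholder:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-awx-worker        # placeholder name
  namespace: awx
spec:
  minAvailable: 100               # deliberately unreachable, so voluntary evictions are always blocked
  selector:
    matchLabels:
      app: awx-worker             # assumed label on the important pod
```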

SeanZG8 commented 2 weeks ago

@blueprismo we have experienced a similar issue. We suspended the AZRebalance process (under ASG -> Advanced Configuration) on the ASG itself. We suspected that process was terminating nodes to rebalance availability zones (outside of the autoscaler's control), which made it look as though the autoscaler wasn't respecting the safe-to-evict annotation.
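For reference, the same suspension can also be done with the AWS CLI (the ASG name is a placeholder):

```sh
# Stop the ASG from terminating instances to rebalance across availability zones
aws autoscaling suspend-processes \
  --auto-scaling-group-name my-node-group-asg \
  --scaling-processes AZRebalance
```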