kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Nodes with safe-to-evict flag set to false get evicted during scale down #4789

Closed psharik1 closed 1 year ago

psharik1 commented 2 years ago

Which component are you using?:

Cluster Autoscaler

What version of the component are you using?:

Component version: v1.21.2

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:58:09Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: Linux, AWS

What did you expect to happen?: Cluster nodes with safe-to-evict set to false should not be terminated.

What happened instead?: Cluster nodes with safe-to-evict set to false got terminated.

How to reproduce it (as minimally and precisely as possible):

1) Configure CA with default options:

RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=cluster-autoscaler
  Annotations:      prometheus.io/port: 8085
                    prometheus.io/scrape: true
  Service Account:  cluster-autoscaler
  Containers:
   cluster-autoscaler:
    Image:      us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.2
    Port:
    Host Port:
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/psharik-karpenter-demo1
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     100m
      memory:  500Mi
    Environment:

2) Create a managed nodegroup (I used m5.large on-demand instances).

3) Create a deployment with safe-to-evict set to false and scale out to 15-20 nodes (a sketch of such a deployment follows these steps).

4) Once the scale-out completes, scale in to 3-4 pods, making sure each pod sits on a separate node.

5) You will see an additional node spun up and an existing node with the annotation set terminated, with the error below:

scale_down.go:666] Can't retrieve node ixxxxx,west-2.compute.internal from snapshot, removing from unremovable map, err: node not found
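For reference, a minimal sketch of the deployment used in steps 3-4. The resource requests are assumptions, sized so each replica lands on its own m5.large node; the part that matters is the safe-to-evict annotation on the pod template:

```sh
# sketch of step 3: a deployment whose pods are annotated as not safe to evict
# (resource requests are assumptions chosen to force roughly one pod per m5.large node)
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx-to-scaleout
  template:
    metadata:
      labels:
        app: nginx-to-scaleout
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 1500m
            memory: 4Gi
EOF

# step 4: scale back in so only a few annotated pods remain, one per node
kubectl scale deployment nginx-to-scaleout --replicas=3
```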

Commands:

kubectl get pods -o wide

nginx-to-scaleout-7595f64494-4w4bh 1/1 Running 0 12h 192.168.31.98 ip-192-168-23-30.us-west-2.compute.internal

xxxxx@8c85909fe07a /tmp % kubectl get nodes
NAME                                          STATUS   ROLES   AGE   VERSION
ip-192-168-23-30.us-west-2.compute.internal   Ready            13h   v1.21.5-eks-9017834

2 mins later:

xxxxx@8c85909fe07a /tmp % kubectl get nodes
NAME                                          STATUS                     ROLES   AGE   VERSION
ip-192-168-23-30.us-west-2.compute.internal   Ready,SchedulingDisabled           13h   v1.21.5-eks-9017834
ip-192-168-76-44.us-west-2.compute.internal   NotReady                           6s    v1.21.5-eks-9017834   <--- New instance got spun up and was marked for removal after a minute.
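(For anyone hitting this: the ASG's own activity history is probably the quickest way to see what actually terminated the node; the ASG name below is a placeholder.)

```sh
# list the most recent scaling activities for the nodegroup's asg to see
# whether the asg itself, rather than cluster-autoscaler, terminated the node
# (asg name is a placeholder)
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name eks-managed-nodegroup-asg \
  --max-items 10
```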

Anything else we need to know?:

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Gabisonfire commented 2 years ago

We're having the same issue here. Did you find a solution/cause, @psharik1?

Gabisonfire commented 2 years ago

/remove-lifecycle rotten

sdickhoven commented 1 year ago

i am seeing the same problem.

my setup is very similar to @psharik1's setup:

cluster-autoscaler v1.23.1

...
      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/build1-east1-us-prod
        - --balance-similar-node-groups=true
        - --expander=least-waste
        - --logtostderr=true
        - --scale-down-enabled=true
        - --scale-down-utilization-threshold=0.875
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=info
        - --v=4
        env:
        - name: AWS_REGION
          value: us-east-1
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.1
...

aws eks cluster v1.23 w/ managed node group.

$ kubectl version -o yaml:

clientVersion:
  buildDate: "2022-09-21T14:33:49Z"
  compiler: gc
  gitCommit: 5835544ca568b757a8ecae5c153f317e5736700e
  gitTreeState: clean
  gitVersion: v1.25.2
  goVersion: go1.19.1
  major: "1"
  minor: "25"
  platform: darwin/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-10-24T20:35:40Z"
  compiler: gc
  gitCommit: 55bd5d5cb7d32bc35e4e050f536181196fb8c6f7
  gitTreeState: clean
  gitVersion: v1.23.13-eks-fb459a0
  goVersion: go1.17.13
  major: "1"
  minor: "23+"
  platform: linux/amd64

i only have a single managed node group (with a single instance type) so there are no complications around which asg cluster-autoscaler should be scaling up/down.

this node group is spread across three availability zones, however, which may be relevant here. i.e. i could see how aws's asg logic will try to keep an even balance of nodes across all availability zones despite what cluster-autoscaler is trying to do. 🤷
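for what it's worth, the az spread is easy to eyeball with the standard zone label (nothing here is specific to my setup):

```sh
# list nodes together with their availability zone label
kubectl get nodes -L topology.kubernetes.io/zone
```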

anyway, i have a single workload (bamboo-server) in this cluster that is annotated with

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

cluster-autoscaler sees & recognizes the annotation as is evidenced by this log message during scale down:

I1117 19:26:58.695216 1 cluster.go:169] Fast evaluation: node ip-10-160-39-124.ec2.internal cannot be removed: pod annotated as not safe to evict present: bamboo-server-7895bd5c7c-4rj9c 

however, the node that hosts the workload gets removed anyway... something that cluster-autoscaler appears to be surprised by:

I1117 19:30:40.334019 1 scale_down.go:667] Can't retrieve node ip-10-160-39-124.ec2.internal from snapshot, removing from unremovable map, err: node not found 

my hypothesis is this:

cluster-autoscaler will select specific nodes for removal based on logic that includes looking at pod scheduling constraints and the above annotation.

however, in order for it to remove a node from the aws autoscaling group it also has to decrease the desired size of the autoscaling group by one… otherwise, if it removes a node from the autoscaling group, the autoscaling group will immediately spin up a new one.

so there’s either a bug or a race condition that causes the aws autoscaling group to pick a random node for termination when cluster-autoscaler decreases the desired size of the autoscaling group.

presumably, this is the code that is called when cluster-autoscaler deletes a particular node:

https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.23.1/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L294-L299

which is called from here:

https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.23.1/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L289-L297

which, in turn, is called from here:

https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.23.1/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L279-L301

the following log entries tell me that this code is actually being called during scale down operations:

I1121 17:43:16.757598 1 aws_manager.go:294] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation 

anyway, this code in particular suggests that it is possible to tell aws which node to remove from the asg while at the same time decreasing the asg desired size.
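if i'm reading that DeleteInstances code correctly, the rough aws cli equivalent of the request it builds would be something like this (the instance id is a placeholder):

```sh
# terminate one specific instance and decrement the asg's desired capacity
# in the same call (instance id is a placeholder)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --should-decrement-desired-capacity
```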

but this is where i suspect the problem lies. maybe aws only heeds the desired size and removes an arbitrary node from the asg. 🤷

afaict, my suspicion is confirmed by the fact that the kube audit log shows that the user eks:node-manager (which is the cluster-external iam role arn:aws:iam::042475242167:role/AWSWesleyClusterManagerLambda-NodeManagerRole-1TIDI5IYNBQSD) is responsible for evicting my workload... not cluster-autoscaler (which is system:serviceaccount:kube-system:cluster-autoscaler).
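if the cluster's control-plane audit logs are shipped to cloudwatch, a query along these lines should surface those eviction events (the log group name assumes the cluster is named after the asg discovery tag above, so treat it as a placeholder):

```sh
# search recent audit log events for api calls made by the eks node-manager role;
# log group name is a placeholder and assumes audit logging to cloudwatch is enabled
aws logs filter-log-events \
  --log-group-name /aws/eks/build1-east1-us-prod/cluster \
  --filter-pattern '"eks:node-manager"' \
  --max-items 20
```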

maybe this issue is related 🤷: https://github.com/kubernetes/autoscaler/issues/3693

sdickhoven commented 1 year ago

so, yeah... turns out that this is a known issue and is caused by the "AZRebalance" feature of the ec2 autoscaling group that is created by eks managed nodegroups. this feature ensures that ec2 instances in an asg are spread evenly across availability zones.

https://github.com/kubernetes/autoscaler/issues/3693 https://github.com/aws/containers-roadmap/issues/1453

there is a workaround but it comes with a caveat:

> To resolve the issue you can either suspend the AZRebalance process so that your ASG will stop re-balancing the instances across AZs, as mentioned in the documentation (https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-benefits.html#AutoScalingBehavior.InstanceUsage), or configure scale-in protection: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html
>
> NOTE: Please be cautious with this as Managed Nodegroups is a fully AWS managed offering and any out-of-band changes to the underlying resources may result in unexpected issues later, and hence this is not recommended by us.
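for reference, this is roughly what those two workarounds look like with the aws cli. the asg name and instance id below are placeholders, and (per the note above) both are out-of-band changes to the managed nodegroup's asg:

```sh
# option 1: stop the asg from rebalancing instances across azs
# (asg name is a placeholder; use the asg behind your managed nodegroup)
aws autoscaling suspend-processes \
  --auto-scaling-group-name eks-managed-nodegroup-asg \
  --scaling-processes AZRebalance

# option 2: protect specific instances from scale-in
# (instance id is a placeholder)
aws autoscaling set-instance-protection \
  --auto-scaling-group-name eks-managed-nodegroup-asg \
  --instance-ids i-0123456789abcdef0 \
  --protected-from-scale-in
```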

thanks to the excellent aws support engineer, Rahul J., who researched this for me ❤️

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/4789#issuecomment-1567571204):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.