Closed: psharik1 closed this issue 1 year ago.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
We're having the same issue here. Did you find a solution or cause, @psharik1?
/remove-lifecycle rotten
i am seeing the same problem.
my setup is very similar to @psharik1's setup:
cluster-autoscaler v1.23.1
...
- command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/build1-east1-us-prod
- --balance-similar-node-groups=true
- --expander=least-waste
- --logtostderr=true
- --scale-down-enabled=true
- --scale-down-utilization-threshold=0.875
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --stderrthreshold=info
- --v=4
env:
- name: AWS_REGION
value: us-east-1
image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.1
...
aws eks cluster v1.23 w/ managed node group.
$ kubectl version -o yaml
clientVersion:
buildDate: "2022-09-21T14:33:49Z"
compiler: gc
gitCommit: 5835544ca568b757a8ecae5c153f317e5736700e
gitTreeState: clean
gitVersion: v1.25.2
goVersion: go1.19.1
major: "1"
minor: "25"
platform: darwin/amd64
kustomizeVersion: v4.5.7
serverVersion:
buildDate: "2022-10-24T20:35:40Z"
compiler: gc
gitCommit: 55bd5d5cb7d32bc35e4e050f536181196fb8c6f7
gitTreeState: clean
gitVersion: v1.23.13-eks-fb459a0
goVersion: go1.17.13
major: "1"
minor: 23+
platform: linux/amd64
i only have a single managed node group (with a single instance type) so there are no complications around which asg cluster-autoscaler should be scaling up/down.
this node group is spread across three availability zones, however, which may be relevant here. i.e. i could see how aws's asg logic will try to keep an even balance of nodes across all availability zones despite what cluster-autoscaler is trying to do. 🤷
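for anyone who wants to poke at the az spread themselves, something along these lines should work (the cluster, nodegroup, and asg names are placeholders, and this is just a sketch, not what i actually ran):
# find the asg that backs the managed node group
$ aws eks describe-nodegroup \
    --cluster-name <cluster-name> \
    --nodegroup-name <nodegroup-name> \
    --query 'nodegroup.resources.autoScalingGroups[].name' \
    --output text
# list each instance in that asg together with its availability zone
$ aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names <asg-name> \
    --query 'AutoScalingGroups[0].Instances[].{Id:InstanceId,AZ:AvailabilityZone}' \
    --output table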
anyway, i have a single workload (bamboo-server) in this cluster that is annotated with cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
cluster-autoscaler sees & recognizes the annotation as is evidenced by this log message during scale down:
I1117 19:26:58.695216 1 cluster.go:169] Fast evaluation: node ip-10-160-39-124.ec2.internal cannot be removed: pod annotated as not safe to evict present: bamboo-server-7895bd5c7c-4rj9c
however, the node that hosts the workload gets removed anyway... something that cluster-autoscaler appears to be surprised by:
I1117 19:30:40.334019 1 scale_down.go:667] Can't retrieve node ip-10-160-39-124.ec2.internal from snapshot, removing from unremovable map, err: node not found
my hypothesis is this:
cluster-autoscaler will select specific nodes for removal based on logic that includes looking at pod scheduling constraints and the above annotation.
however, in order for it to remove a node from the aws autoscaling group it also has to decrease the desired size of the autoscaling group by one… otherwise, if it removes a node from the autoscaling group, the autoscaling group will immediately spin up a new one.
so there’s either a bug or a race condition that causes the aws autoscaling group to pick a random node for termination when cluster-autoscaler decreases the desired size of the autoscaling group.
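to make the difference concrete, the two relevant asg operations behave differently; here's a rough sketch with the plain aws cli (the asg name and instance id are placeholders):
# lowering only the desired capacity lets the asg pick which instance to
# terminate according to its own termination policies / az balancing
$ aws autoscaling set-desired-capacity \
    --auto-scaling-group-name <asg-name> \
    --desired-capacity 3
# this call instead names the instance to remove and decrements the desired
# capacity in the same operation
$ aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id <instance-id> \
    --should-decrement-desired-capacity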
presumably, this is the code that is called when cluster-autoscaler deletes a particular node:
which is called from here:
which, in turn, is called from here:
the following log entries tell me that this code is actually being called during scale down operations:
I1121 17:43:16.757598 1 aws_manager.go:294] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
anyway, this code in particular suggests that it is possible to tell aws which node to remove from the asg while at the same time decreasing the asg desired size.
but this is where i suspect the problem lies. maybe aws only heeds the desired size and removes an arbitrary node from the asg. 🤷
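one way to see, from the aws side, what actually terminated an instance is the asg activity history; a rough sketch (asg name is a placeholder):
# activity descriptions indicate whether a termination came from an api call,
# a scaling policy, or a process such as AZRebalance
$ aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <asg-name> \
    --max-items 20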
afaict, my suspicion is confirmed by the fact that the kube audit log shows that the user eks:node-manager (which is the cluster-external iam role arn:aws:iam::042475242167:role/AWSWesleyClusterManagerLambda-NodeManagerRole-1TIDI5IYNBQSD) is responsible for evicting my workload... not cluster-autoscaler (which is system:serviceaccount:kube-system:cluster-autoscaler).
maybe this issue is related 🤷: https://github.com/kubernetes/autoscaler/issues/3693
so, yeah... turns out that this is a known issue and is caused by the "AZRebalance" feature of the ec2 autoscaling group that is created by eks managed nodegroups. this feature ensures that ec2 instances in an asg are spread evenly across availability zones.
https://github.com/kubernetes/autoscaler/issues/3693
https://github.com/aws/containers-roadmap/issues/1453
there is a workaround but it comes with a caveat:
To resolve the issue you can either suspend the AZRebalance process so that your ASG will stop re-balancing the instances across AZs, as mentioned in the documentation (https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-benefits.html#AutoScalingBehavior.InstanceUsage), or configure Scale-In protection: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html. NOTE: Please be cautious with this as Managed Nodegroups is a fully AWS managed offering and any out-of-band changes to the underlying resources may result in unexpected issues later, and hence this is not recommended by us.
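for reference, the two workarounds boil down to something like the following (asg name and instance ids are placeholders; as the note above says, changing the asg behind a managed nodegroup out of band is not recommended by aws):
# option 1: stop the asg from rebalancing instances across azs
$ aws autoscaling suspend-processes \
    --auto-scaling-group-name <asg-name> \
    --scaling-processes AZRebalance
# option 2: protect specific instances from scale-in
$ aws autoscaling set-instance-protection \
    --auto-scaling-group-name <asg-name> \
    --instance-ids <instance-id> \
    --protected-from-scale-in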
thanks to the excellent aws support engineer, Rahul J., who researched this for me ❤️
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Which component are you using?:
Cluster Autoscaler
What version of the component are you using?:
Component version: v1.21.2
What k8s version are you using (kubectl version)?:
What environment is this in?: Linux, AWS
What did you expect to happen?: Cluster nodes with safe-to-evict set to false should not be terminated
What happened instead?: Cluster nodes with safe-to-evict set to false got terminated
How to reproduce it (as minimally and precisely as possible):
1) Configure CA with default options:
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=cluster-autoscaler
Annotations: prometheus.io/port: 8085, prometheus.io/scrape: true
Service Account: cluster-autoscaler
Containers:
cluster-autoscaler:
Image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.2
Port:
Host Port:
Command:
./cluster-autoscaler
--v=4
--stderrthreshold=info
--cloud-provider=aws
--skip-nodes-with-local-storage=false
--expander=least-waste
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/psharik-karpenter-demo1
--balance-similar-node-groups
--skip-nodes-with-system-pods=false
Limits:
cpu: 100m
memory: 500Mi
Requests:
cpu: 100m
memory: 500Mi
Environment:
2) Create a managed nodegroup (I used m5.large on demand instances)
3) Create a deployment with safe-to-evict set to false and scale out to 15-20 nodes (a rough sketch of the corresponding kubectl commands follows this list).
4) Once completed, scale in to 3-4 pods, making sure each pod is sitting on a separate node.
5) You will see an additional node spun up, and the existing node with the annotation set is terminated with the error below: scale_down.go:666] Can't retrieve node ip-xxxxx.us-west-2.compute.internal from snapshot, removing from unremovable map, err: node not found
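For steps 3 and 4, the commands would look roughly like the following (the patch/scale invocations here are illustrative, not the exact commands used):
# Step 3: mark the pods as not safe to evict and scale out
$ kubectl patch deployment nginx-to-scaleout \
    -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
$ kubectl scale deployment nginx-to-scaleout --replicas=20
# Step 4: scale back in and wait for Cluster Autoscaler to start removing nodes
$ kubectl scale deployment nginx-to-scaleout --replicas=3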
Commands:
kubectl get pods -o wide
nginx-to-scaleout-7595f64494-4w4bh 1/1 Running 0 12h 192.168.31.98 ip-192-168-23-30.us-west-2.compute.internal
xxxxx@8c85909fe07a /tmp % kubectl get nodes
NAME                                          STATUS   ROLES   AGE   VERSION
ip-192-168-23-30.us-west-2.compute.internal   Ready            13h   v1.21.5-eks-9017834
2 mins later:
xxxxx@8c85909fe07a /tmp % kubectl get nodes
NAME                                          STATUS                     ROLES   AGE   VERSION
ip-192-168-23-30.us-west-2.compute.internal   Ready,SchedulingDisabled           13h   v1.21.5-eks-9017834
ip-192-168-76-44.us-west-2.compute.internal   NotReady                           6s    v1.21.5-eks-9017834   <--- New instance got spun up and was marked for removal after a minute.
Anything else we need to know?: