ibalat opened this issue 7 months ago
Hi @sftim, can you please help with this issue?
/area cluster-autoscaler
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
[RECOMMENDATION - Auto Scaling Group Termination Lifecycle Hook]
Amazon EC2 Auto Scaling offers the ability to add lifecycle hooks to your Auto Scaling groups. [1] [2]
These hooks let you create solutions that are aware of events in the Auto Scaling instance lifecycle, and then perform a custom action on instances when the corresponding lifecycle event occurs. A lifecycle hook provides a specified amount of time (one hour by default) to wait for the action to complete before the instance transitions to the next state.
In the event of a scale-down like the one observed in your cluster, the lifecycle hook puts the instance into a wait state (Terminating:Wait). The instance remains in that state until you complete the lifecycle action or until the timeout period ends (one hour by default); at that point it transitions to the next state (Terminating:Proceed), where it is terminated.
In your cluster's case, I recommend setting the timeout period to approximately 10 minutes or less, depending on the amount of time needed to ensure the node is successfully drained before termination (see the sketch after the references below).
[1] How lifecycle hooks work in Auto Scaling groups - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks-overview.html
[2] Amazon EC2 Auto Scaling lifecycle hooks - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
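As a rough illustration of the recommendation above, here is a minimal boto3 sketch that attaches a termination lifecycle hook with a 10-minute timeout and then releases the instance once draining is done. The group name, hook name, and timeout are assumptions for illustration, not values taken from this issue.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical names; substitute your own Auto Scaling group.
ASG_NAME = "eks-worker-nodes"
HOOK_NAME = "drain-before-terminate"

# Hold terminating instances in Terminating:Wait for up to 10 minutes
# (instead of the one-hour default) so the node drain can finish first.
autoscaling.put_lifecycle_hook(
    AutoScalingGroupName=ASG_NAME,
    LifecycleHookName=HOOK_NAME,
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    HeartbeatTimeout=600,       # seconds; size this to your drain time
    DefaultResult="CONTINUE",   # proceed with termination on timeout
)

def release_instance(instance_id: str) -> None:
    """Signal that draining finished so the instance can terminate early."""
    autoscaling.complete_lifecycle_action(
        AutoScalingGroupName=ASG_NAME,
        LifecycleHookName=HOOK_NAME,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```

In practice, complete_lifecycle_action would be called by whatever performs the drain (for example, a Lambda function subscribed to the lifecycle event) rather than by hand.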
Which component are you using?: cluster-autoscaler
What k8s version are you using (kubectl version)?: v1.29.0
What environment is this in?: AWS EKS
What did you expect to happen, and what happened instead?: I use a preStop hook to prevent 50x errors, but when the autoscaler scales nodes down, the ALB reports 504 and target connection errors.
For example, the autoscaler starts decreasing the node count at ~23:15 and ~02:10, and the ALB starts reporting 503, 504, and target connection errors at the same time. I see the same problem every night and during low-traffic hours.
[IMPORTANT] I manually restarted, scaled, and redeployed all apps to try to reproduce the same errors, but none of those operations caused any problem, because I use a preStop hook. The errors appear only when the autoscaler decreases the node count, never when it increases it.
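For context, the preStop hook mentioned above would look roughly like the following sketch using the official Python kubernetes client; the container name, image, and sleep duration are illustrative assumptions, not taken from the issue.

```python
from kubernetes import client

# Hypothetical container spec. The preStop sleep delays SIGTERM so the
# ALB can finish deregistering the pod before it stops serving traffic.
container = client.V1Container(
    name="web",                      # placeholder name
    image="example.com/web:latest",  # placeholder image
    lifecycle=client.V1Lifecycle(
        pre_stop=client.V1LifecycleHandler(
            _exec=client.V1ExecAction(command=["sh", "-c", "sleep 30"])
        )
    ),
)
```

A preStop delay like this only protects pods that are drained gracefully; it does not help if the node itself is terminated before the drain completes, which matches the scale-down behaviour described above.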
How to reproduce it (as minimally and precisely as possible):