kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

ALB reports 504 errors whenever the autoscaler decreases node count #6679

Open ibalat opened 7 months ago

ibalat commented 7 months ago

Which component are you using?:

What k8s version are you using (kubectl version)?: v1.29.0

What environment is this in?: AWS EKS

What did you expect to happen, and what happened instead?: I use a preStop hook to prevent 50x errors, but when the autoscaler runs to decrease node count, the ALB reports 504 and target connection errors.

For example, the autoscaler started decreasing node count at ~23:15 and ~02:10, and the ALB began reporting 503, 504, and target connection errors at the same times. Every night and during low-traffic hours, I see the same problem.

[IMPORTANT] I manually restarted, scaled, and redeployed all apps to try to reproduce the same errors, but none of those operations caused any problem, because I use a preStop hook. The errors occur only when the autoscaler decreases node count, not when it increases it.
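For reference, a preStop hook of the kind described above is typically paired with a terminationGracePeriodSeconds long enough to cover the ALB target-group deregistration delay. This is a minimal sketch; the container name, image, sleep duration, and grace period are illustrative assumptions, not taken from the reporter's setup:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app   # hypothetical deployment name
spec:
  template:
    spec:
      # Must exceed the preStop sleep plus the ALB deregistration delay,
      # or the kubelet will kill the container mid-drain.
      terminationGracePeriodSeconds: 60
      containers:
        - name: app          # hypothetical container name
          image: nginx:1.25  # placeholder image
          lifecycle:
            preStop:
              exec:
                # Keep serving traffic while the ALB target group
                # deregisters this pod's endpoint.
                command: ["sh", "-c", "sleep 30"]
```

Note that a preStop hook only helps when the pod itself is deleted gracefully; as this issue shows, it cannot help if the underlying node is terminated before the pods on it are drained.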


How to reproduce it (as minimally and precisely as possible):

ibalat commented 7 months ago

hi @sftim, can you please help with this issue?

adrianmoisey commented 4 months ago

/area cluster-autoscaler

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

PaddyAdallah commented 3 days ago

[RECOMMENDATION - Auto Scaling Group Termination Lifecycle Hook]

Amazon EC2 Auto Scaling offers the ability to add lifecycle hooks to your Auto Scaling groups. [1] [2]

These hooks let you create solutions that are aware of events in the Auto Scaling instance lifecycle, and then perform a custom action on instances when the corresponding lifecycle event occurs. A lifecycle hook provides a specified amount of time (one hour by default) to wait for the action to complete before the instance transitions to the next state.

In the event of a scale-down like the one observed in your cluster, the lifecycle hook puts the instance into a wait state (Terminating:Wait).

The instance remains in a wait state either until you complete the lifecycle action, or until the timeout period ends (one hour by default). After you complete the lifecycle hook or the timeout period expires, the instance transitions to the next state (Terminating:Proceed) where the instance is terminated.

In your cluster's case, I recommend setting the timeout period to approximately 10 minutes or less, depending on how long is needed to ensure the node is fully drained before termination.

[1] How lifecycle hooks work in Auto Scaling groups - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks-overview.html

[2] Amazon EC2 Auto Scaling lifecycle hooks - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
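The lifecycle-hook setup described above can be sketched with the AWS CLI. This is a hedged example, not a drop-in command: the Auto Scaling group name, hook name, and instance ID are placeholders, and the 600-second heartbeat timeout matches the ~10-minute recommendation above.

```shell
# Attach a termination lifecycle hook to the node group's ASG.
# "eks-node-group-asg" and "drain-before-terminate" are placeholder names.
aws autoscaling put-lifecycle-hook \
  --auto-scaling-group-name eks-node-group-asg \
  --lifecycle-hook-name drain-before-terminate \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 600 \
  --default-result CONTINUE

# Once the node is drained, complete the hook so the instance moves to
# Terminating:Proceed without waiting out the full timeout.
aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name eks-node-group-asg \
  --lifecycle-hook-name drain-before-terminate \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0   # placeholder instance ID
```

Something (a Lambda function, a node-drain daemon, or an operator) still has to drain the node and then call complete-lifecycle-action; the hook itself only buys the waiting time.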