Do you have PodDisruptionBudgets set for your app? Are you running managed node groups?
Managed node groups will drain the node properly when PodDisruptionBudgets are set.
The problem is that the node isn't deregistered from the NLB before its EC2 instance is shut down. We are not having an issue with our application pods draining from the nodes. Each node that ends up "going away" drains down to only 3 pods, all of which are part of the k8s infrastructure. kube-proxy is one of them, and I think this is the root of the issue: kube-proxy is still running on the node when its EC2 instance is suddenly shut off. The NLB should stop sending traffic to the node before the EC2 instance stops.
I hope this clears up the problem we're experiencing.
What is happening:
1. Node is detected as unneeded by CA.
2. CA waits 10m before it actually starts the steps to remove the node.
3. CA drains the node of all pods except DaemonSets (which include kube-proxy).
4. The node keeps receiving requests because kube-proxy is still there.
5. After the drain, CA issues an EC2 terminate-instance.
6. All remaining pods are abruptly stopped because the instance is terminated.
7. The LB takes about ~3m to detect the node is down and remove it from rotation.
What is supposed to happen:
1. Node is detected as unneeded by CA.
2. CA waits 10m before it actually starts the steps to remove the node.
3. CA adds the node.kubernetes.io/exclude-from-external-load-balancers label to the node.
4. This makes the LB deregister the node gracefully.
5. CA waits for the node to be deregistered.
6. CA drains the node of all pods except DaemonSets (which include kube-proxy).
7. After the drain, CA issues an EC2 terminate-instance.
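As a rough manual approximation of the labeling step above (a sketch only; this is not what CA does today, and the node name is a placeholder), the well-known exclusion label can be applied before the node is drained so the cloud controller deregisters it from external load balancers:

```sh
# Sketch of the proposed flow, applied by hand. The node name is a placeholder.
# 1. Ask the load balancer controller to deregister the node from external LBs.
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# 2. Give the NLB time to drain and deregister the target.
sleep 180

# 3. Only then drain and terminate the node as usual.
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
```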
Anyone? After reading through the other bug reports, this seems to be a long-standing issue with cluster-autoscaler.
any update?
We had the same issue, but we managed to solve it to some extent by implementing the following workarounds.
By implementing the above workarounds, we managed to bring the 504 errors down to 0.
Same here. If the problem is that kube-proxy is still receiving traffic when the node shuts down, can we change the NLB from instance mode to IP mode so traffic is sent to the Pods directly?
> Node keeps receiving requests because kube-proxy is still there
I think even if kube-proxy is still there, if you keep externalTrafficPolicy at the default (Cluster), it should forward the requests to other nodes, since there are no other application pods running on that node. Have you tried setting a preStop hook for the pod?
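For reference, such a preStop hook could look roughly like the sketch below (the deployment and container name myapp are placeholders and the sleep length is arbitrary; this illustrates the suggestion, it is not a confirmed fix for the node-level problem):

```sh
# Hypothetical example: add a preStop sleep to a container named "myapp" so it
# keeps serving in-flight requests briefly after it is told to terminate.
# kubectl patch uses a strategic merge patch by default, which merges the
# containers list by name instead of replacing it.
kubectl patch deployment myapp -p '
spec:
  template:
    spec:
      containers:
      - name: myapp
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]
'
```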
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Any update? We also have this issue on GCP: the ILB isn't aware of the removal of the node in time.
https://github.com/matti/k8s-prestop-sidecar
I wrote this, would this help?
any update? please solve this problem now
same issue: https://github.com/kubernetes/autoscaler/issues/6679
@databonanza do you have any solution? Does the problem still exist for you?
I do not. My team fixed the issue or worked around it, and it's been so long since this happened that I don't recall what was done to resolve it. My suspicion is that they worked around the issue rather than fixing it. We would have submitted a bug fix if we knew what the proper solution was.
I find it ridiculous, however, that such an issue can persist for so long without any support from the k8s community.
[RECOMMENDATION - AutoScaling Group Termination Lifecycle Hook]
Amazon EC2 Auto Scaling offers the ability to add lifecycle hooks to your Auto Scaling groups.
These hooks let you create solutions that are aware of events in the Auto Scaling instance lifecycle, and then perform a custom action on instances when the corresponding lifecycle event occurs. A lifecycle hook provides a specified amount of time (one hour by default) to wait for the action to complete before the instance transitions to the next state.
In the event of a scale-down like the one observed in your cluster, the lifecycle hook puts the instance into a wait state (Terminating:Wait).
The instance remains in a wait state either until you complete the lifecycle action or until the timeout period ends (one hour by default). After you complete the lifecycle hook or the timeout period expires, the instance transitions to the next state (Terminating:Proceed), where the instance is terminated.
In your cluster's case, I recommend setting the timeout period to approximately 10 minutes or less, depending on the amount of time needed to make sure the node is successfully drained before termination; a CLI sketch follows the references below.
[1] How lifecycle hooks work in Auto Scaling groups - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks-overview.html
[2] Amazon EC2 Auto Scaling lifecycle hooks - https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
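A rough sketch of the above with the AWS CLI (the hook name, Auto Scaling group name, timeout, and instance ID are example values, not taken from this issue):

```sh
# Add a termination lifecycle hook to the node group's Auto Scaling group so a
# terminating instance pauses in Terminating:Wait for up to 10 minutes.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-before-terminate \
  --auto-scaling-group-name eks-nodegroup-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 600 \
  --default-result CONTINUE

# Once the node has been drained and deregistered from the NLB, complete the
# hook so the instance moves on to Terminating:Proceed and is terminated.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-before-terminate \
  --auto-scaling-group-name eks-nodegroup-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0
```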
Which component are you using?: cluster-autoscaler
What version of the component are you using?: 9.11.0
What k8s version are you using (kubectl version)?: v1.18.20-eks-8c49e2
What environment is this in?: AWS
What did you expect to happen?: We expect that when we reduce replicas for an application and that triggers a node scale-down, there is no downtime. Instead, cluster-autoscaler tells AWS to shut off EC2 nodes that are still running kube-proxy (and still receiving traffic).
What happened instead?: The application returns 504 errors when the nodes are removed, even though the pods running the application had already been moved to the nodes that are staying up more than 10 minutes earlier.
How to reproduce it (as minimally and precisely as possible): Scale up by setting replicas to a high number (150) and then scale back down to a low number (50). Monitor using a load testing tool (50 concurrent users) while the different stages of the scale down occur.
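(Purely as an illustration of the steps above, assuming a deployment named myapp:)

```sh
# Scale up, wait for cluster-autoscaler to add nodes and pods to become Ready,
# then scale back down while a load test (~50 concurrent users) is running.
kubectl scale deployment myapp --replicas=150
# ...wait for the new nodes and pods...
kubectl scale deployment myapp --replicas=50
# About 10 minutes later cluster-autoscaler terminates the now-unneeded nodes;
# watch the load test for 504s during that window.
```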
Anything else we need to know?: I believe this is a bug in how cluster-autoscaler notifies AWS to shut off nodes. It should tell AWS to drain connections to the node before removing it completely. It "feels" like CA is just telling AWS to shut the node off even though kube-proxy is still running on it... thus causing 504s for a short period of time.