kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Getting 502/504 with Pod Readiness Gates during rolling updates #1719

Open calvinbui opened 3 years ago

calvinbui commented 3 years ago

I'm making use of the Pod Readiness Gate on Kubernetes Deployments running Golang-based APIs. The goal is to achieve full zero downtime deployments.

During a rolling update of the Kubernetes Deployment, I'm getting 502/504 responses from these APIs. This did not happen when setting target-type: instance.

I believe the problem is that AWS does not drain the pod from the LB before Kubernetes terminates it.

Timeline of events:

  1. Perform a rolling update on the deployment (1 replica)
  2. A second pod is created in the deployment
  3. AWS registers a second target in the Load Balancing Target Group
  4. Both pods begin receiving traffic
  5. I'm not sure what happens first at this point: (a) AWS begins de-registering/draining the target, or (b) Kubernetes begins terminating the pod
  6. Traffic sent to the deployment begins receiving 502 and 504 errors
  7. The old pod is deleted
  8. Traffic returns to normal (200)
  9. The target is de-registered/drained (depending on delay)

This is tested with a looping curl command:

while true; do
  curl --write-out '%{url_effective} - %{http_code} -' --silent --output /dev/null -L https://example.com | pv -N "$(date +"%T")" -t
  sleep 1
done

Results:

https://example.com - 200 - 13:04:16: 0:00:00
https://example.com - 502 - 13:04:17: 0:00:01
https://example.com - 200 - 13:04:20: 0:00:00
https://example.com - 504 - 13:04:31: 0:00:10
https://example.com - 200 - 13:04:32: 0:00:00
https://example.com - 200 - 13:04:33: 0:00:00
https://example.com - 200 - 13:04:34: 0:00:00
https://example.com - 200 - 13:04:35: 0:00:00
https://example.com - 200 - 13:04:36: 0:00:00
AirbornePorcine commented 3 years ago

We've been having the same issue. We confirmed with AWS that there is some propagation time between when a target is marked draining in a target group and when that target actually stops receiving new connections. So, at the suggestion of other issues I've seen in the old project for this, we added a 20s sleep in a preStop script. This hasn't entirely eliminated the errors, though; they still happen on deployment, just not with as much volume. Following this to see if anyone else has any good ideas, as troubleshooting these 502s has been infuriatingly difficult.

M00nF1sh commented 3 years ago

@calvinbui The pods need a preStop hook that sleeps, since most web frameworks (e.g. nginx/apache) stop accepting new connections once asked to soft-stop (SIGTERM), and it takes some time for the controller to deregister the pod (after it gets the endpoint change event) and for the ELB to propagate target changes to its dataplane.

@AirbornePorcine did you still see 502s with the 20s sleep? Have you enabled the pod readinessGate? If you are using instance mode, you need an extra 30 seconds of sleep (since kube-proxy updates iptables rules every 30 seconds).
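
For anyone following along, a minimal sketch of that preStop-sleep pattern, assuming a hypothetical container name and image (the 20s value is only a starting point; terminationGracePeriodSeconds must exceed the sleep plus the app's own shutdown time):

spec:
  terminationGracePeriodSeconds: 60   # must be longer than the preStop sleep + app shutdown
  containers:
    - name: api                       # hypothetical container name
      image: example.com/my-api:v1    # placeholder image
      lifecycle:
        preStop:
          exec:
            # keep the pod alive (and serving) while the controller deregisters the
            # target and the ELB propagates the change to its dataplane
            command: ["sh", "-c", "sleep 20"]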

AirbornePorcine commented 3 years ago

@M00nF1sh that's correct, even with a 20s sleep and the auto-injected readinessGate, doing a rolling restart of my pods results in a small number of 502s. For reference, this is something like 5-6 502s out of 1m total requests in the same time period, so a very small amount, but still not something we want. I'm using IP mode here.

M00nF1sh commented 3 years ago

@AirbornePorcine in my own test, the sum of controller process time (from pod kill to target deregistered) and ELB API propagation time (from deregister API call to targets actually removed from the ELB dataplane) takes less than 10 seconds.

And the preStop hook sleep only needs to be controller process time + ELB API propagation time + HTTP request/response RTT.

I've just asked the ELB team whether they have p90/p99 metrics available for ELB API propagation time. If so, we can recommend a safe preStop sleep.

AirbornePorcine commented 3 years ago

Ok, so, we just did some additional testing on that sleep timing.

The only way we've been able to get zero 502s during a rolling deploy is to set our preStop sleep to the target group's deregistration delay + at least 5s. It seems there's no way to guarantee that AWS isn't still sending you new requests until the target is fully removed from the target group, not just marked "draining".

Looking back in my emails, I realized this is exactly what AWS support had previously told us to do - don't stop the target from processing requests until the target group deregistration delay has elapsed at minimum (we added the 5s to account for the controller process and propagation time as you mentioned).

Next week we'll try tweaking our deregistration delay and see if the same holds true (it's currently 60s, but we really don't want to sleep that long if we can avoid it)
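
For reference, roughly what that looks like with the 60s delay we're running today; this is a sketch with illustrative names and values, not a definitive recipe:

# Ingress annotation: the target group's deregistration delay (illustrative value)
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60

# Pod template: preStop sleep = deregistration delay + ~5s buffer,
# with a termination grace period comfortably above the sleep
terminationGracePeriodSeconds: 90
containers:
  - name: app                         # placeholder container name
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 65"]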

Something you might want to try though @calvinbui!

calvinbui commented 3 years ago

Thanks for the comments.

Adding a preStop and sleep, I was able to get all 200s during a rolling update of the deployment. I set deregistration time to 20 seconds and sleep to 30 seconds.

However, during a node upgrade/rolling update I got 503s for around one minute. Are there any recommendations from AWS about that? I'm guessing I would need to bump up the deregistration delay and probably the sleep times a lot higher to allow the new node to fire up and the new pods to start as well.

calvinbui commented 3 years ago

After increasing the sleep to 90s and terminationGracePeriod to 120s, there is no downtime during a cluster upgrade/node upgrade on EKS.

However, if a deployment only has 1 replica, there is still ~1 min of downtime. For deployments with >=2 replicas, this was not a problem and no downtime was observed.

The documentation should be updated, so I'll leave this issue open.

EDIT: For the 1 replica issue, it was because k8s doesn't do a rolling deployment during a cluster/node upgrade. It is considered involuntary, so I had to scale up to 2 replicas and add a PDB.
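
For anyone hitting the same node-upgrade problem, a minimal sketch of the PDB half of that fix (names are placeholders); combined with replicas: 2 on the Deployment, a node drain then evicts only one pod at a time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb                    # hypothetical name
spec:
  minAvailable: 1                     # keep at least one pod up while nodes are drained
  selector:
    matchLabels:
      app: my-api                     # must match the Deployment's pod labels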

foriequal0 commented 3 years ago

How about abusing (?) a ValidatingAdmissionWebhook for delaying pod deletion? Here's a sketch of the idea:

  1. The ValidatingAdmissionWebhook intercepts pod deletion. It won't allow deletion of the pod at first if the pod is reachable from the ALB (ip-type ingress).
  2. However, it patches the pod: it removes labels and ownerReferences so the pod is removed from the ReplicaSet and the Endpoints. The ELB also starts draining since the pod is removed from the Endpoints.
  3. After some time passes and the ELB finishes its draining, the pod is deleted by aws-load-balancer-controller.

edit: I've implemented this idea as a chart here: https://github.com/foriequal0/pod-graceful-drain
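
For illustration only, this is the kind of webhook registration such an approach hinges on; every name, namespace, and path below is made up, and the actual pod-graceful-drain chart may wire this up differently:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-graceful-drain-example     # hypothetical name
webhooks:
  - name: pod-deletion.example.com     # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun          # the webhook patches the pod, so it has side effects
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["DELETE"]         # intercept pod deletion requests
    clientConfig:
      service:
        namespace: kube-system         # hypothetical namespace
        name: pod-graceful-drain       # hypothetical service serving the webhook
        path: /validate-pod-deletion   # hypothetical path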

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

project0 commented 2 years ago

This is still a serious issue; any update on it? We currently use the solution from @foriequal0, which has been doing a great job so far. I wish this were handled officially by the controller project itself.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

project0 commented 2 years ago

/remove-lifecycle stale

ardove commented 2 years ago

What's the protocol for getting this prioritized? We've hit it as well. This is a serious issue and while I understand there's a workaround (hack), it's certainly reducing my confidence in running production workloads on this thing.

albgus commented 2 years ago

I'm also seeing this issue, but I think it's not necessarily an issue with the LB Controller? It seems draining for NLBs doesn't work as I would have expected: instead of stopping new connections and letting existing connections continue, it keeps sending new connections to the draining targets for a while.

From my testing, the actual delay for a target to be fully de-registered and drained seems to be around 2-3 minutes.

Adding this to each container exposed behind an NLB has worked for me so far.

          lifecycle:
            preStop:
              exec:
                command: [ sh, -c, "sleep 180" ]

I would love to be able to get rid of this but it simply seems that the NLBs are extremely slow in performing management operations. I have even seen target registrations take almost 10 minutes.

dfinucane commented 2 years ago

I completely agree with what @ardove has said.

The point of this readinessGate feature is to delay the termination of the pod for as long as the LB needs it. If I have to update my chart to put a sleep in the preStop hook, then it means this feature is not working. If I have to use the preStop hook, then I might as well not use this readinessGate feature at all.

In my observation, the pod is allowed to terminate as soon as the new target becomes ready/healthy. I have seen the old target still draining after the pod terminates, and obviously that's going to result in 502 errors for those requests.

This feature almost works. Without the feature enabled I see 30 seconds to 1 minute of solid 502 errors. With the feature enabled I get brief sluggishness and maybe one or a handful of 502s. Hopefully you can get this fixed, because unfortunately close to good isn't good enough for something like this.

aaron-hastings-travelport commented 2 years ago

I thought it might be useful to share this KubeCon talk, "The Gotchas of Zero-Downtime Traffic w/ Kubernetes", where the speaker goes into the strategies required for zero-downtime rolling updates with Kubernetes deployments (at least as of 2022):

https://www.youtube.com/watch?v=0o5C12kzEDI

It can be a bit hard to conceptualise the limitations of the async nature of Ingress/Endpoint objects and Pod termination, so I found the above talk (and live demo) helped a lot.

Hopefully it's useful for others.

jyotibhanot commented 2 years ago

@M00nF1sh I am implementing the same in my kubernetes cluster but am unable to calculate the sleep time for the preStop hook and terminationGracePeriodSeconds. Currently terminationGracePeriodSeconds is 120 seconds and the deregistration delay is 300 seconds. Do we have any mechanism to calculate this?

project0 commented 1 year ago

Does anyone have an update on this? After almost two years I cannot see that it has been solved natively yet.

project0 commented 1 year ago

I wonder if finalizers would solve this problem nicely here :thinking:

project0 commented 1 year ago

For clusters using Traefik proxy as ingress, it might also be worth looking into the entrypoint lifecycle feature to control graceful shutdowns: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle. At least in this case it avoids the need for the sleep workaround :-)
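
A short sketch of what that looks like in Traefik's static configuration (entrypoint name and values are illustrative); requestAcceptGraceTimeout keeps the entrypoint accepting new requests after SIGTERM, which plays the same role as the preStop sleep discussed above:

entryPoints:
  web:
    address: ":80"
    transport:
      lifeCycle:
        requestAcceptGraceTimeout: "30s"  # keep accepting new requests after SIGTERM
        graceTimeOut: "10s"               # then let in-flight requests finish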

smulikHakipod commented 1 year ago

https://www.reddit.com/r/ProgrammerHumor/comments/1092kmf/just_add_sleep/j3vqiv2?utm_medium=android_app&utm_source=share&context=3

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

dongho-jung commented 1 year ago

Would the EndpointSlice terminating condition solve this issue? It says "Consumers of the EndpointSlice API, such as Kube-proxy and Ingress Controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints." But I'm not sure it would work in this case too.

https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/

ThisIsQasim commented 1 year ago

/remove-lifecycle rotten

rkubik-hostersi commented 1 year ago

Bumping this issue. Adding sleep() does not sound professional; it's a workaround and only a workaround :/

dusansusic commented 12 months ago

I am experiencing this issue, too.

OverStruck commented 11 months ago

Any update? Does the pod readiness gate work with v2.6?

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kdavh commented 7 months ago

Hi folks, I wanted to add that I experimented with all the solutions suggested here, and share what finally worked for me.

I tried an extra sleep during preStop for the container with a matching extra terminationGracePeriod for the pod, reducing the ALB deregistration delay, and explicitly turning the pod healthcheck unhealthy during preStop, among various other experiments and combinations. Even extending the termination to 10 minutes didn't stop traffic continually flowing from the ALBs, nor the small number of errors right as the pods finished terminating.

--> I finally tried turning alb.ingress.kubernetes.io/target-type from instance to ip and that fixed it.

After reflecting, I don't know why I thought instance mode would ever work cleanly. The ALB is tracking node health, and the incoming and outgoing pods can be arranged randomly on those nodes. I'm not even sure the ALB ever saw a node go unhealthy from its perspective, because there are multiple pods on each node, so the periodic healthcheck always passes.

dickfickling commented 7 months ago

/remove-lifecycle rotten

adityapatadia commented 6 months ago

If anyone faces this, you should do this:

  1. Use alb.ingress.kubernetes.io/target-type: ip
  2. Make sure you use v2.x of ALB Controller and set this label on the namespace where you are putting your pods: kubectl label namespace <your_namespace> elbv2.k8s.aws/pod-readiness-gate-inject=enabled
  3. Reduce the de-register delay by applying this to your Ingress: alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30 (by default it's 300 seconds, which is too high)
  4. Set up that sleep delay with a preStop hook.

More information in this long article: https://easoncao.com/zero-downtime-deployment-when-using-alb-ingress-controller-on-amazon-eks-and-prevent-502-error/

This makes 502/504 go away completely.
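
Putting those four steps together, a rough sketch (resource and container names are placeholders, the Ingress spec is omitted, and the sleep just needs to cover the deregistration delay plus propagation):

# Ingress: IP targets and a shorter deregistration delay (steps 1 and 3)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                        # hypothetical name; spec/rules omitted
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30

# Pod template: preStop sleep (step 4); the namespace label from step 2 is applied with kubectl
terminationGracePeriodSeconds: 60
containers:
  - name: app                         # placeholder container name
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 40"]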

stepan-romankov commented 6 months ago

> If anyone faces this, you should do this:
>
>   1. Use alb.ingress.kubernetes.io/target-type: ip
>   2. Make sure you use v2.x of ALB Controller and set this label on the namespace where you are putting your pods: kubectl label namespace <your_namespace> elbv2.k8s.aws/pod-readiness-gate-inject=enabled
>   3. Reduce the de-register delay by applying this to your Ingress: alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30 (by default it's 300 seconds, which is too high)
>   4. Set up that sleep delay with a preStop hook.
>
> More information in this long article: https://easoncao.com/zero-downtime-deployment-when-using-alb-ingress-controller-on-amazon-eks-and-prevent-502-error/
>
> This makes 502/504 go away completely.

I did as you described in your article and I still have the 502/504 issue when I curl my health endpoint every millisecond.

{"message":"pong"} Status code: 200 Latency: 0.090163s
502 Bad Gateway Status code: 502 Latency: 0.091805s
{"message":"pong"} Status code: 200 Latency: 0.104470s
{"message":"pong"} Status code: 200 Latency: 0.094271s
504 Gateway Time-out Status code: 504 Latency: 10.104198s
{"message":"pong"} Status code: 200 Latency: 0.083560s
{"message":"pong"} Status code: 200 Latency: 0.090153s
{"message":"pong"} Status code: 200 Latency: 0.080708s
502 Bad Gateway Status code: 502 Latency: 3.153344s
{"message":"pong"} Status code: 200 Latency: 0.088603s
unifyapps-saleem commented 4 months ago

Hi Team,

I have followed the above steps, but no luck; I am still facing 502s. Is there any other workaround to fix this?

stepan-romankov commented 4 months ago

> Hi Team,
>
> I have followed the above steps, but no luck; I am still facing 502s. Is there any other workaround to fix this?

Check that you have "sh" in your container. For example, if you are using gcr.io/distroless/base, make sure you use the gcr.io/distroless/base:debug-nonroot-amd64 version, which includes /busybox/sh. The preStop setting in your Kubernetes manifest should also be adjusted to use "/busybox/sh".
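
For example, with a distroless debug image, the hook calls busybox's shell explicitly; something along these lines, with an illustrative sleep value:

containers:
  - name: app                                  # placeholder container name
    image: gcr.io/distroless/base:debug-nonroot-amd64
    lifecycle:
      preStop:
        exec:
          # distroless :debug images ship busybox, so invoke its shell directly
          command: ["/busybox/sh", "-c", "sleep 30"]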

unifyapps-saleem commented 4 months ago

Hey Stepan,

We are using node:lts-alpine & amazoncorretto:21-alpine-jdk images; sh is present in them.

veludcx commented 1 month ago

Hi Team, is there any solution for this problem? Adding the readiness gate, reducing the exponential backoff, and a preStop hook: none of them helped fix the issue.