I should add that I've implemented a preStop delay on my pod to allow deregistration to occur prior to pod shutdown. This seems to be functioning correctly: I can see the deregistration in the controller logs, and several seconds later I see a SIGTERM on my pod. The issue seems to be between the ALB and the ingress controller.
I'm definitely experiencing the same thing. PowerShell preStop commands don't seem to like sleeping first either, so I had to really rig that up.
Is there any insight as to why terminated pods still get sent requests? It happens whether the target type is ip or instance.
@jorihardman is this happening every single time you update a deployment? Or only sometimes?
@nicholasgcoles To be honest when I was investigating this issue, my sample size was only about 5 deployments, each rolling 4 pods. I saw at least 1 504 for each of these. I didn't have constant traffic for each deploy, so honestly, it wasn't a perfect experimental setup. Still, the logs seemed to me to be a smoking gun for what I would expect to be an impossible situation, so I figured it was worth raising the question. Happy to investigate further with some guidance.
This looks similar to #905
We are able to reproduce this 100% of the time with target-type as ip. We added a 30-second sleep delay for preStop, which resolved the 502s. However, 504s happen even after the target has been removed from the ALB. We have verified that the 504s are caused by the ALB forwarding requests to a target that was already removed.
Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround and achieve 100% availability through deployments.
A couple of things I changed from my initial config:
# Extend the pod's shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60
# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]
Extending the sleep time allows ALB to send a few requests even after deregistration without the pod rejecting them. My pod needs at most 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without the pod being killed.
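For reference, here is a minimal sketch of how the two fragments above fit together in the pod template spec; the container name and image are placeholders, not values from the original config:
# Sketch of the pod template spec combining both settings; name and image are placeholders.
spec:
  terminationGracePeriodSeconds: 60   # 25s preStop sleep + up to 30s graceful drain, with headroom
  containers:
    - name: app                       # placeholder
      image: example/app:latest       # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "25"]  # keep serving while the ALB finishes deregistration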
I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.
I'm going to close this issue as it seems to be an AWS problem and not related to the controller. The workaround is to add sufficient preStop sleep to allow ALB to complete deregistration. I did 25s which might be overkill but seemed to mitigate the problem.
Previous testing was with our staging environment. When I shipped the changes to production, I still saw 504s from the ALB sending requests to terminating pods.
Agreed. I'm running windows containers but all of the theory is the same. Actually using a 60s sleep at this point and still getting 504s from the ALB during rollout.
preStop:
  exec:
    command:
      - powershell.exe
      - -c
      - echo "Stopping container"; Start-Sleep 60; Stop-Website "Main"
Strangely, I haven't gotten a 504 during deploys for the last 48 hours, and I've done 3 or 4 deploys at peak traffic times. I did tweak the readiness checks on my pod to make them ready quicker (lowered initialDelaySeconds). Perhaps that has allowed them to become healthy on the LB faster.
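For context, a rough sketch of what such a readiness-probe tweak looks like; the endpoint, port, and timings are illustrative, not the values from the original config:
readinessProbe:
  httpGet:
    path: /healthz            # illustrative endpoint
    port: 8080                # illustrative port
  initialDelaySeconds: 2      # lowered so the pod reports ready (and registers with the ALB) sooner
  periodSeconds: 5
  failureThreshold: 2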
Naturally, as soon as I signal success, I see another spurt of 502s. There's definitely still a deregistration race condition here. This time I got 502s due to a pod terminating prior to being deregistered. This seems to be pretty non-deterministic at this point. Sometimes alb-ingress-controller keeps up with changing targets, and sometimes it doesn't.
Any ETA on this?
For those who use the setup:
...here's an example of how to avoid the problem:
https://github.com/linkerd/linkerd2/issues/3747#issuecomment-557593979
I am not an expert in ALB internals, but I feel it happens because of its cross-region nature (and, maybe, multi-tenancy). It takes time to propagate changes to all instances, so the ALB will keep sending some traffic for a while even after a target has been asked to deregister. So the overall "sleep" configuration for the preStop hook should be calculated based on:
Have you guys tried --feature-gates=waf=false? This fixed the delay issue for us. You can see in the following graph that most of the retries are for waf [1], which seems to back up the rate-limit queue. You can collect queue-related metrics by adding MetricsBindAddress: ":8080" here. One can see the longest-running process reaching upwards of 300 sec.
After disabling waf, the longest-running process was down to a max of 5 sec (it still needs a sleep of 10 sec in preStop).
I have verified that the issue exists in v1.1.5. Let me know if you need any further info or help fixing this.
[1] ingress aws api metrics collected with prometheus.io/port: '10254'
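For anyone wanting to try the same thing, a sketch of where the flag goes, assuming the stock aws-alb-ingress-controller Deployment; only --feature-gates=waf=false is the addition discussed above, and the cluster name is a placeholder:
# Excerpt of the controller Deployment's container spec (names are placeholders).
containers:
  - name: alb-ingress-controller
    args:
      - --ingress-class=alb
      - --cluster-name=my-cluster     # placeholder
      - --feature-gates=waf=false     # skip WAF reconciliation so it stops backing up the rate-limit queue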
@aditya-pr, yes, and it is not enough to get a smooth rolling update.
@aditya-pr Amazing dashboard you've got there. Wish I had that when I was digging into this issue. To clarify, are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?
are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?
Yes, there is lag even after disabling waf. There are other AWS API calls which are still being retried. Adding all endpoints of an ingress to the queue whenever any endpoint updates must be making this worse. A deregistration delay of even 5 seconds means the ELB is sending requests to a terminated pod, i.e. 504 Gateway Timeout errors.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Having the same issue here. I have a preStop hook with a 60s sleep, but I still get 504s from the ALB after deleting a pod manually or scaling down the deployment. The timed-out requests are always received 8s after deregistration starts and time out 10s after that, i.e. 18s after deregistration starts. At that point the pod is still working and replies correctly to requests that I send to it directly...
Has anyone here found a solution to this?
@Elyahou I don't feel too bad, because this has been an ongoing issue for over a year at this point, but I moved on to kong-ingress-controller and have been very happy with it. The things that ALB handled nicely, for me primarily the AWS certs, are now all handled by adding a couple more pieces, namely cert-manager and external-dns, so I no longer worry about the ALB lifecycle conflicts.
ALB ingress seemed like an awesome idea and was dead simple to use, but the inability to sync up the lifecycles between pod and ALB delivery just became too much.
Any update on this? This should be reopened given the comments here, IMO.
Is it possible that the lb controller removes the security group entries while the target is still deregistering? Will investigate this further in the next couple of weeks.
I think I might be currently encountering this issue. I'll have to do further tests to verify, but it seems like this issue should be reopened.
Still an issue, btw. For those who come here after me: the solution AWS suggests is to rely on health checks rather than on target deregistration (https://aws.amazon.com/blogs/containers/how-to-rapidly-scale-your-application-with-alb-on-eks-without-losing-traffic/). The pod should just start failing health checks on SIGTERM while still serving traffic for some time. In that case a preStop hook is not required, which in theory should also improve rolling update time a bit.
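At the manifest level, the ALB side of that pattern is just a tighter target-group health check pointed at an endpoint the application starts failing after SIGTERM; the path and numbers below are illustrative, not taken from the blog post:
# Ingress annotations tightening the ALB target-group health check (illustrative values).
alb.ingress.kubernetes.io/healthcheck-path: /healthz        # endpoint the app fails after SIGTERM
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"    # ~10s until the target is marked unhealthy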
@123BLiN According to the sample from AWS, a pre-stop hook is still required.
I am curious if the "health-check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.
This is an issue for us too. A BIG one.
...It's incredible to me how this has been allowed to go on this long without a concrete fix or resolution 😂 . Talk about "Bias for Action" and "Ownership" 🎃
AWS philosophizes and writes literature on the 6 Well-Architected Pillars.
It also releases a Load Balancer Ingress Controller that regularly throws 50xs, because it doesn't work as advertised or as expected, even though the customer using their tool is doing everything in exactly the way they're being told to.
...And then lets the issue just kinda linger and languish for years...and then tells the customer that the "fix" is just a bunch of workarounds that impact the customer's kubernetes cluster in other ways...like delayed pod termination, which then impacts the scale-speed and cost characteristics of the cluster. 👍🏼
@123BLiN According to the sample from AWS, a pre-stop hook is still required.
I am curious if the "health-check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.
I guess this isn't a health check, but rather setting healthy to unhealthy so the health check starts to fail:
sed -i 's/health/nothealthy/g'
I believe the remaining 50x errors are caused by this, as he wrote in the sentence before the Conclusion chapter.
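A rough sketch of that pattern as a preStop hook, for illustration only; the file path is hypothetical, and the sed substitution is the one quoted above, followed by a sleep while the target drains:
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        # Flip the health response so ALB health checks start failing, then keep
        # serving regular traffic while the target drains. The file path below
        # is hypothetical.
        - sed -i 's/health/nothealthy/g' /var/www/health/index.html && sleep 30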
We found a useful pattern for this, although it's fairly bespoke (using s6-overlay). I'm happy to provide a sample implementation once I test it a little while longer.
Anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.
We found a useful pattern for this, although it's fairly bespoke (using s6-overlay). I'm happy to provide a sample implementation once I test it a little while longer.
Anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.
could you share more details?
/remove-lifecycle rotten
/reopen
@leonardocaylent: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Why was this issue marked as closed if it is still happening after 5 years?
It definitely continues to happen.
The posted blog/article regarding how to do this uses an example where you use sed to modify running code in order to "kill" a server inside the pod prior to deregistration, so health checks start to fail... and it's honestly not a good solution.
We worked around it a different way, but it required non-trivial effort and a change to the way we fundamentally configure and manage pods and the application(s) inside them.
I have the following ingress defined, with the relevant bits being that it is target-type ip with a 30-second deregistration_delay.
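Since the manifest itself isn't reproduced here, the snippet below is only an illustration of those two settings, not the original resource; the names are placeholders:
# Illustration only; not the original manifest. Names are placeholders.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app                  # placeholder
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30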
When I delete a pod, either manually or as part of a rolling deploy, I see 504s returned from ALB. 504s are returned when ALB cannot form a connection to its target within 10s. Here is one such message from the ALB logs:
There's a lot going on there, but the important part is that the request is received at 2019-11-06T01:35:29.436000Z and the error is emitted 10s later at 2019-11-06T01:35:39.438256Z.
I investigated the ingress controller logs and I can see that the pod in question, 172.18.158.197:8080, is deregistered at 2019-11-06T01:35:25.891546Z, about 4 seconds before the above request is received.
My understanding is that once a target is set to "deregistering", ALB will not forward it any more requests. It's unclear to me how this request seems to be breaking that rule - any thoughts?