kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

ALB sending requests to pods after ingress controller deregisters them leading to 504s #1064

Closed · jorihardman closed this issue 4 years ago

jorihardman commented 5 years ago

I have the following Ingress defined; the relevant bits are that it uses target-type ip with a 30-second deregistration_delay.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  namespace: "default"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: REDACTED
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
    alb.ingress.kubernetes.io/healthcheck-path: /up
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=30,access_logs.s3.enabled=true,access_logs.s3.bucket=REDACTED,access_logs.s3.prefix=REDACTED
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
spec:
  rules:
    - host: REDACTED
      http:
        paths:
          - path: /*
            backend:
              serviceName: ssl-redirect
              servicePort: use-annotation
          - path: /
            backend:
              serviceName: graphql-api
              servicePort: 80

When I delete a pod, either manually or as part of a rolling deploy, I see 504s returned from the ALB. The ALB returns a 504 when it cannot establish a connection to its target within 10 seconds. Here is one such message from the ALB logs:

https 2019-11-06T01:35:39.438256Z app/bd528925-default-stagingin-1b90/1a99d4560435cd18 54.248.220.14:24888 172.18.158.197:8080 -1 -1 -1 504 - 229 303 "GET REDACTED:443/up HTTP/1.1" "Amazon-Route53-Health-Check-Service (ref af4d4b0b-e40c-4f18-a18f-12f472889080; report http://amzn.to/1vsZADi)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587 "Root=1-5dc22361-b0d66bf23f845202c2c390ec" "REDACTED" "arn:aws:acm:us-east-1:REDACTED:certificate/REDACTED" 3 2019-11-06T01:35:29.436000Z "forward" "-" "-" "172.18.158.197:8080" "-"

There's a lot going on there, but the important part is that the request is received at 2019-11-06T01:35:29.436000Z and the error is emitted 10s later at 2019-11-06T01:35:39.438256Z.

I investigated the ingress controller logs and I can see that the pod in question, 172.18.158.197:8080, is deregistered at 2019-11-06T01:35:25.891546Z, about 4 seconds before the above request is received.

I1106 01:35:25.891546       1 targets.go:95] default/staging-ingress: Removing targets from arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587: 172.18.158.197:8080

My understanding is that once a target is set to "deregistering", the ALB will not forward it any more requests. It's unclear to me how this request seems to break that rule - any thoughts?

jorihardman commented 5 years ago

I should add that I've implemented a preStop delay on my pod to allow deregistration to occur prior to pod shutdown. This seems to be functioning correctly: I can see the deregistration in the controller logs, and several seconds later I see a SIGTERM on my pod. The issue seems to be between the ALB and the ingress controller.
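
For reference, the hook is just a sleep in the container spec; a minimal sketch (the sleep value here is illustrative):

lifecycle:
  preStop:
    exec:
      # Keep the container alive and serving while the controller deregisters the target.
      command: ["sleep", "5"]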

rparsonsbb commented 5 years ago

I'm definitely experiencing the same thing. PowerShell preStop commands don't seem to like sleeping first either, so I had to really rig that up.

Is there any insight as to why terminated pods still get sent requests? It happens whether the target type is ip or instance.

nicholasgcoles commented 4 years ago

@jorihardman is this happening every single time you update a deployment? Or only sometimes?

jorihardman commented 4 years ago

@nicholasgcoles To be honest when I was investigating this issue, my sample size was only about 5 deployments, each rolling 4 pods. I saw at least 1 504 for each of these. I didn't have constant traffic for each deploy, so honestly, it wasn't a perfect experimental setup. Still, the logs seemed to me to be a smoking gun for what I would expect to be an impossible situation, so I figured it was worth raising the question. Happy to investigate further with some guidance.

mailjunze commented 4 years ago

This looks similar to #905

krunalnsoni commented 4 years ago

We are able to reproduce this 100% of the time with target-type ip. We added a 30-second sleep delay in preStop, which resolved the 502s. However, 504s happen even after the target has been removed from the ALB. We have verified that the 504s are caused by the ALB forwarding requests to a target that was already removed.

jorihardman commented 4 years ago

Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround and achieve 100% availability through deployments.

A couple of things I changed from my initial config:

# Extend the pod's shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60

# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]

Extending the sleep time allows the ALB to send a few requests even after deregistration starts without the pod rejecting them. My pod needs at most 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without the pod being killed.
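
For clarity, here's a sketch of where those two settings sit in the Deployment's pod template (the container name is a placeholder and other fields are omitted):

spec:
  template:
    spec:
      # Must cover the preStop sleep plus the app's own graceful shutdown (25s + 30s < 60s).
      terminationGracePeriodSeconds: 60
      containers:
        - name: graphql-api   # placeholder; image, ports, probes omitted
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the ALB can finish deregistering the target.
                command: ["sleep", "25"]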

I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.

jorihardman commented 4 years ago

I'm going to close this issue as it seems to be an AWS problem and not related to the controller. The workaround is to add sufficient preStop sleep to allow ALB to complete deregistration. I did 25s which might be overkill but seemed to mitigate the problem.

jorihardman commented 4 years ago

Previous testing was in our staging environment. When I shipped the changes to production, I still saw 504s from the ALB sending requests to terminating pods.

rparsonsbb commented 4 years ago

Agreed. I'm running windows containers but all of the theory is the same. Actually using a 60s sleep at this point and still getting 504s from the ALB during rollout.

            preStop:
              exec:
                command:
                - powershell.exe
                - -c
                - echo "Stopping container"; Start-Sleep 60; Stop-Website "Main"

jorihardman commented 4 years ago

Strangely, I haven't gotten a 504 during deploys for the last 48 hours, and I've done 3 or 4 deploys at peak traffic times. I did tweak the readiness checks on my pod to make them ready more quickly (lowered initialDelaySeconds). Perhaps that has allowed them to become healthy on the LB faster.
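
The readiness tweak was along these lines; a sketch with illustrative timings (the /up path and port 8080 come from the config above):

readinessProbe:
  httpGet:
    path: /up              # same endpoint the ALB health check hits
    port: 8080
  initialDelaySeconds: 5   # lowered so pods register as healthy sooner
  periodSeconds: 5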

jorihardman commented 4 years ago

Naturally, as soon as I signal success, I see another spurt of 502s. There's definitely still a deregistration race condition here. This time I got 502s from a pod terminating before it was deregistered. This seems to be pretty non-deterministic at this point: sometimes alb-ingress-controller keeps up with changing targets, and sometimes it doesn't.

mohitanchlia commented 4 years ago

Any ETA on this?

KIVagant commented 4 years ago

For those who use the setup:

...here's an example of how to avoid the problem:

https://github.com/linkerd/linkerd2/issues/3747#issuecomment-557593979

I am not an expert in ALB internals, but I feel it happens because of the cross-region (and maybe multi-tenant) nature of the service. It takes time to propagate changes to all instances, so the ALB will keep sending some traffic for a while even after a target is requested to be deregistered. So the overall "sleep" configuration for the preStop hook should be calculated based on:

aditya-pr commented 4 years ago

Have you guys tried --feature-gates=waf=false? This fixed the delay issue for us. You can see in the following graph that most of the retries are for waf [1], which seems to back up the rate-limit queue. You can collect queue-related metrics by adding MetricsBindAddress: ":8080" here. One can see the longest-running process is reaching upwards of 300 sec.

[Screenshot: graph of controller API retry metrics, 2020-01-21 16-01-26]

After disabling waf, the longest-running process was down to a max of 5 sec (it still needs a sleep of 10 sec in preStop).

I have verified that the issue exists in v1.1.5. Let me know if you need any further info or help fixing this.

[1] Ingress AWS API metrics collected with prometheus.io/port: '10254'
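
For anyone wondering where --feature-gates=waf=false goes: it's an argument to the controller container in the alb-ingress-controller Deployment, roughly like this (a sketch; the cluster name is a placeholder and other args are omitted):

containers:
  - name: alb-ingress-controller
    args:
      - --ingress-class=alb
      - --cluster-name=my-cluster     # placeholder
      - --feature-gates=waf=false     # skip WAF API calls that back up the rate-limit queue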

KIVagant commented 4 years ago

@aditya-pr, yes, and it is not enough to get a smooth rolling update.

jorihardman commented 4 years ago

@aditya-pr Amazing dashboard you've got there. Wish I had that when I was digging into this issue. To clarify, are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

aditya-pr commented 4 years ago

> are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

Yes, there is lag even after disabling waf. There are other AWS API calls which are still being retried. Adding all endpoints of an ingress to the queue whenever any endpoint updates must be making this worse. A deregistration delay of even 5 seconds means the ELB is still sending requests to a terminated pod, i.e. a 504 Gateway Timeout error.

[Screenshot: controller metrics after disabling waf, 2020-02-05 14-13-14]

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

erks commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064#issuecomment-705594111):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Elyahou commented 3 years ago

Having the same issue here. I have a preStop hook with a 60s sleep, but I still get 504s from the ALB after deleting a pod manually or scaling down the deployment. The timed-out requests are always received 8s after deregistration starts and time out 10s after that, i.e. 18s after deregistration started. At that point the pod still works and replies correctly to requests that I send to it directly...

Has anyone here found a solution to this?

rparsonsbb commented 3 years ago

@Elyahou I don't feel too bad, because this has been an ongoing issue for over a year at this point, but I moved on to kong-ingress-controller and have been very happy with it. The things that ALB handled nicely - for me, primarily the AWS certs - are now handled by adding a couple more pieces, namely cert-manager and external-dns, so I no longer worry about the ALB lifecycle conflicts.

ALB ingress seemed like an awesome idea and was dead-simple to use, but the inability to sync up the lifecycles between the pod and ALB delivery just became too much.

blhagadorn commented 3 years ago

Any update on this? This should be reopened given the comments here, IMO.

blang commented 3 years ago

Is it possible that the LB controller removes the security group entries while the target is still deregistering? I'll investigate this further over the next couple of weeks.

kevin-lindsay-1 commented 2 years ago

I think I might be currently encountering this issue. I'll have to do further tests to verify, but it seems like this issue should be reopened.

123BLiN commented 1 year ago

Still an issue, by the way. For those who come here after me: the solution AWS suggests is to rely on health checks rather than target deregistration (https://aws.amazon.com/blogs/containers/how-to-rapidly-scale-your-application-with-alb-on-eks-without-losing-traffic/). The pod should simply start failing its health checks on SIGTERM while still serving traffic for some time. In that case a preStop hook is not required, which in theory should also improve rolling-update time a bit.
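
In Kubernetes/ALB terms the idea looks roughly like this; a sketch, where the /healthz path and the SIGTERM behaviour are assumptions about the application itself, not something the controller provides:

# Ingress annotations: point the target-group health check at an endpoint the app controls.
alb.ingress.kubernetes.io/healthcheck-path: /healthz
alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5'
alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'

# Pod template: no preStop sleep; on SIGTERM the app flips /healthz to non-2xx
# while continuing to serve regular traffic until in-flight requests drain.
terminationGracePeriodSeconds: 60   # must cover the unhealthy-threshold window plus draining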

davidroth commented 12 months ago

@123BLiN According to the sample from AWS, a preStop hook is still required.

I am curious if the "health-check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.

armenr commented 12 months ago

This is an issue for us too. A BIG one.

...It's incredible to me that this has been allowed to go on this long without a concrete fix or resolution 😂 . Talk about "Bias for Action" and "Ownership" 🎃

AWS philosophizes and writes literature on the 6 Well-Architected Pillars.

Also releases a Load Balancer Ingress Controller that regularly throws 50xs, because it doesn't work as advertised/as expected, even though the customer using their tool is doing everything in exactly the way they're being told to.

...And then lets the issue just kinda linger and languish for years...and then tells the customer that the "fix" is just a bunch of workarounds that impact the customer's kubernetes cluster in other ways...like delayed pod termination, which then impacts the scale-speed and cost characteristics of the cluster. 👍🏼

dusansusic commented 11 months ago

> @123BLiN According to the sample from AWS, a preStop hook is still required.
>
> I am curious if the "health-check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.

I guess this isn't a real health check but rather flipping the healthy response to unhealthy so the health check starts to fail.

sed -i 's/health/nothealthy/g'

I believe the remaining 50x errors are caused by this, as he wrote in the sentence just before the Conclusion chapter.

armenr commented 11 months ago

We found a useful pattern for this, although it's fairly bespoke (using s6-overlay). I'm happy to provide a sample implementation once I test it a little while longer.

Anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.

debu99 commented 3 weeks ago

> We found a useful pattern for this, although it's fairly bespoke (using s6-overlay). I'm happy to provide a sample implementation once I test it a little while longer.
>
> Anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.

Could you share more details?

leonardocaylent commented 5 days ago

/remove-lifecycle rotten

leonardocaylent commented 5 days ago

/reopen

k8s-ci-robot commented 5 days ago

@leonardocaylent: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1064#issuecomment-2456351620):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

leonardocaylent commented 5 days ago

Why was this issue marked as closed if it is still happening after 5 years?

armenr commented 4 days ago

It definitely continues to happen.

The posted blog/article regarding how to do this uses an example where you use sed to modify running code in order to "kill" a server inside the pod, prior to deregistration, so health checks start to fail...and it's honestly not a good solution.

We worked around it a different way, but it required non-trivial effort and a change to the way we fundamentally configure and manage pods and the application(s) inside them.