kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

pod termination might cause dropped connections #2366

Open nirnanaaa opened 3 years ago

nirnanaaa commented 3 years ago

Describe the bug When a pod is Terminating, it receives a SIGTERM signal asking it to finish up its work, and after that the kubelet proceeds with deleting the pod. At the same time that the pod starts terminating, the aws-load-balancer-controller receives the updated object, causing it to start removing the pod from the target group and to initiate draining.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the TG - are decoupled from one another, so the SIGTERM might have been handled before, or at the same time as, the target in the target group starts draining. As a result, the pod might be unavailable before the target group has even started its own draining process. This can result in dropped connections, as the LB is still trying to send requests to the already shut-down pod; the LB will in turn reply with 5xx responses.

Steps to reproduce

Expected outcome

Environment

All our ingresses have

Additional Context:

We've been relying on Pod-Graceful-Drain, which unfortunately forks this controller and intercepts and breaks k8s controller internals.

You can achieve a pretty good result as well by using a sleep as a preStop hook (see the sketch below), but that's not reliable at all - it's just a guessing game whether your traffic will have drained after X seconds - and it requires statically linked binaries to be mounted into each container, or the existence of sleep in the operating system.
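As a rough illustration of that workaround, here is a minimal pod-spec fragment, assuming the container image provides a sleep binary; the container name, image, and durations are placeholders, and terminationGracePeriodSeconds must exceed the preStop sleep so the kubelet doesn't kill the container mid-drain:

spec:
  terminationGracePeriodSeconds: 90      # must be longer than the preStop sleep
  containers:
    - name: app                          # hypothetical container name
      image: example/app:latest          # hypothetical image; must provide a sleep binary
      lifecycle:
        preStop:
          exec:
            # keep the pod serving while the load balancer deregisters and drains the target;
            # the duration is only a guess, as noted above
            command: ["sleep", "60"]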

I believe this is not only an issue to this controller, but to k8s in general. So any hints and already existing tickets would be very welcome.

johngmyers commented 1 year ago

@Ghilteras I'm saying the terminating server pod is capable of shutting down the long-lived connection.

Ghilteras commented 1 year ago

> @Ghilteras I'm saying the terminating server pod is capable of shutting down the long-lived connection.

I don't see any way for this ingress controller (or others, as the OP said this is not an issue specific to this controller) to be capable of shutting down long-lived connections. If you are aware of a way of doing this, please share.

johngmyers commented 1 year ago

I wasn't talking about this ingress controller shutting down the connection.

Ghilteras commented 1 year ago

The gRPC server is already sending GOAWAYs; the issue is that when the clients reconnect they just end up going back to the Terminating pod, which should not be getting new traffic through the ingress. But it does, and that's when we see the server throwing hundreds of "connection reset by peer" errors.

johngmyers commented 1 year ago

@Ghilteras and thus the problem is with new connections coming in from the load balancer, as I previously stated.

Roberdvs commented 1 year ago

For anyone still struggling with the interaction between NLB and ingress-nginx-controller, this is the setup that seems to better mitigate the issues for us for HTTP traffic.

ingress-nginx chart values snippet:

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=0"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"

# Maintain the existing number of replicas at all times
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
# Add a pause to allow time for the pod to be registered in the AWS NLB target group before proceeding with the next one
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
# https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
minReadySeconds: 180
# Add sleep on preStop to allow for graceful shutdown with AWS NLB
# https://github.com/kubernetes/ingress-nginx/issues/6928
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]
# -- Maximum unavailable pods set in PodDisruptionBudget.
maxUnavailable: 1

Also, externalTrafficPolicy on the service of type LoadBalancer is set to Cluster.
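In chart-values form that corresponds to something like the snippet below; the externalTrafficPolicy key is assumed to sit alongside the annotations in the same service block of the ingress-nginx values, so double-check it against the chart version you deploy:

service:
  # Cluster: the NLB health-checks every node and kube-proxy forwards traffic to any
  # ready ingress-nginx pod (client IPs are carried via the proxy protocol annotation above)
  externalTrafficPolicy: "Cluster"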

With this setup rolling updates are reliable enough, albeit really slow, mostly due to having to account for https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724

Hope it helps someone, and if anyone detects any issues, improvements, or recommendations, it would be good to know.

P.S: have yet to play around with pod readiness gates

Other issues referenced:

* [Zero downtime upgrade kubernetes/ingress-nginx#6928](https://github.com/kubernetes/ingress-nginx/issues/6928)

* [Document zero-downtime deployment for IP targets #2131](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2131)

Sep. 2024 update:

According to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-2359354913, NLBs have had behind-the-scenes improvements to target registration, so the minReadySeconds value can probably be reduced now to make rollouts faster.

venkatamutyala commented 1 year ago

Thanks @Roberdvs, this is huge. I just came across the problem in my cluster while testing an upgrade and I was able to fix it within an hour by referencing your work. I have been running statuscake.com with 30-second checks to help verify your fix works as expected. Prior to the change I had a minute or two of downtime; after the change, I appear to be at a real 100% uptime. For anyone else coming across this, I implemented the following. I am using the default NLB deployment method, given I don't have the AWS LB Controller installed. Please see @Roberdvs's comment above if you are using the LB Controller.

        controller:
          service:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
            type: "LoadBalancer"
            externalTrafficPolicy: "Local"
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1788923154
          updateStrategy:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
          # Add a pause to make time for the pod to be registered in the AWS NLB target group before proceeding with the next
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
          # https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
          minReadySeconds: 180
          # Add sleep on preStop to allow for graceful shutdown with AWS NLB
          # https://github.com/kubernetes/ingress-nginx/issues/6928
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]

Note: I removed the maxUnavailable setting shown after the lifecycle section in @Roberdvs's example, as it appeared to already be a default in my Helm chart.

My versions: chart repo: https://kubernetes.github.io/ingress-nginx, chart name: ingress-nginx, chart version: v4.8.3

michaelsaah commented 11 months ago

was thinking about this again... has anyone thought about modifying the ALB controller to set a finalizer on the pods within the TG, and using that as a hook to prevent termination until it's been successfully removed from the TG? this would hinge on the SIGTERM not getting sent until after all finalizers have been removed, which I would expect to be true but would need to verify.

mbyio commented 11 months ago

@michaelsaah That sounds like an interesting approach to me. I think it is worth someone trying it out. As you said, it really just depends on if k8s will wait to send the SIGTERM until finalizers are removed.

ibalat commented 7 months ago

Any update? I have the same problem and I think it can also be related to the autoscaler. Please can you take a look at it: https://github.com/kubernetes/autoscaler/issues/6679

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

* After 90d of inactivity, lifecycle/stale is applied
* After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
* After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

* Mark this issue as fresh with /remove-lifecycle stale
* Close this issue with /close
* Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

johndistasio commented 4 months ago

/remove-lifecycle stale

Roberdvs commented 2 months ago

For anyone still struggling with the interaction between NLB and ingress-nginx-controller, this is the setup that seems to better mitigate the issues for us for HTTP traffic.

ingress-nginx chart values snippet:

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=0"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"

# Maintain the existing number of replicas at all times
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
# Add a pause to allow time for the pod to be registered in the AWS NLB target group before proceeding with the next one
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
# https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
minReadySeconds: 180
# Add sleep on preStop to allow for graceful shutdown with AWS NLB
# https://github.com/kubernetes/ingress-nginx/issues/6928
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]
# -- Maximum unavailable pods set in PodDisruptionBudget.
maxUnavailable: 1

Also, externalTrafficPolicy on the service of type LoadBalancer is set to Cluster.

With this setup rolling updates are reliable enough, albeit really slow, mostly due to having to account for https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724

Hope it helps someone, and if anyone detects any issues, improvements, or recommendations, it would be good to know.

P.S: have yet to play around with pod readiness gates

Other issues referenced:

* [Zero downtime upgrade kubernetes/ingress-nginx#6928](https://github.com/kubernetes/ingress-nginx/issues/6928)

* [Document zero-downtime deployment for IP targets #2131](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2131)

According to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-2359354913, NLBs have had behind-the-scenes improvements to target registration, so the minReadySeconds value can probably be reduced now to make rollouts faster.

msvticket commented 2 months ago

> was thinking about this again... has anyone thought about modifying the ALB controller to set a finalizer on the pods within the TG, and using that as a hook to prevent termination until it's been successfully removed from the TG? this would hinge on the SIGTERM not getting sent until after all finalizers have been removed, which I would expect to be true but would need to verify.

I tried to find information about this, and the clearest statement I found is in kubernetes/kubernetes#73584. That issue was closed with the comment:

> This is working as designed. Finalizers block deletion of the pod API object, not termination of the containers. You can use graceperiodseconds and prestophooks to give your containers time to shut down gracefully on termination.

sftim commented 2 months ago

I think ideally we look to enrich Gateway with a feature here. The Service and Pod APIs are stable and making changes there is more costly; Gateway (or perhaps a new supporting API within the overall gateway family) has the flexibility to help us adapt.

youwalther65 commented 2 months ago

For NLB target registration see my comment here. I would suggest leaving the deployment's minReadySeconds at the default value of 0 and using the more robust approach of a pod readiness gate instead. Even though target registration is much faster now, other things like ELB API throttling can make registration and health checks slower under certain conditions (imagine you are running the AWS LB Controller with a couple of load balancers and dozens of target groups, each target group has lots of targets, ELB v2 API calls get throttled, and the controller backs off and retries via the AWS Go SDK). So relying on pod readiness is safe, because it will "sync" with the ALB/NLB.
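For anyone who hasn't tried readiness gates yet, they are enabled per namespace; a minimal sketch is below (the namespace name is a placeholder; the label follows the controller's pod readiness gate documentation):

apiVersion: v1
kind: Namespace
metadata:
  name: my-app        # hypothetical namespace
  labels:
    # the AWS Load Balancer Controller's webhook injects a readiness gate
    # (target-health.elbv2.k8s.aws/...) into pods that back an ALB/NLB target group
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

With the gate in place, a rolling update with maxUnavailable: 0 will not consider a new pod Ready (and so will not start terminating the old one) until its target reports healthy in the target group, which removes the need to guess a minReadySeconds value.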