kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

pod termination might cause dropped connections #2366

Open nirnanaaa opened 2 years ago

nirnanaaa commented 2 years ago

Describe the bug When a pod is Terminating, it receives a SIGTERM signal asking it to finish up its work, after which the kubelet proceeds with deleting the pod. At the same time that the pod starts terminating, the aws-load-balancer-controller receives the updated object and starts removing the pod from the target group, initiating draining.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the target group - are decoupled from one another, so the SIGTERM might be handled before, or at the same time as, the target in the target group starts draining. As a result, the pod might already be unavailable before the target group has even started its own draining process. This can result in dropped connections, as the LB keeps trying to send requests to a pod that has already shut down cleanly. The LB will in turn reply with 5xx responses.

Steps to reproduce

Expected outcome

Environment

All our ingresses have

Additional Context:

We've been relying on Pod-Graceful-Drain, which unfortunately forks this controller and intercepts and breaks k8s controller internals.

You can achieve a pretty good result using a sleep as a preStop hook, but that's not reliable at all - it's just a guessing game whether your traffic will be drained after X seconds - and it requires statically linked binaries to be mounted into each container, or the existence of sleep in the container's operating system.
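For reference, a minimal sketch of that sleep-based preStop workaround (the image name and the 60s/90s numbers are illustrative, and it assumes a `sleep` binary exists in the container):

```yaml
# Illustrative only: delay SIGTERM by sleeping in preStop so the ALB/NLB has
# time to start draining the target before the app begins shutting down.
# Assumes the container image provides a `sleep` binary.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  terminationGracePeriodSeconds: 90   # must exceed the preStop sleep
  containers:
    - name: app
      image: example/app:latest       # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "60"]  # a guess at how long deregistration takes
```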

I believe this is not only an issue to this controller, but to k8s in general. So any hints and already existing tickets would be very welcome.

M00nF1sh commented 2 years ago

@nirnanaaa I'm not aware of any general solution that is better than a preStop hook. After all, the connection is controlled by both ends of the application (client/server) rather than by the LB. There are also ideas like propagating the LB's connection-draining status back to the pod/controller, but that is not widely supported and not reliable either (a finished draining state doesn't mean all existing connections are closed).

In my mind, it can be handled in a protocol-specific way by the client/server if you have control over the server's code, e.g. if the server receives a SIGTERM, it can

Once the above is implemented, we only need to set a large terminationGracePeriodSeconds.

I have a sample app implementing the above: https://github.com/M00nF1sh/ReInvent2019CON310R/blob/master/src/server.py#L24

nirnanaaa commented 2 years ago

Hey @M00nF1sh, thanks for your input. I fear this is exactly what I thought was supposed to fix it. But thinking about it: even if the server does all of this, what prevents the LB from still sending traffic to the target - even for a brief period? Unfortunately we're hitting this exact scenario quite often.

So

Am I the only one who sees this as a problem? A preStop hook is not very reliable IMO, as it's just eyeballing the timing issue - much like setTimeout doesn't fix a problem in JavaScript but only makes it less likely.

ejholmes commented 2 years ago

We also ran into this issue. The preStop hack generally works, but it's still a hack and it still seems to fail to synchronize the "deregistration before SIGTERM" properly. Even with this in place, we've seen intermittent cases where the container still seems to get a SIGTERM before the target is deregistered (possibly by the DeregisterTargets API call getting throttled).

This is a pretty serious issue for anyone trying to do zero-downtime deploys on top of Kubernetes with ALBs.

M00nF1sh commented 2 years ago

@nirnanaaa I see. The remaining gap is that the pod itself finishes draining too fast, before the LB has deregistered it.

For this case, a sleep is indeed needed, since we don't have any information on when the LB actually stops sending traffic to the targets (even when the targets show as draining after the controller has made the deregisterTargets call, it still takes time for the change to actually propagate to ELB's dataplane). If ELB had support for server-side retry it could handle this nicely, but I'm not aware of when that will be done.

nirnanaaa commented 2 years ago

I was just wondering whether this should be solved at a lower level - that's why I also opened an issue on k/k. This is not limited to this controller (although in-cluster LBs are less likely to have this issue, it's still present).

And you're absolutely right, client-side retries would probably solve this. There are even some protocols like gRPC that could work around this problem, but the truth is that we cannot really control what's being run on the cluster itself - hence my doubts about the use of sleep.

I thought about maybe having a more sophisticated preStop binary, statically linked and mounted through a sidecar, that could delay the signal until the target has been removed from the LB, but I fear that's also a hacky workaround that makes things even worse (especially considering API rate limits).

nirnanaaa commented 2 years ago

[Diagram: Shutdown behavior]

I've also drawn up a picture detailing the issue further - the big orange box is where things happen decoupled from one another.

BrianKopp commented 2 years ago

This is indeed a problem that exists in Kubernetes generally, not just with ALBs. We ran into this a lot using classic load balancers in proxy mode in front of nginx ingress with externalTrafficPolicy: Local to get real client IPs.

Using the v2 ELBs with IP target groups is a big improvement over the externalTrafficPolicy: Local mechanism, which requires health-check failures to get the node out of the instance list.

That being said, this can still happen. Sleeps in preStop hooks are really the only game in town. I'm not aware of any kind of preStop gate like the readiness gate this controller can inject.

Is there a community binary that does this already OOTB? If not, that'd be a good little project.

Another alternative is to have a reverse proxy like ingress-nginx be the target of your ALB ingress instead of your application. Then the container lifecycle events for your ALB targets will be much, much less frequent.

nirnanaaa commented 2 years ago

@BrianKopp we've thought about running a sidecar which provides a statically linked binary on a shared in-memory volume that the main container could use as its preStop hook. I just couldn't come up with a generic way to check "is this pod still in any target group" without either running into API throttles (and potentially blocking the main operation of the aws-lb-controller) or completely breaking single-responsibility principles.

BrianKopp commented 2 years ago

IIRC, a sidecar for this sort of thing is a trap, since a preStop hook delays the SIGTERM for its own container only, not for all containers. Your HTTP container would get its SIGTERM immediately.

I've actually begun thinking about starting a project to address this. My thought is to have an HTTP service inside the cluster, say at drain-waiter. You could call http://drain-waiter/drain-delay?ip=1.2.3.4&ingress=your-ingress-name&namespace=your-namespace&max-wait=60 and have a preStop hook curl that URL. We can get the hostname from the ingress object and therefore filter the target groups very easily.

What do you think? Is this worth making a thing?
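To make the idea concrete, here's a sketch of what a pod using such a (hypothetical, not-yet-existing) drain-waiter service might look like. It assumes `curl` is available in the image and that the pod IP is injected via the downward API:

```yaml
# Hypothetical sketch: block in preStop until the proposed drain-waiter service
# reports this pod's IP has left the target group, or max-wait expires.
containers:
  - name: app
    image: example/app:latest        # placeholder image
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP  # downward API: this pod's IP
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - curl -s "http://drain-waiter/drain-delay?ip=${POD_IP}&ingress=your-ingress-name&namespace=your-namespace&max-wait=60"
```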

nirnanaaa commented 2 years ago

Oh, I was not talking about processing the SIGTERM in the sidecar. If you read closely, I only spoke about providing the binary ;)

For our use case even a simple HTTP query will not work if not done through some statically linked binary, since we cannot even rely on libc to be present.

BrianKopp commented 2 years ago

Ok, I see what you were suggesting. If the requirement is that we cannot place any runtime demands on the HTTP container, then yes, we would need some kind of sidecar to be injected to provide that functionality, along with a mutating webhook to add the container and a preStop hook in case one wasn't already present. That part seems like a bit of a minefield.

I was thinking about putting together something that would work for now until such a solution would be possible, if at all possible.

If one didn't want to place any dependency requirements on the main HTTP container (e.g. curl), a preStop hook could wait for a signal over a shared volume from a lightweight curling sidecar.

sftim commented 2 years ago

I think this is best addressed by extending the Pod API. Yes, that's not a trivial change. However, the Pod API is what the kubelet pays attention to. If you want the cluster to hold off sending SIGTERM then there needs to be a way in the API to explain why that's happening. This code doesn't have the levers to pull to make things change how we'd need.

Another option is to redefine SIGTERM to mean “get ready to shutdown but keep serving”. I don't think that's a helpful interpretation of SIGTERM though.

michaelsaah commented 2 years ago

question for those who've dealt with this: would you agree that a correct configuration to handle this issue looks like deregistrationDelay < sleep duration < terminationGracePeriod?

ejholmes commented 2 years ago

I think that's correct. This is generally what we've used in helm charts:

```yaml
# ingress
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds={{ .Values.deregistrationDelay }}
# pod
terminationGracePeriodSeconds: {{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod 30 }}
command: ["sleep", "{{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod }}"]
```

Where deregistrationGracePeriod provides a buffer for DeregisterTargets getting rate limited. We still have issues with this, but the buffer period does help.
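For illustration, here's one way those values might render with concrete numbers, keeping deregistrationDelay < sleep < terminationGracePeriodSeconds as discussed above (placing the sleep inside a lifecycle.preStop hook is an assumption about how the templated command is used):

```yaml
# Illustrative rendering with deregistrationDelay=30 and deregistrationGracePeriod=60.
# ingress annotation
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
# pod spec
terminationGracePeriodSeconds: 120   # 30 + 60 + 30
lifecycle:
  preStop:
    exec:
      command: ["sleep", "90"]       # 30 + 60
```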

mbyio commented 2 years ago

For NLBs in IP mode (everything should be similar for other LBs and modes):

ejholmes commented 2 years ago

deregistration takes at least 2 minutes regardless of the deregistration delay setting

That's an interesting finding. Is there some documentation you can point to that highlights this (assuming it's AWS side)?

mbyio commented 2 years ago

I believe it is all on AWS' side, yes. No, I couldn't find any documentation about this. With all due respect to AWS engineers, their ELB documentation is horrible and missing a lot of important information. I found this by writing some programs that use raw TCP connections (so I could monitor every aspect) and manually triggering deregistration in various ELB configurations to record the timing. It consistently took 2 minutes.

BrianKopp commented 2 years ago

In your testing, did the target group show the target IP as draining while it was still receiving packets? Did it receive packets after the target dropped out of the target group completely? I've got a project I'm working on that adds a delay in the preStop hook to wait until the IP is out of the target group. Would that be helpful here?

mbyio commented 2 years ago

Yes, it correctly showed the IP in the target list as draining as soon as I requested deregistration. And when it was removed from the target list, it also stopped receiving new connections. So if a tool monitored the target list and used that to delay sending SIGTERM to a pod until the pod is out of the target list (e.g. using preStop), that would solve the problem. It would be the opposite of a readiness gate. I think one difficulty is that AWS has some restrictive rate limits, so depending on your scale I don't think you can just have every pod hitting the API in its preStop hook.
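As a sketch of that idea (not something the controller provides), a preStop hook could poll the ELB API directly, for example with the AWS CLI's target-deregistered waiter. This assumes the AWS CLI plus IAM credentials with elasticloadbalancing:DescribeTargetHealth are available in the container and that the target group ARN is injected via an environment variable - and, as noted above, per-pod API calls may hit rate limits at scale:

```yaml
# Sketch: wait in preStop until this pod's IP is no longer registered in the
# target group, falling back to a plain sleep if the CLI call fails.
# Assumes AWS CLI + IAM permissions in the image; TG_ARN and POD_IP come from env.
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        - aws elbv2 wait target-deregistered --target-group-arn "$TG_ARN" --targets "Id=$POD_IP" || sleep 60
```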

dcarley commented 2 years ago

We were told this by AWS support:

Similarly, when you deregister a target from your Network Load Balancer, it is expected to take 90-180 seconds to process the requested deregistration, after which it will no longer receive new connections. During this time the Elastic Load Balancing API will report the target in 'draining' state. The target will continue to receive new connections until the deregistration processing has completed. At the end of the configured deregistration delay, the target will not be included in the describe-target-health response for the Target Group, and will return 'unused' with reason 'Target.NotRegistered' when querying for the specific target.

Sleeping for 180s (3m) still hasn't been reliable for us though, so we're currently at 240s (4m) 😭

nirnanaaa commented 2 years ago

We've actually started sharding our services across multiple AWS accounts/EKS clusters, just to lower the number of pending/throttled API requests and to increase the speed at which each controller can operate.

But then again: the probability of this error happening is still not zero. On a new AMI release we frequently experience dropped packets (even with X seconds of sleep as preStop).

mbyio commented 2 years ago

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

mtparet commented 2 years ago

If you are using ingress-nginx you can define shutdown-grace-period https://github.com/kubernetes/ingress-nginx/issues/6928#issuecomment-1143408093

MatthiasWinzeler commented 2 years ago

We wondered whether - according to the comments in this thread - it really takes up to minutes for the ALB/NLB to stop sending new requests, so we reached out to AWS support for an official solution. Outcome:

We were also told that the issue (ALB/NLB sending requests to draining targets) should be fixed in the future, so we expect to eventually be able to decrease the preStop sleep to a lower number.

Indigenuity commented 2 years ago

Can confirm that even after targets are marked as draining in an ALB target group, they still receive traffic. In a test where I sent a constant load to a 2-pod hello-world web server with PodDisruptionBudget minAvailable=1, the timeline I saw was:

So the ALB continues to send traffic as normal (not even a reduced amount of traffic AFAICT) to a target after it enters the draining state, but ceases that traffic before Deregistration completes. So there's some internal event that we don't get to see.

This is the timeline for requests that take <10ms to process; obviously things are different for longer-lived requests.

eric-kinsa commented 2 years ago

That test doesn't account for existing connections that need to finish during the draining process. If a request takes several seconds (let's say a visual query app) and happens during a deployment, there's a high chance the request is going to be killed when the pod it's running on is killed, instead of being allowed to finish cleanly.

gsusI commented 2 years ago

The same issue happens when using a Service of type LoadBalancer, which generates a Network Load Balancer (NLB).

JamieSinn commented 2 years ago

We've also seen this happen as a result of an EC2 Spot based node being killed/taken. The pod gets removed, and the ALB (via ingress) fails to remove the target in time.

sftim commented 2 years ago

We've also seen this happen as a result of an EC2 Spot based node being killed/taken. The pod gets removed, and the ALB (via ingress) fails to remove the target in time.

Is that Pod removal happening due to node problem detector / EC2 node termination handler, whilst the kubelet is still alive?

I'm wondering in particular about what the timeline is for that removal and what finalizers might or might not be set on the Pod during its deletion.

JamieSinn commented 2 years ago

Is that Pod removal happening due to node problem detector / EC2 node termination handler, whilst the kubelet is still alive?

Technically yes - the node is killed because it is marked unhealthy - but the reason it's marked unhealthy is that AWS is taking the instance back into their pool. BidEvictedEvent is the exact message for why this happens, and it triggers a 2-minute warning for the removal of the node as a whole. Instances start to shut down, and that removal then either doesn't cascade from EKS -> ALB at all, or is delayed.

Timeline-wise, I'll see if I can find some logs and put a timeline together, but we seem to get about 5-10s of 502 errors because the target is gone but the ALB doesn't know / hasn't removed it yet.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dan-ih commented 1 year ago

/remove-lifecycle stale

raress96 commented 1 year ago

Can confirm that even after targets are marked as draining in an ALB target group, they still receive traffic. In a test where I sent a constant load to a 2-pod hello-world web server with PodDisruptionBudget minAvailable=1, the timeline I saw was:

  • 0 seconds: kubectl drain
  • 1 second: pod 1 enters Terminating, receives SIGTERM
  • 2 seconds: target marked as draining in target group and Deregistration starts
  • 2 seconds: replacement pod 1 enters ContainerCreating
  • 2-15 seconds: traffic sent to ALB receives 504 and 502 responses mixed in with 200
  • 4 seconds: pod 1 SIGKILL
  • 17 seconds: replacement pod 1 starts and immediately receives traffic, starts returning 200
  • 19 seconds: traffic finally stops going to the terminated pod, responses are all 200
  • 21 seconds: pod 2 starts following the same timeline as pod 1, receives SIGTERM followed by SIGKILL
  • 21-40 seconds: traffic sent to ALB receives 504 and 502 responses mixed in with 200
  • 41 seconds: responses once again go back to all 200
  • 193 seconds: old pod 1 Deregistration finishes
  • 221 seconds: old pod 2 Deregistration finishes

So the ALB continues to send traffic as normal (not even a reduced amount of traffic AFAICT) to a target after it enters the draining state, but ceases that traffic before Deregistration completes. So there's some internal event that we don't get to see.

This is the timeline for requests that take <10ms to process; obviously things are different for longer-lived requests.

I seem to have almost exactly this behaviour.

When a Pod is in Terminating state it still receives traffic, since I see requests failing with errors while the target is in draining state in the AWS ALB... After 10-30 seconds it stops receiving traffic and everything is fine, even though it stays in Terminating for another 30 seconds.

I have put a sleep 60 in the preStop lifecycle (for both the nginx and php-fpm containers inside the deployment's pod) and the sleep seems to not do anything; requests are still failing...

And another behaviour I saw, which is really strange, is that during a rolling update there was a point where my ALB had one pod in initial state and 2 pods in draining, but no pod in Healthy state, which shouldn't happen at all. It also seems that the controller doesn't take the startupProbe into consideration, adding a Pod to the ALB before that probe has finished.

sftim commented 1 year ago

Startup probes allow for a longer (or shorter) startup time before regular liveness probes kick in. If you want to define a probe that must pass before an endpoint is treated as healthy for traffic to come in, define a readiness probe.
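A minimal readiness probe sketch for comparison (path, port, and timings are placeholders):

```yaml
# Placeholder readiness probe: the endpoint only receives traffic once this passes.
readinessProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```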

Also see https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/

There are some relevant features that became stable in the v1.26 release, and another that moved to beta.

cayla commented 1 year ago

And another behaviour that I saw which is really strange is that in a rolling update, there is a point where my ALB had one pod in initial state, 2 pods in draining but no pod in Healthy state which shouldn't happen at all. And it seems that the controller also doesn't take into consideration the startupProbe, adding a Pod to the ALB before that is finished.

I also recommend reading up on pod readiness gates
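For anyone following along: as I understand the current controller versions, readiness gate injection is opt-in per namespace via a label (check the controller's pod readiness gate docs for your version; the label below is an assumption about the current mechanism):

```yaml
# Opt the namespace in to pod readiness gate injection by the controller.
apiVersion: v1
kind: Namespace
metadata:
  name: your-namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```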

rofreytag commented 1 year ago

during deregistration, the pod may still receive new requests/connections even though it is in the process of deregistering

I think this is the main culprit. I don't understand how, at all, the NLB/ALB would forward new connections to a draining target. I am currently moving from a Classic LB to an NLB in front of ingress-nginx, and in the process I was trying externalTrafficPolicy: Local on the ingress-controller Service. This setup results in dropped connections during an nginx upgrade - something I did not experience with CLB + externalTrafficPolicy: Cluster.

Now I also need to improve single-AZ disaster recovery, which means externalTrafficPolicy: Local is the better behavior.

AWS Engineers: I am expecting a load balancer that is "draining" to not forward new connections to a target. Is this at all possible?

EDIT: thanks @cayla for your hint about pod readiness gates - I will try them out!

vchirikov commented 1 year ago

With pod readiness gates & NLB I still see dropped connections during draining; you have to add a preStop hook anyway. It would be great if the aws-load-balancer-controller could provide an endpoint to call in a preStop hook, but as a workaround you can use sleep.

mattjamesaus commented 1 year ago

We came across this issue and found ours to be a combination of the Cluster Autoscaler (CA) and the Load Balancer Controller working in tandem when externalTrafficPolicy was set to Cluster. We'd have nodes getting ready to be scaled in (being pretty much empty except for kube-proxy), then the CA would kick in, taint the node and terminate it well before it was deregistered by the LB controller.

This resulted in in-flight requests being dropped between kube-proxy and another node (which ran our HAProxy pods). It appears that the latest CA introduces a --node-delete-delay-after-taint flag which waits a number of seconds before killing the node. I'm about to test this, but my assumption is that if this delay is set to just longer than your deregistration timeout, the node will be kept around long enough to be pulled out of the target group and gracefully terminate all the in-flight connections going through kube-proxy.
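For anyone wanting to try this, a hedged sketch of where that flag would go (the image tag and the 120s value are illustrative, and flag availability depends on your cluster-autoscaler version):

```yaml
# Illustrative cluster-autoscaler container args: keep tainted nodes around
# longer than the target group deregistration delay (values are placeholders).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3  # placeholder tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-delete-delay-after-taint=120s
```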

This would still require adequate pod termination for whatever service gets the traffic but most handle that perfectly well already.

It seems like this is a pretty prevalent problem for what I assume is a very popular setup - if this does fix the issue, we really should call out in the documentation that administrators need to be cognizant of the cluster autoscaler and LB controller potentially dropping requests in externalTrafficPolicy: Cluster mode.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

johndistasio commented 1 year ago

/remove-lifecycle stale

Ghilteras commented 1 year ago

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

Doesn't seem to be working though, as the pod dies immediately instead of shutting down gracefully; we had to remove the preStop hook and use a 30s sleep instead to make all upstream gRPC errors go away.

mbyio commented 1 year ago

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

Doesn't seem to be working though as the pod dies immediately instead of gracefully shutdown, we had to remove the preStop hook and sleep 30s instead to make all upstream GRPC errors go away

This automatic preStop hook was a suggestion for what they should implement; it isn't implemented yet.

mbyiounderdog commented 1 year ago

If you have a preStop hook that waits for shutdown, then it was not added by the controller.

Ghilteras commented 1 year ago

When I say wait for the shutdown, I just mean a simple sleep, and sure, there are fewer errors using the sleep preStop, but as the OP said it's a hack. There's no way to make the controller gracefully terminate, and the OP is also correct in saying that this is not a specific aws-load-balancer-controller issue, because we have exactly the same problem with the nginx ingress controller deployed on bare metal.

johngmyers commented 1 year ago

I can't think of a good way for LBC to communicate to a preStop that the pod has been deregistered. The pod would have to be in a ServiceAccount that has RBAC to watch some resource that LBC would update or there'd have to be some network connection opened up between LBC and the pod.

Ghilteras commented 1 year ago

I think the issue here is just with long-lived connections, because the controller does not sever them when it goes into draining mode, so clients keep reusing them even after the sleep/grace period is over - hence the errors.

I don't know if there is a way to force the controller to sever long-lived connections when entering draining mode, so that when clients reconnect to the ingress, the controller would not pick pods in Terminating state and the newly re-established connections would be healthy.

johngmyers commented 1 year ago

@Ghilteras no, the pod knows about long-lived connections, is perfectly capable of closing them itself, and knows when they have gone away. The problem is new connections that keep coming in from the load balancer. The pod does not know when the load balancer has finished deregistering it and thus when there will no longer be any new incoming connections.

mbyio commented 1 year ago

The pod would have to be in a ServiceAccount that has RBAC to watch some resource that LBC would update or there'd have to be some network connection opened up between LBC and the pod.

I'm no longer working at the company where I needed this day-to-day. However, we would have been willing to deal with a lot of setup, including service accounts etc, in order to have an automated fix for this bug. It was a major pain point. As you can see in these comments, it is also hard for many people to understand, and therefore, hard to work around.

Ghilteras commented 1 year ago

@Ghilteras no, the pod knows about long-lived connections, is perfectly capable of closing them itself, and knows when they have gone away. The problem is new connections that keep coming in from the load balancer. The pod does not know when the load balancer has finished deregistering it and thus there will no longer be any more new incoming connections.

How do you close and re-open a gRPC connection from the client to force it to dial another, non-Terminating pod? I don't see anything like this in the gRPC libraries. If your client gets a GOAWAY, all it can do is retry, which will reuse the existing long-lived connection to the Terminating pod, which means it will keep failing. There is no way to handle this automatically on the client without wrapping it in custom logic that re-establishes the connection on GOAWAY/draining errors from servers. All this because NGINX cannot close the long-lived connection on its side when it's gracefully shutting down? I think I'm missing something here.

mbyio commented 1 year ago

@Ghilteras You may be looking at the wrong Github issue. This issue is about a case where AWS load balancers can send new requests to terminating pods that have already stopped accepting new requests. NGINX is not involved.

Edit - oh I see, you mentioned the nginx ingress controller having the same problem. This issue is about connections from a load balancer to server pods. If you're using gRPC to connect to a load balancer (NGINX or an ALB), that load balancer is intercepting the connections and doing its own load balancing, so there isn't really anything you can do on the client side to fix this.