linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Do we handle shutdown incorrectly? #10379

Open olix0r opened 1 year ago

olix0r commented 1 year ago

Someone recently brought this thread to my attention. Tim argues that applications must continue to handle new connections/requests after receiving SIGTERM. In short:

  1. When kubelet decides to terminate a pod, it sends a SIGTERM to its containers.
  2. These processes must continue to process connections, as peers may not observe the pod's removal from a service immediately.
  3. After the pod's terminationGracePeriodSeconds elapses, the kubelet sends SIGKILL to any of the pod's containers that are still running.

Currently, we initiate process shutdown as soon as SIGTERM is received. In the proxy, we do this by refusing new connections/requests while letting existing connections/requests complete. The proxy terminates once all connections are closed.
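
For illustration, this drain-on-SIGTERM pattern looks roughly like the following Go sketch. The proxy itself is Rust; this is a generic HTTP server showing the pattern, not Linkerd's implementation, and the port and timeout are arbitrary.

```go
// Sketch of "stop accepting new connections, drain existing ones" on SIGTERM,
// using a plain net/http server.
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	// ctx is cancelled when SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done() // SIGTERM received

	// Shutdown closes the listener immediately (no new connections) and waits
	// for in-flight requests to finish, bounded here by an arbitrary timeout.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(shutdownCtx)
}
```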

This is incorrect!

Instead, we need to... do nothing.

This also applies to all of our controllers (both Go & Rust).
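
By contrast, "doing nothing" amounts to ignoring SIGTERM and continuing to serve until the kubelet's SIGKILL ends the process. A minimal Go sketch of that option, again generic rather than Linkerd code:

```go
// Sketch of the "do nothing" option: keep serving until the kubelet sends
// SIGKILL after terminationGracePeriodSeconds. Go's default behavior on
// SIGTERM is to exit, so the signal must be explicitly ignored.
package main

import (
	"log"
	"net/http"
	"os/signal"
	"syscall"
)

func main() {
	signal.Ignore(syscall.SIGTERM) // keep accepting connections until SIGKILL
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```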

jeremychase commented 1 year ago

After discussion we agreed that linkerd-proxy is probably not handling shutdown correctly, but the correct behavior is still up for debate.

Thoughts:

olix0r commented 1 year ago

After some further rumination...

With this in mind, I suggest the following behavior for Linkerd components:

adleong commented 1 year ago

In an ideal world, here's how I imagine this working:

The benefit to this approach is that we don't establish a connection to a server that we know will be forcefully killed within the next gracePeriod. The work happening on the connection has the best chance to succeed if it is established to a server that isn't known to be terminating soon. In a sense, refusing new connections augments service discovery by guiding clients away from the terminating server, even if they haven't received that information from service discovery yet.

However, a few questions about whether this "ideal" world is realistic:

Another question: is our strategy different when we control both the server and the client? For example, in communication from the Linkerd proxy to the Linkerd control plane, we can control the behavior on both sides. Does it make sense to handle graceful termination differently here vs when we don't control the client and can't know how it will handle rejected connections?

olix0r commented 1 year ago
  • When a pod gets SIGTERM, it stops accepting new connections but allows in-flight connections to progress
  • Clients may not yet know the pod is terminating and may attempt to establish a connection to it, which will be rejected
  • Since the connection was rejected before any payload bytes were sent, the client can safely retry the connection. This retry will eventually hit another server replica, either by chance or because the client will eventually get the service discovery update.

The benefit to this approach is that we don't establish a connection to a server that we know will be forcefully killed within the next gracePeriod. The work happening on the connection has the best chance to succeed if it is established to a server that isn't known to be terminating soon. In a sense, refusing new connections augments service discovery by guiding clients away from the terminating server, even if they haven't received that information from service discovery yet.

The problem with this approach is that it basically requires that clients see connection errors. But these errors are entirely avoidable. Clients can get discovery updates on their own and gracefully move load. When we refuse connections, we introduce more errors into the system. Your argument is that these errors are inherently better than errors that might be encountered if connections are accepted and then later terminated, but I don't agree with this assertion.

If an operator configures a terminationGracePeriodSeconds: 30, so that their clients have 30 seconds to gracefully move load, then I think it's reasonable to honor that configuration so that clients can properly respond to discovery updates and all changes can be driven safely, without any interruption.

I don't think it's appropriate for Linkerd's behavior to be less graceful than the application's. If it continues to accept connections, so should we.

In my view, the only benefit of your proposed approach is that it would allow pods to terminate more quickly without potentially running idle during the termination grace period. I.e., if service discovery propagates within 3 seconds, we could spend 27 seconds doing nothing.

olix0r commented 1 year ago

To put it another way: we want to allow for some grace period between a kubelet announcing its intention to kill a container and our decision to stop allowing connections. There may be connection attempts in flight that we are about to accept. We want to be more graceful in this condition.

We need to stop accepting connections at some point, but when? Well, after some grace period so that (1) in-flight work is processed and (2) clients have an opportunity to observe the discovery change and adapt their traffic. My main argument is that this grace period and terminationGracePeriodSeconds are functionally the same thing. I think the alternative proposal is that this grace period should be zero?
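
A minimal Go sketch of this "keep serving for the grace period, then drain" behavior; the environment variable name and timings are hypothetical, and the grace period would in practice mirror the pod's terminationGracePeriodSeconds.

```go
// Sketch: on SIGTERM, keep accepting connections for most of the grace
// period so clients can react to discovery updates, then drain.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"strconv"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term // kubelet has announced its intention to kill the container

	// Hypothetical: the pod's terminationGracePeriodSeconds passed in via an
	// environment variable; 30s is the Kubernetes default.
	grace := 30 * time.Second
	if s, err := strconv.Atoi(os.Getenv("TERMINATION_GRACE_PERIOD_SECONDS")); err == nil {
		grace = time.Duration(s) * time.Second
	}

	// Continue accepting new connections for all but the tail of the grace
	// period, then stop accepting and drain in-flight work before SIGKILL.
	wait := grace - 5*time.Second
	if wait < 0 {
		wait = 0
	}
	time.Sleep(wait)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```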

I think it's interesting that there probably is not much value in unbinding the listener and draining active connections: we should expect service discovery updates to have propagated and clients should not be establishing new connections late in the grace period.

adleong commented 1 year ago

Your argument is that these errors are inherently better than errors that might be encountered if connections are accepted and then later terminated

Yes, because if no bytes are written to the connection, it is safe to retry. I recall the "RequeueFilter" from Finagle, which automatically retried this kind of connection failure.
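
A minimal Go sketch of that kind of connect-level retry (replica addresses are hypothetical; this shows the general idea, not Finagle's implementation): because no request bytes were written on the refused attempt, retrying against another replica cannot duplicate work.

```go
// Sketch of retrying a refused connection against other replicas. Since no
// bytes were written on the failed attempt, the retry cannot duplicate work.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// dialAny tries each endpoint in turn and returns the first connection that
// succeeds, treating "connection refused" as safely retryable.
func dialAny(endpoints []string) (net.Conn, error) {
	var lastErr error
	for _, addr := range endpoints {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			return conn, nil
		}
		if errors.Is(err, syscall.ECONNREFUSED) {
			lastErr = err
			continue // a terminating replica refused us; try the next one
		}
		return nil, err // other errors are not known to be safe to retry
	}
	return nil, fmt.Errorf("all endpoints refused: %w", lastErr)
}

func main() {
	// Hypothetical replica addresses for illustration.
	conn, err := dialAny([]string{"10.0.0.1:8080", "10.0.0.2:8080"})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```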

we should expect service discovery updates to have propagated and clients should not be establishing new connections late in the grace period.

Well, yes, if service discovery updates were instantaneous, we wouldn't expect to get any new connections during the grace period. But the debate is what to do about connections which are established while the update is propagating. I think the choices are (A) accept the connection and hope that the work can be completed in the remaining time before the server is SIGKILLed, or (B) reject the connection so that the client can establish this connection to another server replica instead.

I don't know how to quantify "gracefulness" but option B minimizes errors surfaced to the upstream caller and I think that's desirable.

olix0r commented 1 year ago

It's probably helpful to be clear about some assumptions. I assume:

In this golden case, clients will voluntarily remove a server replica via discovery and route requests elsewhere within a few seconds. If a server replica stops accepting connections before that can happen, we don't allow any time for that graceful migration.

My arguments are:

  1. We should allow some grace period for clients to adapt before we do anything to fail connections/requests. I suppose a user could configure that to be 0 if desired for some reason.
  2. I haven't heard a good argument for why that grace period should be anything other than terminationGracePeriodSeconds--where the user has already expressed an intention for this limit.

adleong commented 1 year ago

Digging into those assumptions some more, here's the client behavior that I'm imagining and that I think is very common in Kubernetes:

A client establishes a connection or a pool of connections to the cluster IP of a Kubernetes service. It then sends HTTP requests on those connections for as long as they remain open, establishing new connections only when existing ones are closed or when necessary due to increased concurrency. Under the hood, kube-proxy proxies those connections to backend replicas as they are established. When kube-proxy observes a service discovery update, it will stop establishing new connections to the terminating pod, but any existing connections will remain, and since the client isn't observing service discovery directly, it will continue to use those connections instead of gracefully draining traffic.
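
For illustration, a Go sketch of the kind of pooled client described above (the Service URL is hypothetical): once idle keep-alive connections exist, requests reuse them rather than dialing new ones, so the client keeps hitting whichever backend kube-proxy picked when each connection was established.

```go
// Sketch of the pooled-client behavior described above: requests reuse idle
// keep-alive connections to the Service's cluster IP, so the client keeps
// hitting whichever backend kube-proxy chose when the connection was made.
package main

import (
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     90 * time.Second, // pooled conns can outlive discovery updates
		},
	}

	// Hypothetical in-cluster Service URL. Each request below prefers an idle
	// pooled connection over dialing a new one.
	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://my-svc.my-namespace.svc.cluster.local/")
		if err != nil {
			continue
		}
		resp.Body.Close()
	}
}
```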

When the client is meshed, it can handle this better because when the outbound proxy gets a service discovery update, it will remove the terminating pod from its load balancer and requests will be drained from it. However, we still have the same problem for TCP traffic where any connections established before Linkerd gets the service discovery update will persist until the server is forcefully killed.

GabrielAlacchi commented 1 year ago

I would appreciate it if the shutdown behavior were as configurable as possible. Different use cases will warrant different behaviors to optimize shutdown. It may make sense to stop accepting connections immediately upon receiving SIGTERM in some contexts, while in others you may want to continue accepting new connections.

For instance, one case where I'd prefer to keep accepting new connections after SIGTERM is when using the Ingress NGINX Controller. Since NGINX functions as a proxy, my team has observed that it may continue opening new upstream connections while draining during shutdown if it has no existing keepalive connection to the upstream service. To gracefully process any in-flight requests, it needs the ability to keep opening new connections. We currently use config.alpha.linkerd.io/proxy-wait-before-exit-seconds to work around this.
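
For reference, a sketch of applying that annotation to a workload's pod template using client-go types; the 40-second value is purely illustrative.

```go
// Sketch of setting the proxy-wait-before-exit-seconds annotation on a
// Deployment's pod template. The value is illustrative.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func addProxyWait(d *appsv1.Deployment, seconds string) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["config.alpha.linkerd.io/proxy-wait-before-exit-seconds"] = seconds
}

func main() {
	var d appsv1.Deployment
	addProxyWait(&d, "40")
	fmt.Println(d.Spec.Template.Annotations)
}
```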

Tolsto commented 1 year ago

It is especially annoying that linkerd-proxy will also reject new outgoing connections after receiving SIGTERM. This results in errors when the pod still has to process existing requests, which often requires making connections to external services like databases or APIs.

Even worse, when running Linkerd on the edge, e.g. on ingress controllers, HTTP persistent connections from upstream proxies (e.g. a WAF) are a major issue. When the pod starts the shutdown procedure, the existing TCP connections will not get terminated, and a lot of new HTTP requests will still be received through them until the ingress controller starts shutting down the idle connections. However, since linkerd-proxy refuses new connections directly after SIGTERM, all of those requests that require a new connection to a workload backend will fail. In my opinion, doing nothing on SIGTERM and waiting for SIGKILL would be the best default behavior for the vast majority of use cases. But at the very least, linkerd-proxy should not interfere with outgoing connections.

However, just waiting for SIGKILL might not be a good approach in cases where the pod has a very high value for terminationGracePeriodSeconds. For instance, some stateful services like RabbitMQ use a very long grace period in order to ensure that the service has enough time to persist its state. Usually it finishes long before SIGKILL would be triggered and shuts down the main container, which stops the pod. If linkerd-proxy waited for SIGKILL, it might keep the pod running for hours. IMHO the ideal solution would be an optional configuration that instructs linkerd-proxy to shut down once certain containers are no longer running. Argo Workflows has an implementation for watching the main container that works quite well.

nathanmcgarvey-modopayments commented 1 year ago

In line with @Tolsto, my primary use case for this change is outbound connections that originate from within the terminating pod at shutdown. This covers everything from dumping last-state, to metrics, to flushing in-memory messages to a persistent queuing service. In many cases, I'd probably just as soon use config.alpha.linkerd.io/proxy-wait-before-exit-seconds with a buffer equal to the termination grace period in production environments as any other solution, but that's also because I can set it low enough to not be impactful. In some cases, I'd rather Linkerd ignore the SIGTERM from k8s entirely and just let us hit terminationGracePeriodSeconds or SIGKILL, or we can treat it like we do Jobs currently and make a call to the admin endpoint (with linkerd-await or manually).

nathanmcgarvey-modopayments commented 1 year ago

At its core, I feel like this issue is similar, in theory, to https://github.com/linkerd/linkerd2/issues/8006

Basically, trying to judge whether or not the proxy should continue running based on network traffic is heuristic guesswork, and basing it on a configuration variable is an anti-pattern (like putting a "sleep" in your code: it's usually indicative of a bigger issue). The "real" solution (IMHO, obviously) is to wait for all non-proxy containers in the pod to be in the Terminated state after SIGTERM is received. Unfortunately, that requires k8s API access (albeit not much) or a shared process namespace to "see" something of the other containers, but that's likely a security issue.
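
A minimal client-go sketch of that "wait for the other containers to terminate" idea. The pod and namespace would come from the downward API; the polling approach, environment variable names, and container name are all illustrative, and this is not an existing Linkerd feature.

```go
// Sketch of "exit once all non-proxy containers have terminated", polling the
// pod's status via the Kubernetes API.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetically provided via the downward API.
	ns, pod := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

	for {
		p, err := client.CoreV1().Pods(ns).Get(context.Background(), pod, metav1.GetOptions{})
		if err == nil {
			done := true
			for _, cs := range p.Status.ContainerStatuses {
				if cs.Name == "linkerd-proxy" {
					continue // ignore ourselves
				}
				if cs.State.Terminated == nil {
					done = false // an application container is still running
				}
			}
			if done {
				os.Exit(0) // all non-proxy containers have terminated
			}
		}
		time.Sleep(2 * time.Second)
	}
}
```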

Since linkerd-proxy sees network traffic, can it detect and "smell" Kubernetes probes going to individual containers? If so, you may be able to improve that network heuristic by issuing your own "liveness" checks to localhost after SIGTERM is received.

Tolsto commented 1 year ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.
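
For context, the mechanism is an init container marked with restartPolicy: Always. A sketch of such a pod spec using client-go types follows; container names and images are illustrative, not Linkerd's actual injected configuration.

```go
// Sketch of the Kubernetes 1.28+ native sidecar mechanism: a sidecar is an
// init container with restartPolicy: Always, and is terminated automatically
// after the regular containers exit.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

func sidecarPodSpec() corev1.PodSpec {
	always := corev1.ContainerRestartPolicyAlways
	return corev1.PodSpec{
		InitContainers: []corev1.Container{{
			Name:          "linkerd-proxy",
			Image:         "cr.l5d.io/linkerd/proxy:stable", // illustrative tag
			RestartPolicy: &always,                          // marks this init container as a sidecar
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // illustrative
		}},
	}
}

func main() {
	_ = sidecarPodSpec()
}
```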

nathanmcgarvey-modopayments commented 1 year ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.

Sadly, in alpha though, and as a feature gate.

howardjohn commented 1 year ago

You might consider sending GOAWAY/Connection: close during the exit period, rather than nothing, to encourage clients to connect to other servers.
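
One way to approximate this with Go's net/http (timings are illustrative): disabling keep-alives makes the server add Connection: close to HTTP/1.1 responses during the drain window, and Shutdown then closes the listener and winds down remaining connections, including the built-in HTTP/2 server's graceful GOAWAY-based close.

```go
// Sketch of nudging clients off a terminating server during the drain window.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Keep accepting, but tell clients to re-establish their connections
	// elsewhere: responses now carry "Connection: close".
	srv.SetKeepAlivesEnabled(false)

	// Illustrative drain window before fully shutting down.
	time.Sleep(15 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```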

nathanmcgarvey-modopayments commented 9 months ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.

Sadly, in alpha though, and as a feature gate.

Now in beta for 1.29! (https://github.com/kubernetes/kubernetes/pull/121579)

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

nathanmcgarvey-modopayments commented 6 months ago

I think this should be slated for GA once k8s has the feature in GA. It's roadmapped for k8s 1.32 right now, but sometimes those timelines shift forward or backward.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

wmorgan commented 3 months ago

Native sidecar support has been out since Linkerd 2.15 and I believe we already follow the suggestions from @howardjohn above, so I think we can just close this issue as done?

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.