linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Do we handle shutdown incorrectly? #10379

Open olix0r opened 1 year ago

olix0r commented 1 year ago

Someone recently brought this thread to my attention. Tim argues that applications must continue to handle new connections/requests after receiving SIGTERM. In short:

  1. When kubelet decides to terminate a pod, it sends a SIGTERM to its containers.
  2. These processes must continue to process connections, as peers may not observe the pod's removal from a service immediately.
  3. After the pod's terminationGracePeriodSeconds elapses, the kubelet sends SIGKILL to any of the pod's containers that are still running.

Currently, we initiate process shutdown as soon as SIGTERM is received. In the proxy, we do this by refusing new connections/requests while letting existing connections/requests complete. The proxy terminates once all connections are closed.
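
For illustration, this drain-on-SIGTERM pattern looks roughly like the following Go sketch. The proxy itself is Rust; this is a generic HTTP server showing the pattern, not Linkerd's implementation, and the port and timeout are arbitrary.

```go
// Sketch of "stop accepting new connections, drain existing ones" on SIGTERM,
// using a plain net/http server.
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	// ctx is cancelled when SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done() // SIGTERM received

	// Shutdown closes the listener immediately (no new connections) and waits
	// for in-flight requests to finish, bounded here by an arbitrary timeout.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(shutdownCtx)
}
```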

This is incorrect!

Instead, we need to... do nothing.

This also applies to all of our controllers (both Go & Rust).
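
By contrast, "doing nothing" amounts to ignoring SIGTERM and continuing to serve until the kubelet's SIGKILL ends the process. A minimal Go sketch of that option, again generic rather than Linkerd code:

```go
// Sketch of the "do nothing" option: keep serving until the kubelet sends
// SIGKILL after terminationGracePeriodSeconds. Go's default behavior on
// SIGTERM is to exit, so the signal must be explicitly ignored.
package main

import (
	"log"
	"net/http"
	"os/signal"
	"syscall"
)

func main() {
	signal.Ignore(syscall.SIGTERM) // keep accepting connections until SIGKILL
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```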

jeremychase commented 1 year ago

After discussion we agreed that linkerd-proxy is probably not handling shutdown correctly, but the correct behavior is still up for debate.

Thoughts:

olix0r commented 1 year ago

After some further rumination...

With this in mind, I suggest the following behavior for Linkerd components:

adleong commented 1 year ago

In an ideal world, here's how I imagine this working:

The benefit to this approach is that we don't establish a connection to a server that we know will be forcefully killed within the next gracePeriod. The work happening on the connection has the best chance to succeed if it is established to a server that isn't known to be terminating soon. In a sense, refusing new connections augments service discovery by guiding clients away from the terminating server, even if they haven't received that information from service discovery yet.

However, a few questions about whether this "ideal" world is realistic:

Another question: is our strategy different when we control both the server and the client? For example, in communication from the Linkerd proxy to the Linkerd control plane, we can control the behavior on both sides. Does it make sense to handle graceful termination differently here vs when we don't control the client and can't know how it will handle rejected connections?

olix0r commented 1 year ago
  • When a pod gets SIGTERM, it stops accepting new connections but allows in-flight connections to progress
  • Clients may not yet know the pod is terminating and may attempt to establish a connection to it, which will be rejected
  • Since the connection was rejected before any payload bytes were sent, the client can safely retry the connection. This retry will eventually hit another server replica, either by chance or because the client will eventually get the service discovery update.

The benefit to this approach is that we don't establish a connection to a server that we know will be forcefully killed within the next gracePeriod. The work happening on the connection has the best chance to succeed if it is established to a server that isn't known to be terminating soon. In a sense, refusing new connections augments service discovery by guiding clients away from the terminating server, even if they haven't received that information from service discovery yet.

The problem with this approach is that it basically requires that clients see connection errors. But these errors are entirely avoidable. Clients can get discovery updates on their own and gracefully move load. When we refuse connections, we introduce more errors into the system. Your argument is that these errors are inherently better than errors that might be encountered if connections are accepted and then later terminated, but I don't agree with this assertion.

If an operator configures a terminationGracePeriodSeconds: 30, so that their clients have 30 seconds to gracefully move load, then I think it's reasonable to honor that configuration so that clients can properly respond to discovery updates and all changes can be driven safely, without any interruption.

I don't think it's appropriate for Linkerd's behavior to be less graceful than the application's. If it continues to accept connections, so should we.

In my view, the only benefit of your proposed approach is that it would allow pods to terminate more quickly without potentially running idle during the termination grace period. I.e., if service discovery propagates within 3 seconds, we could spend 27 seconds doing nothing.

olix0r commented 1 year ago

To put it another way: we want to allow for some grace period between a kubelet announcing its intention to kill a container and our decision to stop allowing connections. There may be connection attempts in flight that we are about to accept. We want to be more graceful in this condition.

We need to stop accepting connections at some point, but when? Well, after some grace period so that (1) in-flight work is processed and (2) clients have an opportunity to observe the discovery change and adapt their traffic. My main argument is that this grace period and terminationGracePeriodSeconds are functionally the same thing. I think the alternative proposal is that this grace period should be zero?
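
A minimal Go sketch of this "keep serving for the grace period, then drain" behavior; the environment variable name and timings are hypothetical, and the grace period would in practice mirror the pod's terminationGracePeriodSeconds.

```go
// Sketch: on SIGTERM, keep accepting connections for most of the grace
// period so clients can react to discovery updates, then drain.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"strconv"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term // kubelet has announced its intention to kill the container

	// Hypothetical: the pod's terminationGracePeriodSeconds passed in via an
	// environment variable; 30s is the Kubernetes default.
	grace := 30 * time.Second
	if s, err := strconv.Atoi(os.Getenv("TERMINATION_GRACE_PERIOD_SECONDS")); err == nil {
		grace = time.Duration(s) * time.Second
	}

	// Continue accepting new connections for all but the tail of the grace
	// period, then stop accepting and drain in-flight work before SIGKILL.
	wait := grace - 5*time.Second
	if wait < 0 {
		wait = 0
	}
	time.Sleep(wait)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```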

I think it's interesting that there probably is not much value in unbinding the listener and draining active connections: we should expect service discovery updates to have propagated and clients should not be establishing new connections late in the grace period.

adleong commented 1 year ago

Your argument is that these errors are inherently better than errors that might be encountered if connections are accepted and then later terminated

Yes, because if no bytes are written to the connection, it is safe to retry. I recall the "RequeueFilter" from Finagle, which automatically retried this kind of connection failure.
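
A minimal Go sketch of that kind of connect-level retry (replica addresses are hypothetical; this shows the general idea, not Finagle's implementation): because no request bytes were written on the refused attempt, retrying against another replica cannot duplicate work.

```go
// Sketch of retrying a refused connection against other replicas. Since no
// bytes were written on the failed attempt, the retry cannot duplicate work.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// dialAny tries each endpoint in turn and returns the first connection that
// succeeds, treating "connection refused" as safely retryable.
func dialAny(endpoints []string) (net.Conn, error) {
	var lastErr error
	for _, addr := range endpoints {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			return conn, nil
		}
		if errors.Is(err, syscall.ECONNREFUSED) {
			lastErr = err
			continue // a terminating replica refused us; try the next one
		}
		return nil, err // other errors are not known to be safe to retry
	}
	return nil, fmt.Errorf("all endpoints refused: %w", lastErr)
}

func main() {
	// Hypothetical replica addresses for illustration.
	conn, err := dialAny([]string{"10.0.0.1:8080", "10.0.0.2:8080"})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```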

we should expect service discovery updates to have propagated and clients should not be establishing new connections late in the grace period.

Well, yes, if service discovery updates were instantaneous, we wouldn't expect to get any new connections during the grace period. But the debate is what to do about connections which are established while the update is propagating. I think the choices are (A) accept the connection and hope that the work can be completed in the remaining time before the server is SIGKILLed, or (B) reject the connection so that the client can establish this connection to another server replica instead.

I don't know how to quantify "gracefulness" but option B minimizes errors surfaced to the upstream caller and I think that's desirable.

olix0r commented 1 year ago

It's probably helpful to be clear about some assumptions. I assume:

In this golden case, clients will voluntarily remove a server replica via discovery and route requests elsewhere within a few seconds. If a server replica stops accepting connections before that can happen, we don't allow any time for that graceful migration.

My arguments are:

  1. We should allow some grace period for clients to adapt before we do anything to fail connections/requests. I suppose a user could configure that to be 0 if desired for some reason.
  2. I haven't heard a good argument for why that grace period should be anything other than terminationGracePeriodSeconds--where the user has already expressed an intention for this limit.

adleong commented 1 year ago

Digging into those assumptions some more, here's the client behavior that I'm imagining and that I think is very common in Kubernetes:

A client establishes a connection or a pool of connections to the cluster IP of a Kubernetes service. It then sends HTTP requests on those connections for as long as they remain open, establishing new connections only when existing ones are closed or when necessary due to increased concurrency. Under the hood, kube-proxy proxies those connections to backend replicas as they are established. When kube-proxy observes a service discovery update, it will stop establishing new connections to the terminating pod, but any existing connections will remain, and since the client isn't observing service discovery directly, it will continue to use those connections instead of gracefully draining traffic.
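
For illustration, a Go sketch of the kind of pooled client described above (the Service URL is hypothetical): once idle keep-alive connections exist, requests reuse them rather than dialing new ones, so the client keeps hitting whichever backend kube-proxy picked when each connection was established.

```go
// Sketch of the pooled-client behavior described above: requests reuse idle
// keep-alive connections to the Service's cluster IP, so the client keeps
// hitting whichever backend kube-proxy chose when the connection was made.
package main

import (
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     90 * time.Second, // pooled conns can outlive discovery updates
		},
	}

	// Hypothetical in-cluster Service URL. Each request below prefers an idle
	// pooled connection over dialing a new one.
	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://my-svc.my-namespace.svc.cluster.local/")
		if err != nil {
			continue
		}
		resp.Body.Close()
	}
}
```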

When the client is meshed, it can handle this better because when the outbound proxy gets a service discovery update, it will remove the terminating pod from its load balancer and requests will be drained from it. However, we still have the same problem for TCP traffic where any connections established before Linkerd gets the service discovery update will persist until the server is forcefully killed.

GabrielAlacchi commented 1 year ago

I would appreciate it if the shutdown behavior were as configurable as possible. Different use cases will warrant different behaviors to optimize shutdown. It may make sense to stop accepting connections immediately upon receiving SIGTERM in some contexts, while in others you may want to continue accepting new connections.

For instance, one case where I'd prefer to keep accepting new connections after SIGTERM is when using the Ingress NGINX Controller. Since NGINX functions as a proxy, my team has observed that it may continue opening new upstream connections while draining during shutdown if it has no existing keepalive connection to the upstream service. To gracefully process any in-flight requests, it needs the ability to keep opening new connections. We currently use config.alpha.linkerd.io/proxy-wait-before-exit-seconds to work around this.
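
For reference, a sketch of applying that annotation to a workload's pod template using client-go types; the 40-second value is purely illustrative.

```go
// Sketch of setting the proxy-wait-before-exit-seconds annotation on a
// Deployment's pod template. The value is illustrative.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func addProxyWait(d *appsv1.Deployment, seconds string) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["config.alpha.linkerd.io/proxy-wait-before-exit-seconds"] = seconds
}

func main() {
	var d appsv1.Deployment
	addProxyWait(&d, "40")
	fmt.Println(d.Spec.Template.Annotations)
}
```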

Tolsto commented 1 year ago

It is especially annoying that linkerd-proxy will also reject new outgoing connections after receiving SIGTERM. This results in errors when the pod still has to process existing requests, which often requires making connections to external services like databases or APIs.

Even worse, when running Linkerd on the edge, e.g. on ingress controllers, HTTP persistent connections from upstream proxies (e.g. a WAF) are a major issue. When the pod starts the shutdown procedure, the existing TCP connections will not get terminated, and a lot of new HTTP requests will still be received through them until the ingress controller starts shutting down the idle connections. However, since linkerd-proxy refuses new connections directly after SIGTERM, all of those requests that require a new connection to a workload backend will fail. In my opinion, doing nothing on SIGTERM and waiting for SIGKILL would be the best default behavior for the vast majority of use cases. But at the very least, linkerd-proxy should not interfere with outgoing connections.

However, just waiting for SIGKILL might not be a good approach in cases where the pod has a very high value for terminationGracePeriodSeconds. For instance, some stateful services like RabbitMQ use a very long grace period in order to ensure that the service has enough time to persist its state. Usually it finishes long before SIGKILL would be triggered and shuts down the main container, which stops the pod. If linkerd-proxy waited for SIGKILL, it might keep the pod running for hours. IMHO the ideal solution would be an optional configuration that instructs linkerd-proxy to shut down once certain containers are no longer running. Argo Workflows has an implementation for watching the main container that works quite well.

nathanmcgarvey-modopayments commented 1 year ago

In line with @Tolsto, my primary use case for this change is outbound connections that originate from within the terminating pod at shutdown. This covers everything from dumping last-state, to metrics, to flushing in-memory messages to a persistent queuing service. In many cases, I'd probably just as soon use config.alpha.linkerd.io/proxy-wait-before-exit-seconds with a buffer equal to the termination grace period in production environments as any other solution, but that's also because I can set it low enough to not be impactful. In some cases, I'd rather Linkerd ignore the SIGTERM from k8s entirely and just let us hit terminationGracePeriodSeconds or SIGKILL, or we can treat it like we do Jobs currently and make a call to the admin endpoint (with linkerd-await or manually).

nathanmcgarvey-modopayments commented 1 year ago

At its core, I feel like this issue is similar, in theory, to https://github.com/linkerd/linkerd2/issues/8006

Basically, trying to judge whether or not the proxy should continue running based on network traffic is heuristic guesswork, and basing it on a configuration variable is an anti-pattern (like putting a "sleep" in your code: it's usually indicative of a bigger issue). The "real" solution (IMHO, obviously) is to wait for all non-proxy containers in the pod to be in the Terminated state after SIGTERM is received. Unfortunately, that requires k8s API access (albeit not much) or a shared process namespace to "see" something of the other containers, but that's likely a security issue.
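
A minimal client-go sketch of that "wait for the other containers to terminate" idea. The pod and namespace would come from the downward API; the polling approach, environment variable names, and container name are all illustrative, and this is not an existing Linkerd feature.

```go
// Sketch of "exit once all non-proxy containers have terminated", polling the
// pod's status via the Kubernetes API.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetically provided via the downward API.
	ns, pod := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

	for {
		p, err := client.CoreV1().Pods(ns).Get(context.Background(), pod, metav1.GetOptions{})
		if err == nil {
			done := true
			for _, cs := range p.Status.ContainerStatuses {
				if cs.Name == "linkerd-proxy" {
					continue // ignore ourselves
				}
				if cs.State.Terminated == nil {
					done = false // an application container is still running
				}
			}
			if done {
				os.Exit(0) // all non-proxy containers have terminated
			}
		}
		time.Sleep(2 * time.Second)
	}
}
```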

Since linkerd-proxy sees network traffic, can it detect and "smell" Kubernetes probes going to individual containers? If so, you may be able to improve that network heuristic by issuing your own "liveness" checks to localhost after SIGTERM is received.

Tolsto commented 1 year ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.
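
For context, the mechanism is an init container marked with restartPolicy: Always. A sketch of such a pod spec using client-go types follows; container names and images are illustrative, not Linkerd's actual injected configuration.

```go
// Sketch of the Kubernetes 1.28+ native sidecar mechanism: a sidecar is an
// init container with restartPolicy: Always, and is terminated automatically
// after the regular containers exit.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

func sidecarPodSpec() corev1.PodSpec {
	always := corev1.ContainerRestartPolicyAlways
	return corev1.PodSpec{
		InitContainers: []corev1.Container{{
			Name:          "linkerd-proxy",
			Image:         "cr.l5d.io/linkerd/proxy:stable", // illustrative tag
			RestartPolicy: &always,                          // marks this init container as a sidecar
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // illustrative
		}},
	}
}

func main() {
	_ = sidecarPodSpec()
}
```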

nathanmcgarvey-modopayments commented 1 year ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.

Sadly, in alpha though, and as a feature gate.

howardjohn commented 1 year ago

You might consider sending GOAWAY/Connection: close during the exit period, rather than nothing, to encourage clients to connect to other servers.
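
One way to approximate this with Go's net/http (timings are illustrative): disabling keep-alives makes the server add Connection: close to HTTP/1.1 responses during the drain window, and Shutdown then closes the listener and winds down remaining connections, including the built-in HTTP/2 server's graceful GOAWAY-based close.

```go
// Sketch of nudging clients off a terminating server during the drain window.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Keep accepting, but tell clients to re-establish their connections
	// elsewhere: responses now carry "Connection: close".
	srv.SetKeepAlivesEnabled(false)

	// Illustrative drain window before fully shutting down.
	time.Sleep(15 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```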

nathanmcgarvey-modopayments commented 9 months ago

Kubernetes 1.28 brings "API awareness of sidecar containers". Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.

Sadly, in alpha though, and as a feature gate.

Now in beta for 1.29! (https://github.com/kubernetes/kubernetes/pull/121579)

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

nathanmcgarvey-modopayments commented 6 months ago

I think this should be slated for GA once k8s has the feature in GA. It's roadmapped for k8s 1.32 right now, but sometimes those timelines shift forward or backward.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

wmorgan commented 3 months ago

Native sidecar support has been out since Linkerd 2.15 and I believe we already follow the suggestions from @howardjohn above, so I think we can just close this issue as done?

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.