Native Sidecars draining erroneously sending connection: close for outbound requests

Stono commented 1 month ago

Is this the right place to submit this?

[X] This is not a security vulnerability or a crashing bug
[X] This is not a question about how to use Istio

Bug Description

When using native sidecars, the following lifecycle is added:

    lifecycle:
      preStop:
        exec:
          command:
          - pilot-agent
          - request
          - --debug-port=15020
          - POST
          - drain

The preStop event fires when the Pod enters a terminating state. At this point, the proxy starts sending connection: close for outbound requests, eg the app will still be making outbound calls at this point, but getting a connection: close back.

You can see this by doing the following:

Add a preStop: sleep 10 to your app container
Exec into the pod, and setup a watch -n 1 curl <whatever>
kubectl delete the pod
Immediately see the connection: close headers being returned

This causes some of our http pool implementations to quite literally implode! :D

So far, I've only got the issue to manifest when the remote destination does not have a sidecar too, for example:

app1 | app1-proxy ----http2-----> remote app (no sidecar)

I do not believe we should be seeing connection: close for these outbound requests during the shutdown lifecycle of a pod.

Version

1.22.2

Additional Information

No response

howardjohn commented 1 month ago

This seems to be a bug in Envoy.

curl -X POST localhost:15000/'drain_listeners?inboundonly': works

curl -X POST localhost:15000/'drain_listeners?inboundonly&graceful': does not (that is, it applies to outbound traffic)

Stono commented 1 month ago

As a mitigation, i've removed the shutdown hook from our injection template. We already have some reasonable guards to aid with graceful shutdowns that have historically worked ok for us.

howardjohn commented 1 month ago

The logic in envoy:

  if (graceful) {
    // Ignore calls to /drain_listeners?graceful if the drain sequence has
    // already started.
    if (!server_.drainManager().draining()) {
      server_.drainManager().startDrainSequence([this, stop_listeners_type, skip_exit]() {
        if (!skip_exit) {
          server_.listenerManager().stopListeners(stop_listeners_type, {});
        }
      });
    }
  } else {
    server_.listenerManager().stopListeners(stop_listeners_type, {});
  }

You can see in graceful+skip_exit+inboundonly, the stop_listeners_type=inboundonly is not used anywhere

howardjohn commented 1 month ago

Opened https://github.com/envoyproxy/envoy/issues/35020

ramaraochavali commented 1 month ago

Is n't this a problem for regular sidecars as well?

Stono commented 1 month ago

No, the prestop hook to drain is only added when native sidecars is enabled

ramaraochavali commented 1 month ago

But the same logic gets executed when proxy gets SIGTERM

Stono commented 1 month ago

If that is true then yes, it is a problem there too. The reason I might not have seen it historically is for years we've had a pilot-agent wrapper script that blocked SIGTERM getting to the proxy and did a while loop looking for netstat listening ports from the main app. We removed that "hacky" script as part of the move to native sidecar containers.

dwj300 commented 1 month ago

Is n't this a problem for regular sidecars as well?

@ramaraochavali I think it is, right? Let's say terminationDrainDuration is 5min, and drainDuration is 45s. Presumably, a workload could send a new outbound request after one minute, and the outbound listener would already be shutdown...

ramaraochavali commented 1 month ago

yes

istio / istio