hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Mechanism to drain Envoy connections #8304

Open · freddygv opened this issue 4 years ago

freddygv commented 4 years ago

If users need to take a proxy or gateway out of commission, there is currently no good way to do so.

The primary mechanism for gracefully shutting down Envoy is the /healthcheck/fail admin endpoint (see ENVOY-1990).

Calling this endpoint causes Envoy to:

  1. Send failed health check responses to downstream proxies
  2. Start drain-closing connections
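
A minimal Go sketch of triggering this by hand, assuming the proxy's Envoy admin interface listens on `localhost:19000` (the default `-admin-bind` for `consul connect envoy`; adjust for your deployment). The helper names are hypothetical, not part of Consul or Envoy. Envoy's admin API expects state-changing endpoints such as /healthcheck/fail to be called with POST.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newHealthcheckFailRequest builds the admin request that tells Envoy to
// fail its health checks and start drain-closing connections.
func newHealthcheckFailRequest(adminAddr string) (*http.Request, error) {
	return http.NewRequest(http.MethodPost, adminAddr+"/healthcheck/fail", nil)
}

// failHealthChecks sends the request and treats any non-200 reply as an error.
func failHealthChecks(adminAddr string) error {
	req, err := newHealthcheckFailRequest(adminAddr)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("calling /healthcheck/fail: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected admin response: %s", resp.Status)
	}
	return nil
}

func main() {
	// Assumed admin address for this sketch.
	if err := failHealthChecks("http://localhost:19000"); err != nil {
		fmt.Println("drain request failed:", err)
	}
}
```

As noted below, this alone does not stop Connect traffic, because nothing consumes Envoy's failed health check responses.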

However, we do not currently use Envoy's active health checking, so step 1 has no effect. Connect proxies learn the health of upstream proxies from Consul rather than by checking them directly.

Consul should provide some way to drain connections gracefully. Here are a few options:

  1. Change the Consul checks on Envoy proxies from TCP checks to HTTP checks. Listeners would have an HTTP health check filter in non-pass-through mode. Then, after a user calls /healthcheck/fail, Envoy would return a 503 whenever Consul probes the proxy. (It is unclear whether stacking this HTTP health check filter on a TCP listener would work.)
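
Option 1 relies on how Consul interprets HTTP check responses: per Consul's check documentation, any 2xx code is passing, 429 Too Many Requests is a warning, and anything else is critical. The Go sketch below mirrors that mapping to show why a 503 from Envoy after /healthcheck/fail would mark the proxy critical. It is an illustration, not Consul's actual check runner; `checkStatus`, `probe`, and the probe URL are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkStatus mirrors Consul's documented HTTP check semantics:
// 2xx -> passing, 429 -> warning, anything else -> critical.
func checkStatus(code int) string {
	switch {
	case code >= 200 && code <= 299:
		return "passing"
	case code == http.StatusTooManyRequests:
		return "warning"
	default:
		return "critical"
	}
}

// probe performs one HTTP check against url. A transport error (for
// example, connection refused) is also reported as critical.
func probe(client *http.Client, url string) string {
	resp, err := client.Get(url)
	if err != nil {
		return "critical"
	}
	defer resp.Body.Close()
	return checkStatus(resp.StatusCode)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// Hypothetical listener address and health check path, assuming the
	// HTTP health check filter is configured in non-pass-through mode.
	fmt.Println(probe(client, "http://127.0.0.1:21000/healthz"))
}
```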

  2. Add a new Consul check type whose status changes between passing and critical only through a manual toggle. This would be similar to a TTL check, but without an interval. Auto-registered Connect proxies would use this check type, and setting it to critical would stop Connect traffic from being sent to that proxy. Users or Consul could then send a request to /drain_listeners so that Envoy sends Connection: close as HTTP requests complete.
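
Option 2 could be sketched as a check with no interval whose status changes only through an explicit call, plus an optional hook that fires on the transition to critical (for example, to POST to Envoy's /drain_listeners admin endpoint). `ManualCheck`, `SetStatus`, and the hook are hypothetical names for the proposed check type, not an existing Consul API, and the admin address is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// ManualCheck is a hypothetical check type: like a TTL check, but with no
// interval, so its status changes only when SetStatus is called.
type ManualCheck struct {
	mu         sync.Mutex
	status     string
	onCritical func() // optional hook, e.g. ask Envoy to drain listeners
}

func NewManualCheck(onCritical func()) *ManualCheck {
	return &ManualCheck{status: "passing", onCritical: onCritical}
}

// SetStatus toggles the check. The hook runs only on a transition into
// critical, so repeated calls do not trigger repeated drains.
func (c *ManualCheck) SetStatus(status string) {
	c.mu.Lock()
	prev := c.status
	c.status = status
	hook := c.onCritical
	c.mu.Unlock()
	if status == "critical" && prev != "critical" && hook != nil {
		hook()
	}
}

func (c *ManualCheck) Status() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.status
}

// newDrainListenersRequest builds the Envoy admin call that drains listeners.
func newDrainListenersRequest(adminAddr string) (*http.Request, error) {
	return http.NewRequest(http.MethodPost, adminAddr+"/drain_listeners", nil)
}

func main() {
	// Assumed admin address for the proxy's Envoy instance.
	req, err := newDrainListenersRequest("http://localhost:19000")
	if err != nil {
		panic(err)
	}
	check := NewManualCheck(func() {
		// A real integration would send req with an http.Client; this
		// sketch only reports what it would send.
		fmt.Println("would send", req.Method, req.URL.String())
	})
	check.SetStatus("critical") // hook fires
	check.SetStatus("critical") // already critical, hook does not fire again
	fmt.Println(check.Status())
}
```

Firing the hook only on the passing-to-critical edge keeps the drain idempotent if an operator or automation sets the status more than once.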

  3. Add HTTP endpoints under /agent/checks/ that let users force a proxy's TCP checks to critical or passing. Marking a check critical would disable its check runner, and marking it passing would re-enable it. The listener-draining side effect from option 2 could apply here as well.

radykal-com commented 2 years ago

Hi,

Are there any plans to implement this? When doing blue/green deployments, we would like to gracefully drain connections to the old deployment before taking it down, but we cannot see a way to do that at the moment.