projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

shutdown-manager issues #4851

Closed: skriss closed this issue 11 months ago

skriss commented 1 year ago

xref. #3192 xref. #4322 xref. #4812

We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.

The first issue goes something like the following:

  1. Something triggers Kubernetes to terminate/restart the shutdown-manager sidecar container in an Envoy pod. We've seen definite evidence of its liveness probes failing due to timeouts, which could be caused by the container not having any resource requests and getting CPU-throttled.
  2. When the shutdown-manager container is terminated, its preStop hook is triggered, which runs the contour envoy shutdown command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
  3. The shutdown-manager's preStop hook eventually returns once enough connections have been closed, and the shutdown-manager container restarts successfully.
  4. The Envoy container is stuck indefinitely in a "draining" state, failing its readiness probe, but never restarting because there is no liveness probe or anything else to trigger a restart (a quick way to spot pods in this state is sketched just after this list).
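As a rough sketch of how to spot pods stuck in step 4, you can list the Envoy pods with per-container ready/restart status. The namespace, label, and container layout below are assumptions based on the default example deployment and may differ in your install:

```bash
# Sketch: list envoy pods with per-container ready/restart counts.
# "projectcontour" namespace and "app=envoy" label are assumptions taken from
# the example manifests; adjust for your deployment.
kubectl -n projectcontour get pods -l app=envoy \
  -o custom-columns='NAME:.metadata.name,READY:.status.containerStatuses[*].ready,RESTARTS:.status.containerStatuses[*].restartCount'
# A pod whose shutdown-manager has restarted but whose envoy container never
# returns to ready is likely stuck in the "draining" state described above.
```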

A few thoughts about this issue:

Secondly, #4322 describes issues with draining nodes due to the emptyDir that is used to allow the shutdown-manager and envoy containers to communicate (via UDS for the envoy admin interface, and by file to communicate when the Listener drain is done). I won't repeat that discussion here, just linking it for reference.

skriss commented 1 year ago

A few more thoughts:

cc @sunjayBhatia, interested in your thoughts here when you get a chance.

sunjayBhatia commented 1 year ago

Yeah, I agree that idea is the best solution for functionality. I don't think it's too hard to implement on the face of it, but it has some implications for downstream users, which means we would need very good communication in our release notes and upgrade documentation.

stepping through what would need to happen to do that:

we could also probably come up with a nice, gradual release process to this final state that is minimally disruptive, but of course that will take time, though interested users could probably do a jump-upgrade if needed

sunjayBhatia commented 1 year ago

Yeah it seems the only other option is to add a liveness probe to the envoy container.

One other quick idea I just had around getting a pod to "recover" after the shutdown-manager is restarted would be to have the shutdown-manager ensure on startup that Envoy was in the right state. This won't work because you can't recover an Envoy from "draining": once it goes into that state it won't allow Listener updates/modifications (see the Attention block here: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners).

We could also have the shutdown-manager on startup check if Envoy was draining (using https://www.envoyproxy.io/docs/envoy/latest/api-v3/admin/v3/server_info.proto#envoy-v3-api-enum-admin-v3-serverinfo-state) and have it wait for the drain to finish and itself make Envoy exit (https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquitquit). But again this seems worse than using proper pod lifecycle hooks for this and doesn't solve the emptyDir issue.
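To illustrate, that startup check could look roughly like the following. This is only a sketch of the idea above, not something Contour ships: the UDS admin socket path matches the example deployment, and the fixed sleep stands in for a real drain-wait:

```bash
# Sketch of the "check state on startup, then make Envoy exit" idea.
# Assumes the Envoy admin interface is on the shared UDS at /admin/admin.sock,
# as in the example deployment; everything else here is illustrative.
ADMIN_SOCK=/admin/admin.sock
STATE=$(curl -s --unix-socket "$ADMIN_SOCK" http://localhost/server_info |
  sed -n 's/.*"state": *"\([A-Z_]*\)".*/\1/p')
if [ "$STATE" = "DRAINING" ]; then
  # Give in-flight connections some time to drain, then ask Envoy to exit so
  # the kubelet restarts the container with a clean (non-draining) state.
  sleep 30
  curl -s -X POST --unix-socket "$ADMIN_SOCK" http://localhost/quitquitquit
fi
```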

sunjayBhatia commented 1 year ago

TBH I don't know if we've tried to move to this, but this endpoint could be useful: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners

(rather than setting the healthcheck to fail and waiting for existing connections to finish)
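In admin-API terms the two approaches differ roughly as follows (the socket path is illustrative; both endpoints are documented in the Envoy admin docs linked above):

```bash
# Current approach (per the discussion above): mark the healthcheck as failed,
# then wait for existing connections to finish before exiting.
curl -s -X POST --unix-socket /admin/admin.sock http://localhost/healthcheck/fail

# Suggested alternative: ask Envoy itself to drain its listeners gracefully.
curl -s -X POST --unix-socket /admin/admin.sock "http://localhost/drain_listeners?graceful"
```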

sunjayBhatia commented 1 year ago

Opened this: https://github.com/projectcontour/contour/issues/4852

skriss commented 1 year ago

Few more notes so I don't forget:

skriss commented 1 year ago

https://github.com/projectcontour/contour/compare/main...skriss:contour:exp-sdm-requests-and-probes removes the liveness probe from the shutdown-manager container, adds a liveness probe to the envoy container, and adds resource requests to both containers. This should mitigate the issue by resulting in fewer shutdown-manager restarts, and enabling envoy to recover if/when it does get stuck in a "draining" state.

https://github.com/projectcontour/contour/compare/main...skriss:contour:exp-sdm-in-envoy gets rid of the shutdown-manager sidecar, creates a single Docker image with both contour and envoy binaries, and runs the contour envoy shutdown command as a preStop hook in the consolidated envoy container. This approach avoids most of the issues related to coordinating between multiple containers in the pod. Note that, because we're still using an emptyDir volume for the bootstrap config, this does not entirely solve #4322, but does get us a step closer.

skriss commented 1 year ago

Coming back to this, here's what I propose doing for the upcoming 1.24 release:

  1. Remove the shutdown-manager's liveness probe. This will make it less likely that the shutdown-manager gets restarted by itself, and even if it does become unresponsive for any reason, the worst thing that happens is that the Pod shuts down without first gracefully draining connections, which is not as bad as the Pod getting permanently stuck in an unresponsive state.
  2. Add documentation to the installation instructions stating that resource requests should be added to all the containers, and providing some reasonable starting values (with the caveat that they will need to be tuned based on usage). I'd rather not add default values to the example YAML since those could cause problems for some users.
  3. Add a page to https://projectcontour.io/docs/v1.23.2/troubleshooting describing the overall issue in detail, along with mitigations.

I'd also like to hold off on adding a liveness probe to the Envoy container for now, since getting it wrong has the potential to cause more issues, but I do think it's something we should do, maybe for next release, once we have a chance to fully tune all the params.

skriss commented 1 year ago

Moving this to 1.25.0 for any remaining follow-ups.

rajatvig commented 1 year ago

Here's what we have done to solve the shutdown-manager issues and avoid the emptyDir issues altogether.

Create a custom Envoy image with contour and a custom shutdown script:

```dockerfile
FROM ghcr.io/projectcontour/contour:v1.23.1 as contour

FROM envoyproxy/envoy:v1.24.1

LABEL MAINTAINER Runtime <etsyweb-runtime@etsy.com>

COPY --from=contour /bin/contour /usr/local/bin/contour
COPY start.sh /usr/local/bin/start.sh
COPY shutdown.sh /usr/local/bin/shutdown.sh

RUN chmod +x /usr/local/bin/start.sh && chmod +x /usr/local/bin/shutdown.sh

ENTRYPOINT ["/usr/local/bin/start.sh"]
```
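Building the image is then the usual flow (the registry and tag below are placeholders):

```bash
# Build and push the consolidated image from the Dockerfile above; replace the
# registry/tag with wherever you host images.
docker build -t registry.example.com/envoy-with-contour:v1.23.1 .
docker push registry.example.com/envoy-with-contour:v1.23.1
```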

The scripts are:

1. start.sh

```bash
#!/usr/bin/env bash

set -eou pipefail

mkdir -p /tmp/config
mkdir -p /tmp/admin

echo "bootstrapping envoy"

contour bootstrap \
  /tmp/config/envoy.json \
  --admin-address="/tmp/admin/admin.sock" \
  --xds-address=contour \
  --xds-port=8001 \
  --resources-dir=/tmp/config/resources \
  --envoy-cafile=/certs/ca.crt \
  --envoy-cert-file=/certs/tls.crt \
  --envoy-key-file=/certs/tls.key

echo "bootstrap succeeded"

envoy "$@"
```

2. shutdown.sh
```bash
#!/usr/bin/env bash

set -eou pipefail

if ! pidof envoy &>/dev/null; then
  exit 0
fi

echo "starting shutdown process"

contour envoy shutdown \
  --drain-delay=30s \
  --check-delay=10s \
  --admin-address=/tmp/admin/admin.sock \
  --ready-file=/tmp/admin/ok

echo "envoy is shutdown"

The rest is binding the preStop hook to call the shutdown.sh script (see the sketch below). The drain wait gives the LoadBalancer and kubelet time to realise the pod is in a terminating state, while still allowing Envoy to process requests until that happens.
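For completeness, the hook wiring could look roughly like this. It's a sketch only: the DaemonSet/container names and namespace come from the example manifests, and the same stanza can equally just be edited into the manifest directly rather than patched:

```bash
# Hypothetical patch adding the preStop hook to the consolidated envoy
# container; "envoy" DaemonSet/container names and the "projectcontour"
# namespace are assumptions and may differ in your setup.
kubectl -n projectcontour patch daemonset envoy --type=strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: envoy
        lifecycle:
          preStop:
            exec:
              command: ["/usr/local/bin/shutdown.sh"]
'
```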

skriss commented 1 year ago

Thanks @rajatvig, that is great information.

github-actions[bot] commented 1 year ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

You can:

Please send feedback to the #contour channel in the Kubernetes Slack

sunjayBhatia commented 1 year ago

An additional issue, encountered when running the Gateway API conformance tests: when an Envoy pod is created and then deleted before the shutdown-manager becomes ready, the preStop HTTP call to the shutdown-manager fails, which leaves the pod in a Terminating state until the graceful cleanup timeout.

a5r0n commented 9 months ago

/reopen ?