Open programmerq opened 12 months ago
Right now each service gracefully shuts down on its own, and we have no real concept of dependencies between services other than some bespoke fixed things such as "the auth server must start before everything else, if enabled".
All external listeners are closed as the first step in the graceful shutdown because a new Teleport process (either in a new host process, or as part of the internal restart behavior caused by a CA rotation) might be taking its place. We recently added a way to discern whether a shutdown is caused by a Teleport-driven restart, so we could use that mechanism to delay closing the metrics endpoint until everything else has closed.
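To illustrate the ordering being proposed (drain everything else first, close the metrics/diagnostics endpoint last), here is a minimal sketch using only the Go standard library. It is not Teleport's code: `startServer`, `shutdownInOrder`, `probe`, and `runDemo` are hypothetical names, and two plain `http.Server` values stand in for a Teleport service and the diagnostics server.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// startServer serves h on an ephemeral localhost port and returns the
// server together with its base URL. Hypothetical helper for this sketch.
func startServer(h http.Handler) (*http.Server, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	srv := &http.Server{Handler: h}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called
	return srv, "http://" + ln.Addr().String(), nil
}

// shutdownInOrder gracefully stops the service first and stops the
// diagnostics server only after the service has finished draining.
// afterDrain, if non-nil, runs between the two steps.
func shutdownInOrder(ctx context.Context, service, diag *http.Server, afterDrain func()) error {
	if err := service.Shutdown(ctx); err != nil {
		return err
	}
	if afterDrain != nil {
		afterDrain()
	}
	return diag.Shutdown(ctx)
}

// probe reports whether a GET to url answers with 200 OK.
func probe(url string) bool {
	c := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true},
	}
	resp, err := c.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// runDemo reports whether /readyz answered after the service had drained
// (duringDrain) and after the diagnostics server was stopped (afterExit).
func runDemo() (duringDrain, afterExit bool, err error) {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	service, _, err := startServer(ok)
	if err != nil {
		return false, false, err
	}
	diag, diagURL, err := startServer(ok)
	if err != nil {
		return false, false, err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	err = shutdownInOrder(ctx, service, diag, func() {
		duringDrain = probe(diagURL + "/readyz")
	})
	afterExit = probe(diagURL + "/readyz")
	return duringDrain, afterExit, err
}

func main() {
	during, after, err := runDemo()
	fmt.Println("readyz reachable after drain:", during, "after exit:", after, "err:", err)
}
```

In a real process the decision to keep the diagnostics listener open would presumably depend on the restart-detection mechanism mentioned above, since a replacement process may need the same port.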
> This means that Kubernetes may kill the pod before the connections gracefully end on their own
I don't think that Kubernetes will kill a pod in Terminating state before the end of the grace period (at which point it would kill it anyway if it's still up).
> and metrics are lost during the time that Teleport is shutting down.
Yes, that's quite annoying when looking at metrics :(
Expected behavior:
Issuing a QUIT signal to a Teleport proxy process with active sessions through a trusted root cluster should gracefully shut down the proxy, allowing sessions to conclude naturally. The diagnostics server (/readyz, /healthz endpoints) should continue reporting the health and readiness of the proxy until the process has fully exited.
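One possible reading of "continue reporting the health and readiness" is sketched below: during shutdown the diagnostics handler stays up, /healthz keeps answering 200, and /readyz switches to 503 so that load balancers stop routing new traffic. This is an assumption about the desired semantics, not Teleport's actual behavior; `diagState`, `handler`, and `status` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// diagState is a hypothetical stand-in for process state; it is not a
// Teleport type.
type diagState struct {
	shuttingDown atomic.Bool
}

// handler serves /healthz and /readyz from the shared state.
func (s *diagState) handler() http.Handler {
	mux := http.NewServeMux()
	// The process is alive for as long as it can answer at all.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness flips to 503 once a graceful shutdown has started.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if s.shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// status issues an in-memory GET against h and returns the status code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &diagState{}
	h := s.handler()
	fmt.Println(status(h, "/readyz"), status(h, "/healthz")) // 200 200

	s.shuttingDown.Store(true) // e.g. set when QUIT is received
	fmt.Println(status(h, "/readyz"), status(h, "/healthz")) // 503 200
}
```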
Current behavior:
Upon issuing a QUIT signal (kill -QUIT) to a Teleport proxy process, sessions that are routed through a trusted root cluster are terminated abruptly rather than closing gracefully. Additionally, the diagnostics server becomes unresponsive immediately, refusing connections and not providing health status or metrics about ongoing sessions. This means that Kubernetes may kill the pod before the connections gracefully end on their own, and metrics are lost during the time that Teleport is shutting down.
Bug details:
Teleport version: 13.4.5
Recreation steps:
1. Send a QUIT signal to the Teleport proxy process (kill -QUIT <TELEPORT_PID>).
2. Query the diagnostics endpoint (curl http://localhost:3000/readyz) and observe connection refusal.

Debug logs:
Immediately after graceful shutdown, these "listener is closed" traces appear when there are active connections from a trusted cluster.
As soon as QUIT is received, diagnostics and health endpoints become unavailable almost immediately: