
QUIT Signal Handling Inconsistencies with Proxy Process in Teleport #34716

Open programmerq opened 10 months ago

programmerq commented 10 months ago

Expected behavior:

Issuing a QUIT signal to a Teleport proxy process with active sessions through a trusted root cluster should gracefully shut down the proxy, allowing sessions to conclude naturally. The diagnostics server (/readyz, /healthz endpoints) should continue reporting the health and readiness of the proxy until the process has fully exited.
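For illustration, a minimal Go sketch of that expected ordering, using hypothetical names (drainSessions, the diag server wiring) rather than Teleport's actual internals: the diagnostics server is only shut down once active sessions have drained and the rest of the process is done.

```go
// Sketch only: expected SIGQUIT handling where diagnostics outlive the drain.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	diag := &http.Server{Addr: ":3000", Handler: mux}
	go diag.ListenAndServe()

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGQUIT)
	<-quit

	// 1. Stop accepting new proxied connections and wait for existing
	//    sessions (including trusted-cluster sessions) to finish naturally.
	drainSessions(context.Background())

	// 2. Only after everything else has drained, take down the diagnostics
	//    server, so /readyz, /healthz and metrics stay visible for the
	//    whole shutdown window.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	diag.Shutdown(ctx)
}

// drainSessions is a placeholder for waiting on active sessions to close.
func drainSessions(ctx context.Context) {
	time.Sleep(time.Second)
}
```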

Current behavior:

Upon issuing a QUIT signal (kill -QUIT <pid>) to a Teleport proxy process, sessions that are routed through a trusted root cluster are terminated abruptly rather than closing gracefully. Additionally, the diagnostics server becomes unresponsive immediately, refusing connections and no longer providing health status or metrics about ongoing sessions. This means that Kubernetes may kill the pod before the connections gracefully end on their own, and metrics are lost during the time that Teleport is shutting down.
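A rough repro helper for observing this, assuming a locally running proxy with its diagnostics address on 127.0.0.1:3000 and a known pid (both assumptions, adjust for your setup): send SIGQUIT and poll /readyz to watch the endpoint start refusing connections while sessions are still draining.

```go
// Sketch only: send SIGQUIT to the proxy and poll the diagnostics endpoint.
package main

import (
	"fmt"
	"net/http"
	"syscall"
	"time"
)

func main() {
	const proxyPID = 74                           // assumed pid of the running proxy
	const readyz = "http://127.0.0.1:3000/readyz" // assumed diag address

	syscall.Kill(proxyPID, syscall.SIGQUIT)

	for i := 0; i < 10; i++ {
		resp, err := http.Get(readyz)
		if err != nil {
			// Observed today: connection refused almost immediately after
			// SIGQUIT, even while sessions are still draining.
			fmt.Println("readyz unreachable:", err)
		} else {
			fmt.Println("readyz status:", resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```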

Bug details:

Immediately after a graceful shutdown begins, these "listener is closed" traces appear when there are active connections from a trusted cluster.

2023-11-13T14:25:30Z INFO [DIAG:1]    Shutting down gracefully. pid:74.1 service/service.go:3078
2023-11-13T14:25:30Z WARN [PROC:1]    Teleport process has exited with error. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError listener is closed
Stack Trace:
    github.com/gravitational/teleport/lib/srv/alpnproxy/listener.go:85 github.com/gravitational/teleport/lib/srv/alpnproxy.(*ListenerMuxWrapper).Accept
    google.golang.org/grpc@v1.58.2/server.go:859 google.golang.org/grpc.(*Server).Serve
    github.com/gravitational/teleport/lib/service/service.go:5820 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initPublicGRPCServer.func1
    github.com/gravitational/teleport/lib/service/supervisor.go:546 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
    github.com/gravitational/teleport/lib/service/supervisor.go:281 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
    runtime/asm_amd64.s:1598 runtime.goexit
User Message: listener is closed] pid:74.1 service:proxy.grpc.public service/supervisor.go:286

The diagnostics and health endpoints become unavailable almost immediately after QUIT is received:

2023-11-13T14:45:55Z WARN [DIAG:1]    Diagnostic server exited with error: accept tcp [::]:3000: use of closed network connection. pid:145.1 service/service.go:3068
2023-11-13T14:45:55Z INFO [UPLOAD:1]  File uploader is shutting down. pid:145.1 service/service.go:2839
2023-11-13T14:45:55Z INFO [UPLOAD:1]  File uploader has shut down. pid:145.1 service/service.go:2841
2023-11-13T14:45:55Z INFO [DIAG:1]    Shutting down gracefully. pid:145.1 service/service.go:3078
espadolini commented 10 months ago

Right now each service gracefully shuts down on its own, and we have no real concept of dependencies between services other than some bespoke fixed things such as "the auth server must start before everything else, if enabled".

All external listeners are closed as the first step in the graceful shutdown because a new Teleport process (either in a new host process, or as part of the internal restart behavior caused by a CA rotation) might be taking its place; we recently added a way to discern whether or not a shutdown is because of a Teleport-driven restart, so we could use that mechanism to delay closing the metrics endpoint until everything else has closed.
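A hedged sketch of that idea, with hypothetical names rather than the real service code: external listeners close first either way (a replacement process may need the addresses), but the diagnostics server is only closed early when the shutdown is a Teleport-driven restart; on a plain shutdown it keeps serving until the drain finishes.

```go
// Sketch only, not Teleport's actual service code.
package shutdown

import (
	"context"
	"net"
	"net/http"
)

// shutdownListeners closes external listeners immediately so a replacement
// process can bind the same addresses, and decides whether the diagnostics
// server can be released early. isRestart would come from the mechanism
// that distinguishes a Teleport-driven restart from a plain shutdown.
func shutdownListeners(drainCtx context.Context, isRestart bool, diag *http.Server, external []net.Listener) {
	for _, l := range external {
		l.Close()
	}

	if isRestart {
		// A new in-place process will serve diagnostics on the same
		// address, so release it immediately.
		diag.Close()
		return
	}

	// Plain shutdown: keep /readyz, /healthz and metrics available until
	// active connections have drained (drainCtx ends once the rest of the
	// process has finished shutting down).
	<-drainCtx.Done()
	diag.Shutdown(context.Background())
}
```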

> This means that Kubernetes may kill the pod before the connections gracefully end on their own

I don't think that Kubernetes will kill a pod in Terminating state before the end of the grace period (at which point it would kill it anyway if it's still up).

> and metrics are lost during the time that Teleport is shutting down.

Yes, that's quite annoying when looking at metrics :(