gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0
17.37k stars 1.74k forks source link

Teleport crashes on reload of teleport.service with the diagnostics server enabled #25989

Open polarina opened 1 year ago

polarina commented 1 year ago

Issuing systemctl reload teleport.service causes the teleport service to crash and die.

In addition, the process exits with a status code of 0, which fails to trigger the Restart=on-failure condition of the systemd service.

The diagnostics server has to be enabled in /etc/teleport.yml for this issue to trigger:

teleport:
  diag_addr: '[::1]:53080'

We run a separate backround process that issues requests http://[::1]:53080/healthz and http://[::1]:53080/readyz once every second.

$ sudo journalctl -u teleport --no-hostname
-- Logs begin at Wed 2023-05-10 13:36:09 UTC, end at Wed 2023-05-10 13:46:41 UTC. --
May 10 13:36:33 systemd[1]: Started Teleport Service.
May 10 13:39:30 systemd[1]: Reloading Teleport Service.
May 10 13:39:30 systemd[1]: Reloaded Teleport Service.
May 10 13:39:37 teleport[1097]: 2023-05-10T13:39:37Z WARN [DIAG:1]    Diagnostic server exited with error: accept tcp [::1]:53080: use of closed network connection. pid:1097.1 service/service.go:2815
May 10 13:39:37 teleport[1097]: 2023-05-10T13:39:37Z WARN [PROXY:AGE] Agent pool exited. error:[
May 10 13:39:37 teleport[1097]: ERROR REPORT:
May 10 13:39:37 teleport[1097]: Original Error: *errors.errorString context canceled
May 10 13:39:37 teleport[1097]: Stack Trace:
May 10 13:39:37 teleport[1097]:         github.com/gravitational/teleport/lib/reversetunnel/agentpool.go:268 github.com/gravitational/teleport/lib/reversetunnel.(*AgentPool).run
May 10 13:39:37 teleport[1097]:         github.com/gravitational/teleport/lib/reversetunnel/agentpool.go:255 github.com/gravitational/teleport/lib/reversetunnel.(*AgentPool).Start.func1
May 10 13:39:37 teleport[1097]:         runtime/asm_amd64.s:1598 runtime.goexit
May 10 13:39:37 teleport[1097]: User Message: context canceled] localCluster: targetCluster:marel.teleport.sh reversetunnel/agentpool.go:256
May 10 13:39:37 teleport[1097]: 2023-05-10T13:39:37Z WARN [NODE:1:CA] Re-init the cache on error error:[
May 10 13:39:37 teleport[1097]: ERROR REPORT:
May 10 13:39:37 teleport[1097]: Original Error: *trace.ConnectionProblemError watcher is closed: error reading from server: use of closed network connection
May 10 13:39:37 teleport[1097]: Stack Trace:
May 10 13:39:37 teleport[1097]:         github.com/gravitational/teleport/lib/cache/cache.go:1119 github.com/gravitational/teleport/lib/cache.(*Cache).fetchAndWatch
May 10 13:39:37 teleport[1097]:         github.com/gravitational/teleport/lib/cache/cache.go:944 github.com/gravitational/teleport/lib/cache.(*Cache).update
May 10 13:39:37 teleport[1097]:         runtime/asm_amd64.s:1598 runtime.goexit
May 10 13:39:37 teleport[1097]: User Message: watcher is closed: error reading from server: use of closed network connection] cache/cache.go:950
May 10 13:39:37 systemd[1]: teleport.service: Succeeded.
May 10 13:39:37 systemd[1]: teleport.service: Consumed 1.236s CPU time, read 96.8M from disk, written 5.6M to disk, received 491.9K IP traffic, sent 192.3K IP traffic.

$ teleport version
Teleport v12.3.3 git:api/v12.3.3-0-g78eb115 go1.20.4
zmb3 commented 11 months ago

Hmmm... on 14.1.1 Teleport seems to recover for me.

2023-10-25T13:25:49-06:00 INFO [PROXY:SER] SSH proxy service 14.1.1:v14.1.1-0-gfb6429e is starting on [::]:3080 pid:2574.1 service/service.go:4305
2023-10-25T13:25:49-06:00 INFO [PROC:1]    The new service has started successfully. Starting syncing rotation status with period 10m0s. pid:2574.1 service/connect.go:684
2023-10-25T13:25:49-06:00 INFO [PROXY:AGE] Starting reverse tunnel agent pool. service/service.go:4349
2023-10-25T13:25:49-06:00 INFO [PROXY:PRO] Starting Kube proxy on [::]:3080. service/service.go:4455
2023-10-25T13:25:49-06:00 INFO [DB:SERVIC] Starting Database Postgres proxy server on [::]:3080. pid:2574.1 service/service.go:4554
2023-10-25T13:25:49-06:00 INFO [DB:SERVIC] Starting Database TLS proxy server on 0.0.0.0:3080. pid:2574.1 service/service.go:4572
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Starting proxy gRPC server on [::]:3080. pid:2574.1 service/service.go:5937
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Starting proxy gRPC server on [::]:3080. pid:2574.1 service/service.go:6011
2023-10-25T13:25:49-06:00 INFO [PROXY:SER] Starting TLS ALPN SNI proxy server on [::]:3080. pid:2574.1 service/service.go:4649
2023-10-25T13:25:49-06:00 INFO [PROC:1]    New service has started successfully. pid:2574.1 service/service.go:1258
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Received message from pid 2574: Process 2574 has started. path:/usr/local/bin/teleport pid:2557.1 workingDir:/ service/signals.go:541
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Child process signals success. path:/usr/local/bin/teleport pid:2574 workingDir:/ service/signals.go:549
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Successfully started new process, shutting down gracefully. pid:2557.1 service/signals.go:121
2023-10-25T13:25:49-06:00 WARN [DIAG:1]    Diagnostic server exited with error: accept tcp [::1]:53080: use of closed network connection. pid:2557.1 service/service.go:3146
2023-10-25T13:25:49-06:00 WARN [PROC:1]    Teleport process has exited with error. error:[
ERROR REPORT:
Original Error: *errors.errorString context canceled
Stack Trace:
        github.com/gravitational/teleport/lib/auth/server_info.go:68 github.com/gravitational/teleport/lib/auth.(*Server).ReconcileServerInfos
        github.com/gravitational/teleport/lib/service/service.go:2091 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService.func7
        github.com/gravitational/teleport/lib/service/supervisor.go:550 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
        github.com/gravitational/teleport/lib/service/supervisor.go:281 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: context canceled] pid:2557.1 service:auth.server_info service/supervisor.go:286
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Signaled success to parent process. pid:2574.1 service/signals.go:211
2023-10-25T13:25:49-06:00 INFO [PROXY:SER] Shutting down gracefully. pid:2557.1 service/service.go:4721
2023-10-25T13:25:49-06:00 INFO [DIAG:1]    Shutting down gracefully. pid:2557.1 service/service.go:3156
2023-10-25T13:25:49-06:00 INFO [DIAG:1]    Exited. pid:2557.1 service/service.go:3160
2023-10-25T13:25:49-06:00 INFO [AUTH:1]    Shutting down immediately (auth service does not currently support graceful shutdown). pid:2557.1 service/service.go:2110
2023-10-25T13:25:49-06:00 INFO [UPLOAD:1]  File upload completer is shutting down. pid:2557.1 service/service.go:2945
2023-10-25T13:25:49-06:00 INFO [UPLOAD:1]  File upload completer has shut down. pid:2557.1 service/service.go:2947
2023-10-25T13:25:49-06:00 INFO [UPLOAD:1]  File uploader is shutting down. pid:2557.1 service/service.go:2913
2023-10-25T13:25:49-06:00 INFO [UPLOAD:1]  File uploader has shut down. pid:2557.1 service/service.go:2915
2023-10-25T13:25:49-06:00 INFO [PROXY:SER] Exited. pid:2557.1 service/service.go:4109
2023-10-25T13:25:49-06:00 INFO [WEB]       Closing session cache. web/sessions.go:692
2023-10-25T13:25:49-06:00 INFO [WEB]       Closing session cache. web/sessions.go:692
2023-10-25T13:25:49-06:00 WARN [PROC:1]    Teleport process has exited with error. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError listener is closed
Stack Trace:
        github.com/gravitational/teleport/lib/srv/alpnproxy/listener.go:85 github.com/gravitational/teleport/lib/srv/alpnproxy.(*ListenerMuxWrapper).Accept
        google.golang.org/grpc@v1.58.2/server.go:859 google.golang.org/grpc.(*Server).Serve
        github.com/gravitational/teleport/lib/service/service.go:5938 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initPublicGRPCServer.func1
        github.com/gravitational/teleport/lib/service/supervisor.go:550 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
        github.com/gravitational/teleport/lib/service/supervisor.go:281 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: listener is closed] pid:2557.1 service:proxy.grpc.public service/supervisor.go:286
2023-10-25T13:25:49-06:00 WARN [PROC:1]    Teleport process has exited with error. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError listener is closed
Stack Trace:
        github.com/gravitational/teleport/lib/srv/alpnproxy/listener.go:85 github.com/gravitational/teleport/lib/srv/alpnproxy.(*ListenerMuxWrapper).Accept
        google.golang.org/grpc@v1.58.2/server.go:859 google.golang.org/grpc.(*Server).Serve
        github.com/gravitational/teleport/lib/service/service.go:6012 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initSecureGRPCServer.func1
        github.com/gravitational/teleport/lib/service/supervisor.go:550 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
        github.com/gravitational/teleport/lib/service/supervisor.go:281 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: listener is closed] pid:2557.1 service:proxy.grpc.secure service/supervisor.go:286
2023-10-25T13:25:49-06:00 INFO [PROXY:SER] Exited. pid:2557.1 service/service.go:4775
2023-10-25T13:25:49-06:00 INFO [AUTH:1]    Exited. pid:2557.1 service/service.go:2125
2023-10-25T13:25:49-06:00 WARN [AUTH:1]    Mux serve failed. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError listener is closed
Stack Trace:
        github.com/gravitational/teleport/lib/multiplexer/wrappers.go:137 github.com/gravitational/teleport/lib/multiplexer.(*Listener).Accept
        crypto/tls/tls.go:66 crypto/tls.(*listener).Accept
        github.com/gravitational/teleport/lib/multiplexer/tls.go:110 github.com/gravitational/teleport/lib/multiplexer.(*TLSListener).Serve
        github.com/gravitational/teleport/lib/auth/middleware.go:271 github.com/gravitational/teleport/lib/auth.(*TLSServer).Serve.func1
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: listener is closed
        listener is closed] auth/middleware.go:272
2023-10-25T13:25:49-06:00 INFO [PROC:1]    All services stopped, exiting. pid:2557.1 service/signals.go:123
2023-10-25T13:25:49-06:00 WARN [PROC:1]    Teleport process has exited with error. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError error reading from server: EOF
Stack Trace:
        github.com/gravitational/teleport/api@v0.0.0/client/client.go:820 github.com/gravitational/teleport/api/client.(*Client).Ping
        github.com/gravitational/teleport/lib/service/awsoidc.go:56 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initDeployServiceUpdater
        github.com/gravitational/teleport/lib/service/supervisor.go:550 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
        github.com/gravitational/teleport/lib/service/supervisor.go:281 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: error reading from server: EOF] pid:2574.1 service:update.aws-oidc.deploy.agents service/supervisor.go:286
2023-10-25T13:25:49-06:00 WARN [PROC:1]    Non-critical service update.aws-oidc.deploy.agents has exited with error error reading from server: EOF, continuing to operate. pid:2574.1 service/signals.go:156

Not the cleanest restart. I temporarily see an unhealthy cache:

2023-10-25T13:25:49-06:00 WARN [PROXY:BEA] Heartbeat failed EOF. srv/heartbeat.go:282
2023-10-25T13:25:49-06:00 INFO [PROC:1]    Detected Teleport component "proxy" is running in a degraded state. pid:2574.1 service/state.go:106
2023-10-25T13:25:49-06:00 WARN [PROXY:1:C] Re-init the cache on error error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError watcher is closed: error reading from server: EOF
Stack Trace:
        github.com/gravitational/teleport/lib/cache/cache.go:1185 github.com/gravitational/teleport/lib/cache.(*Cache).fetchAndWatch
        github.com/gravitational/teleport/lib/cache/cache.go:983 github.com/gravitational/teleport/lib/cache.(*Cache).update
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: watcher is closed: error reading from server: EOF] cache/cache.go:989
2023-10-25T13:25:50-06:00 WARN [PROXY]     Restart watch on error: error reading from server: EOF. pid:2574.1 resource-kind:[{lock false  map[]   {} [] 0}] services/watcher.go:259
2023-10-25T13:25:51-06:00 WARN [PROXY]     Restart watch on error: watcher is closed: watcher closed. pid:2574.1 resource-kind:[{node false  map[]   {} [] 0}] services/watcher.go:259
2023-10-25T13:25:52-06:00 WARN [PROXY:1]   Restart watch on error: watcher is closed: watcher closed. pid:2574.1 resource-kind:[{proxy false  map[]   {} [] 0}] services/watcher.go:259
2023-10-25T13:25:53-06:00 WARN [PROXY]     Restart watch on error: watcher is closed: watcher closed. pid:2574.1 resource-kind:[{cert_authority false  map[]   {} [] 0}] services/watcher.go:259

But then things settle a few seconds later:

2023-10-25T13:25:54-06:00 INFO [PROC:1]    Teleport component "proxy" is recovering from a degraded state. pid:2574.1 service/state.go:120
2023-10-25T13:25:55-06:00 WARN [PROXY:SER] Restart watch on error: watcher is closed: watcher closed. pid:2574.1 resource-kind:[{kube_server false  map[]   {} [] 0}] services/watcher.go:259
2023-10-25T13:26:04-06:00 INFO [PROC:1]    Teleport component "proxy" has recovered from a degraded state. pid:2574.1 service/state.go:124