Expected behavior:
ssh service reconnects to restarted auth service quickly
Current behavior:
almost 12 minutes between node entering degraded state and being recovered
Bug details:
Teleport version: 10.1.2
Recreation steps:
We have following setup:
auth and proxy services running in separate pair of VMs, each behind an L3 load balancer
both services combined with teleport ssh service
load balancer using diag port /readyz handle as health check
After restarting auth services (in case of updating configuration), proxy service starts to report non-200 codes for /readyz handle:
first ~6 minutes it reports 503
then ~6 minutes it reports 400
So, proxy becames unavailable for users for almost 12 minutes.
If I turn off ssh service at proxy - no problem observed, it quickly recovers from auth service restart.
On ssh-only nodes no problem observed. It is somehow related to proxy role.
Debug logs
I've excluded repeating mesasages like DEBU Cluster NPE is missing 1 tunnels. A small number of missing tunnels is normal, for example, a node could have just been shut down, the proxy restarted, etc. However, if this error persists with an elevated number of missing tunnels, it often indicates nodes can not discover all registered proxies. Check that all of your proxies are behind a load balancer and the load balancer is using a round robin strategy. Some of the missing hosts: [Server(name=8ba2b2ae-844a-4765-b24b-99dd08469a2b, namespace=default, addr=, labels=map[ansible_group:undefined])]. reversetunnel/localsite.go:668
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROC:1] Sync rotation state cycle failed. Retrying in ~10s pid:392446.1 service/connect.go:678
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN Inventory control stream failed: inventory control stream closed: error reading from server: EOF inventory/inventory.go:145
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 DEBU Re-attempt control stream acquisition in ~2.410824443s. inventory/inventory.go:124
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [NODE:1:CA] Re-init the cache on error: watcher is closed: error reading from server: EOF. cache/cache.go:775
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 DEBU [NODE:1:CA] Reloading cache. cache/cache.go:779
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROXY:1:C] Re-init the cache on error: watcher is closed: error reading from server: EOF. cache/cache.go:775
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 DEBU [PROXY:1:C] Reloading cache. cache/cache.go:779
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 DEBU [PROC:1] Broadcasting event. event:TeleportDegraded pid:392446.1 service/supervisor.go:383
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 INFO [PROC:1] Detected Teleport component "node" is running in a degraded state. pid:392446.1 service/state.go:105
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROC:1] Connection problem: sync rotation state: Get "https://teleport.cluster.local/v2/authorities/host/NPE?load_keys=false": dial tcp 172.23.8.70:3025: connect: connection refused. pid:392446.1 service/connect.go:779
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROC:1] Sync rotation state cycle failed. Retrying in ~10s pid:392446.1 service/connect.go:678
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROC:1] Connection problem: sync rotation state: Get "https://teleport.cluster.local/v2/authorities/host/NPE?load_keys=false": dial tcp 172.23.8.70:3025: connect: connection refused. pid:392446.1 service/connect.go:779
Aug 25 10:35:44 proxy-hostname teleport[392446]: 2022-08-25T10:35:44+03:00 WARN [PROC:1] Sync rotation state cycle failed. Retrying in ~10s pid:392446.1 service/connect.go:678
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [WEB] Callback start. auth:saml web/saml.go:92
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY] Attempting to restart watch after waiting 2.84735411s. pid:392446.1 resource-kind:node services/watcher.go:206
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 WARN [PROXY] Restart watch on error: watcher is closed: error reading from server: EOF. pid:392446.1 resource-kind:node services/watcher.go:213
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY] Starting watch. pid:392446.1 resource-kind:node services/watcher.go:177
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [WEB] Redirecting to web browser. auth:saml web/saml.go:126
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Attempting to restart watch after waiting 2.914732914s. pid:392446.1 resource-kind:proxy services/watcher.go:206
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 WARN [PROXY:1] Restart watch on error: watcher is closed: error reading from server: EOF. pid:392446.1 resource-kind:proxy services/watcher.go:213
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Starting watch. pid:392446.1 resource-kind:proxy services/watcher.go:177
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] List of known proxies updated: ["3631ac3e-d9cd-4e4c-a400-5e3ae1bdf6cc" "8638a136-4dbb-403d-8d3c-51ac3c0eda9f" "96e1edc7-6867-425e-9ebf-70647d2cc37c" "e066751a-a312-44b3-baa3-a1942e724548"]. pid:392446.1 services/watcher.go:410
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [DISCOVERY] Sending node discovery request with proxies ["3631ac3e-d9cd-4e4c-a400-5e3ae1bdf6cc" "8638a136-4dbb-403d-8d3c-51ac3c0eda9f" "96e1edc7-6867-425e-9ebf-70647d2cc37c" "e066751a-a312-44b3-baa3-a1942e724548"] to 172.23.6.92:53416. reversetunnel/conn.go:239
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [DISCOVERY] Sending node discovery request with proxies ["3631ac3e-d9cd-4e4c-a400-5e3ae1bdf6cc" "8638a136-4dbb-403d-8d3c-51ac3c0eda9f" "96e1edc7-6867-425e-9ebf-70647d2cc37c" "e066751a-a312-44b3-baa3-a1942e724548"] to 172.23.7.92:55926. reversetunnel/conn.go:239
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [DISCOVERY] Sending app discovery request with proxies ["3631ac3e-d9cd-4e4c-a400-5e3ae1bdf6cc" "8638a136-4dbb-403d-8d3c-51ac3c0eda9f" "96e1edc7-6867-425e-9ebf-70647d2cc37c" "e066751a-a312-44b3-baa3-a1942e724548"] to 172.23.7.92:55920. reversetunnel/conn.go:239
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [DISCOVERY] Sending app discovery request with proxies ["3631ac3e-d9cd-4e4c-a400-5e3ae1bdf6cc" "8638a136-4dbb-403d-8d3c-51ac3c0eda9f" "96e1edc7-6867-425e-9ebf-70647d2cc37c" "e066751a-a312-44b3-baa3-a1942e724548"] to 172.23.6.92:53402. reversetunnel/conn.go:239
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [AUTH] ClientCertPool -> cert(NPE issued by NPE:150697127953305817417387513129116094324) auth/middleware.go:680
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [AUTH] ClientCertPool -> cert(NPE issued by NPE:332142151485100923591418454858366357672) auth/middleware.go:680
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [SSH:PROXY] Incoming connection 172.23.7.92:56214 -> 172.23.7.91:3024 version: SSH-2.0-Go, certtype: "host" pid:392446.1 sshutils/server.go:465
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Transport request: teleport-transport. pid:392446.1 reversetunnel/srv.go:635
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Received out-of-band proxy transport request for @remote-auth-server []. pid:392446.1 reversetunnel/transport.go:206
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Attempting to dial through tunnel with server ID "". pid:392446.1 reversetunnel/transport.go:377
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Attempting to dial directly "tp-auth.example.com:3025". pid:392446.1 reversetunnel/transport.go:394
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU Skipping login no-login-b1ec2fa4-04fb-48d0-b474-3c4cc7f8b079, not a valid Unix login. services/role.go:260
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Returning direct dialed connection to "tp-auth.example.com:3025". pid:392446.1 reversetunnel/transport.go:400
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [PROXY:1] Successfully dialed to @remote-auth-server "", start proxying. pid:392446.1 reversetunnel/transport.go:313
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [SSH:PROXY] Incoming connection 172.23.7.92:56210 -> 172.23.7.91:3024 version: SSH-2.0-Go, certtype: "host" pid:392446.1 sshutils/server.go:465
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:56210. pid:392446.1 sshutils/server.go:470
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [SSH:PROXY] Incoming connection 172.23.7.92:56216 -> 172.23.7.91:3024 version: SSH-2.0-Go, certtype: "host" pid:392446.1 sshutils/server.go:465
Aug 25 10:35:47 proxy-hostname teleport[392446]: 2022-08-25T10:35:47+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:56216. pid:392446.1 sshutils/server.go:470
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU [NODE:1] Attempting to restart watch after waiting 3.570612947s. pid:392446.1 resource-kind:lock services/watcher.go:206
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 WARN [NODE:1] Restart watch on error: watcher is closed: error reading from server: EOF. pid:392446.1 resource-kind:lock services/watcher.go:213
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU [NODE:1] Starting watch. pid:392446.1 resource-kind:lock services/watcher.go:177
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU Skipping login no-login-b1ec2fa4-04fb-48d0-b474-3c4cc7f8b079, not a valid Unix login. services/role.go:260
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU Skipping login no-login-b1ec2fa4-04fb-48d0-b474-3c4cc7f8b079, not a valid Unix login. services/role.go:260
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU [SSH:PROXY] Incoming connection 172.23.7.92:56218 -> 172.23.7.91:3024 version: SSH-2.0-Go, certtype: "host" pid:392446.1 sshutils/server.go:465
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:56218. pid:392446.1 sshutils/server.go:470
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU Skipping login no-login-b1ec2fa4-04fb-48d0-b474-3c4cc7f8b079, not a valid Unix login. services/role.go:260
Aug 25 10:35:48 proxy-hostname teleport[392446]: 2022-08-25T10:35:48+03:00 DEBU Skipping login no-login-b1ec2fa4-04fb-48d0-b474-3c4cc7f8b079, not a valid Unix login. services/role.go:260
Aug 25 10:35:49 proxy-hostname teleport[392446]: 2022-08-25T10:35:49+03:00 DEBU [PROXY] Attempting to restart watch after waiting 4.203926241s. pid:392446.1 resource-kind:cert_authority services/watcher.go:206
Aug 25 10:35:49 proxy-hostname teleport[392446]: 2022-08-25T10:35:49+03:00 WARN [PROXY] Restart watch on error: watcher is closed: error reading from server: EOF. pid:392446.1 resource-kind:cert_authority services/watcher.go:213
Aug 25 10:35:49 proxy-hostname teleport[392446]: 2022-08-25T10:35:49+03:00 DEBU [PROXY] Starting watch. pid:392446.1 resource-kind:cert_authority services/watcher.go:177
Aug 25 10:35:49 proxy-hostname teleport[392446]: 2022-08-25T10:35:49+03:00 DEBU [PROXY:1:C] Initiating new watch after waiting 4.41276821s. cache/cache.go:790
Aug 25 10:35:49 proxy-hostname teleport[392446]: 2022-08-25T10:35:49+03:00 DEBU [NODE:1:CA] Initiating new watch after waiting 5.108639005s. cache/cache.go:790
Aug 25 10:35:50 proxy-hostname teleport[392446]: 2022-08-25T10:35:50+03:00 DEBU [PROXY] Attempting to restart watch after waiting 5.336694846s. pid:392446.1 resource-kind:lock services/watcher.go:206
Aug 25 10:35:50 proxy-hostname teleport[392446]: 2022-08-25T10:35:50+03:00 WARN [PROXY] Restart watch on error: watcher is closed: error reading from server: EOF. pid:392446.1 resource-kind:lock services/watcher.go:213
Aug 25 10:35:50 proxy-hostname teleport[392446]: 2022-08-25T10:35:50+03:00 DEBU [PROXY] Starting watch. pid:392446.1 resource-kind:lock services/watcher.go:177
Aug 25 10:38:13 proxy-hostname teleport[392446]: 2022-08-25T10:38:13+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:55906. pid:392446.1 sshutils/server.go:470
Aug 25 10:38:17 proxy-hostname teleport[392446]: 2022-08-25T10:38:17+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:55912. pid:392446.1 sshutils/server.go:470
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:55926. pid:392446.1 sshutils/server.go:470
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:55920. pid:392446.1 sshutils/server.go:470
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [PROXY:SER] Cluster agent disconnected. trace.fields:map[cluster:NPE] reversetunnel/localsite.go:519
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [DISCOVERY] Disconnecting connection to NPE 172.23.7.92:55920: agent disconnected. reversetunnel/conn.go:163
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [PROXY:SER] Cluster connection closed. trace.fields:map[cluster:NPE] reversetunnel/localsite.go:496
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [PROXY:SER] Cluster agent disconnected. trace.fields:map[cluster:NPE] reversetunnel/localsite.go:519
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [DISCOVERY] Disconnecting connection to NPE 172.23.7.92:55926: agent disconnected. reversetunnel/conn.go:163
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [PROXY:SER] Cluster connection closed. trace.fields:map[cluster:NPE] reversetunnel/localsite.go:496
Aug 25 10:38:18 proxy-hostname teleport[392446]: 2022-08-25T10:38:18+03:00 DEBU [SSH:PROXY] Closed connection 172.23.7.92:55918. pid:392446.1 sshutils/server.go:470
Aug 25 10:39:26 proxy-hostname teleport[392446]: 2022-08-25T10:39:26+03:00 DEBU 'teleport-service' group not found, not deleting users srv/usermgmt.go:292
Aug 25 10:41:31 proxy-hostname teleport[392446]: 2022-08-25T10:41:31+03:00 INFO [PROC:1] Teleport component "node" is recovering from a degraded state. pid:392446.1 service/state.go:119
Aug 25 10:44:26 proxy-hostname teleport[392446]: 2022-08-25T10:44:26+03:00 DEBU 'teleport-service' group not found, not deleting users srv/usermgmt.go:292
Aug 25 10:47:24 proxy-hostname teleport[392446]: 2022-08-25T10:47:24+03:00 INFO [PROC:1] Teleport component "node" has recovered from a degraded state. pid:392446.1 service/state.go:123
Expected behavior: ssh service reconnects to restarted auth service quickly
Current behavior: almost 12 minutes between node entering degraded state and being recovered
Bug details:
Teleport version: 10.1.2
Recreation steps:
We have following setup:
After restarting auth services (in case of updating configuration), proxy service starts to report non-200 codes for /readyz handle:
So, proxy becames unavailable for users for almost 12 minutes.
If I turn off ssh service at proxy - no problem observed, it quickly recovers from auth service restart. On ssh-only nodes no problem observed. It is somehow related to proxy role.
DEBU Cluster NPE is missing 1 tunnels. A small number of missing tunnels is normal, for example, a node could have just been shut down, the proxy restarted, etc. However, if this error persists with an elevated number of missing tunnels, it often indicates nodes can not discover all registered proxies. Check that all of your proxies are behind a load balancer and the load balancer is using a round robin strategy. Some of the missing hosts: [Server(name=8ba2b2ae-844a-4765-b24b-99dd08469a2b, namespace=default, addr=, labels=map[ansible_group:undefined])]. reversetunnel/localsite.go:668
Coonfigs:
auth