gravitational / teleport

The easiest, most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Kubernetes cluster discovery is flaky after upgrade from 14.3.3 to 14.3.4 #38235

Closed · bothra90 closed this issue 7 months ago

bothra90 commented 8 months ago

Expected behavior: After discovery, the cluster should be accessible via the Kubernetes service.

Current behavior: The cluster is repeatedly added and removed (see logs below).

Bug details:


In particular, the following two lines repeat over and over:

Feb 14 19:01:19 ip-10-42-224-38.us-west-1.compute.internal teleport[7259]: 2024-02-14T19:01:19Z INFO [KUBERNETE] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. pid:7259.1 services/reconciler.go:162
Feb 14 19:01:48 ip-10-42-224-38.us-west-1.compute.internal teleport[7259]: 2024-02-14T19:01:48Z INFO [KUBERNETE] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 removed, deleting. pid:7259.1 services/reconciler.go:144

AntonAM commented 8 months ago

Did you make any config changes to the discovery service? Also, could you share the config?

bothra90 commented 8 months ago

Nothing was changed in the discovery service config. Here's the full config we use:

version: v3
teleport:
  data_dir: /var/lib/teleport
  join_params:
    method: iam
    token_name: outpost-token
  proxy_server: fennel.teleport.sh:443
  log:
    output: stderr
    severity: INFO
    format:
      output: text
  ca_pin: sha256:bc2783105140465fa95eac5e3748d1ad7bb12c39e39b40f0fb3d3727ff01d286
  diag_addr: ""
ssh_service:
  enabled: "yes"
  commands:
  - name: "fennel.ai/cluster-id"
    command: ['echo', '%%FENNEL_CLUSTER_ID%%']
    period: 1m0s
discovery_service:
  enabled: "yes"
  discovery_group: "aws-prod"
  aws:
   - types: ["eks"]
     regions: ["%%REGION%%"]
     tags:
       "managed-by": "fennel.ai"
       "fennel.ai/cluster-id": "%%FENNEL_CLUSTER_ID%%"
kubernetes_service:
  enabled: "yes"
  resources:
  - labels:
      fennel.ai/cluster-id: %%FENNEL_CLUSTER_ID%%
app_service:
  enabled: "yes"
  apps:
  - name: "%%FENNEL_CLUSTER_ID%%-aws-console"
    uri: "https://console.aws.amazon.com/ec2/v2/home"
    labels:
      fennel.ai/cluster-id: %%FENNEL_CLUSTER_ID%%
# Explicitly disabled
auth_service:
  enabled: "no"
proxy_service:
  enabled: "no"
  https_keypairs: []
  https_keypairs_reload_interval: 0s
  acme: {}

AntonAM commented 8 months ago

@bothra90 I see that you have two kube agents connected to the auth server. Is that intentional? Maybe when you upgraded the discovery server you started a new one but left the old one running?
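
(For reference, one way to check for duplicate agents, assuming a tctl version that exposes the kube_server collection and a user permitted to read it:)

# List the Kubernetes service heartbeats registered with the auth server;
# each connected kube agent should show up as one kube_server entry.
tctl get kube_servers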

bothra90 commented 8 months ago

We had two nodes, both running almost the same config as above. I have shut down one of them, but I'm still seeing some errors:

2024-02-17T01:17:42Z INFO [KUBERNETE] Starting Kube service via proxy reverse tunnel. pid:112890.1 service/kubernetes.go:252
2024-02-17T01:17:42Z INFO [DISCOVERY] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. kind:kube_cluster pid:112890.1 services/reconciler.go:162
2024-02-17T01:17:42Z WARN [DISCOVERY] Unable to reconcile resources. error:[
ERROR REPORT:
Original Error: trace.aggregate failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist
Stack Trace:
        github.com/gravitational/teleport/lib/services/reconciler.go:131 github.com/gravitational/teleport/lib/services.(*Reconciler[...]).Reconcile
        github.com/gravitational/teleport/lib/srv/discovery/kube_watcher.go:99 github.com/gravitational/teleport/lib/srv/discovery.(*Server).startKubeWatchers.func4
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist] pid:112890.1 discovery/kube_watcher.go:100

bothra90 commented 8 months ago

Even if we have multiple discovery servers running, shouldn't the "discovery_group" setting lead to resources getting deduplicated?
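
(A sketch of the relevant stanza, assuming deduplication only applies when every discovery agent polling the same account uses the identical group name, as in the config above:)

# Both discovery agents should carry the same discovery_group so their
# discovered EKS clusters are reconciled as one shared set, instead of
# one agent creating the resource and the other deleting it.
discovery_service:
  enabled: "yes"
  discovery_group: "aws-prod"   # must be identical on every discovery agent
  aws:
  - types: ["eks"]
    regions: ["%%REGION%%"]
    tags:
      "managed-by": "fennel.ai"
      "fennel.ai/cluster-id": "%%FENNEL_CLUSTER_ID%%"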

bothra90 commented 8 months ago

@AntonAM: I got some debug logs from the Teleport agent. There's not much new information here, but sharing them anyway.

2024-02-17T06:16:08Z DEBU [DISCOVERY] EKS cluster status is valid: ACTIVE cluster_name:eks-cluster-eksCluster-85868e8 pid:6577.1 fetchers/eks.go:228
2024-02-17T06:16:08Z DEBU [DISCOVERY] Reconciling 0 current resources with 1 new resources. kind:kube_cluster pid:6577.1 services/reconciler.go:112
2024-02-17T06:16:08Z INFO [DISCOVERY] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. kind:kube_cluster pid:6577.1 services/reconciler.go:162
2024-02-17T06:16:08Z DEBU [DISCOVERY] Creating kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832. pid:6577.1 discovery/kube_watcher.go:112
2024-02-17T06:16:08Z DEBU [DISCOVERY] Updating kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832. pid:6577.1 discovery/kube_watcher.go:141
2024-02-17T06:16:08Z WARN [DISCOVERY] Unable to reconcile resources. error:[
ERROR REPORT:
Original Error: trace.aggregate failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist
Stack Trace:
        github.com/gravitational/teleport/lib/services/reconciler.go:131 github.com/gravitational/teleport/lib/services.(*Reconciler[...]).Reconcile
        github.com/gravitational/teleport/lib/srv/discovery/kube_watcher.go:99 github.com/gravitational/teleport/lib/srv/discovery.(*Server).startKubeWatchers.func4
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist] pid:6577.1 discovery/kube_watcher.go:100

AntonAM commented 8 months ago

@bothra90 Yes, it should deduplicate, or rather it should not try to change identical resources. But it looks like one of the discovery services didn't actually see the EKS cluster for some reason, so one service kept creating it and the other kept deleting it. Regarding the remaining errors, could you run tctl get kube_clusters and post its output here (with a user that has sufficient permissions to read this data)?
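
(A minimal invocation of the above, assuming tctl is run with access to the auth service and a role allowed to read kube_cluster resources:)

# Dump all dynamically registered kube_cluster resources; the discovered
# EKS cluster should appear here with its labels and origin.
tctl get kube_clusters

# Or fetch just the cluster from the logs above:
tctl get kube_cluster/eks-cluster-eksCluster-85868e8-us-west-1-824489454832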

zmb3 commented 7 months ago

Closing due to inactivity.