gravitational / teleport

The easiest, most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Kubernetes cluster discovery is flaky after upgrade from 14.3.3 to 14.3.4 #38235

Closed · bothra90 closed this issue 7 months ago

bothra90 commented 8 months ago

Expected behavior: After discovery, the cluster should be accessible via the Kubernetes service.

Current behavior: The cluster is repeatedly added and removed (see logs below).

Bug details:


In particular, the following two lines repeat over and over:

Feb 14 19:01:19 ip-10-42-224-38.us-west-1.compute.internal teleport[7259]: 2024-02-14T19:01:19Z INFO [KUBERNETE] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. pid:7259.1 services/reconciler.go:162
Feb 14 19:01:48 ip-10-42-224-38.us-west-1.compute.internal teleport[7259]: 2024-02-14T19:01:48Z INFO [KUBERNETE] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 removed, deleting. pid:7259.1 services/reconciler.go:144

AntonAM commented 8 months ago

Did you make any config changes to the discovery service? Also, could you share the config?

bothra90 commented 8 months ago

Nothing was changed in the discovery service config. Here's the full config we use:

version: v3
teleport:
  data_dir: /var/lib/teleport
  join_params:
    method: iam
    token_name: outpost-token
  proxy_server: fennel.teleport.sh:443
  log:
    output: stderr
    severity: INFO
    format:
      output: text
  ca_pin: sha256:bc2783105140465fa95eac5e3748d1ad7bb12c39e39b40f0fb3d3727ff01d286
  diag_addr: ""
ssh_service:
  enabled: "yes"
  commands:
  - name: "fennel.ai/cluster-id"
    command: ['echo', '%%FENNEL_CLUSTER_ID%%']
    period: 1m0s
discovery_service:
  enabled: "yes"
  discovery_group: "aws-prod"
  aws:
   - types: ["eks"]
     regions: ["%%REGION%%"]
     tags:
       "managed-by": "fennel.ai"
       "fennel.ai/cluster-id": "%%FENNEL_CLUSTER_ID%%"
kubernetes_service:
  enabled: "yes"
  resources:
  - labels:
      fennel.ai/cluster-id: %%FENNEL_CLUSTER_ID%%
app_service:
  enabled: "yes"
  apps:
  - name: "%%FENNEL_CLUSTER_ID%%-aws-console"
    uri: "https://console.aws.amazon.com/ec2/v2/home"
    labels:
      fennel.ai/cluster-id: %%FENNEL_CLUSTER_ID%%
# Explicitly disabled
auth_service:
  enabled: "no"
proxy_service:
  enabled: "no"
  https_keypairs: []
  https_keypairs_reload_interval: 0s
  acme: {}

AntonAM commented 8 months ago

@bothra90 I see that you have two kube agents connected to the auth server. Is that intentional? Maybe when you upgraded the discovery server you started a new one but left the old one running?
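
(For reference, one way to check for duplicate agents, assuming a tctl version that exposes the kube_server collection and a user permitted to read it:)

# List the Kubernetes service heartbeats registered with the auth server;
# each connected kube agent should show up as one kube_server entry.
tctl get kube_servers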

bothra90 commented 8 months ago

We had two nodes, both running almost the same config as above. I have shut down one of them, but I'm still seeing some errors:

2024-02-17T01:17:42Z INFO [KUBERNETE] Starting Kube service via proxy reverse tunnel. pid:112890.1 service/kubernetes.go:252
2024-02-17T01:17:42Z INFO [DISCOVERY] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. kind:kube_cluster pid:112890.1 services/reconciler.go:162
2024-02-17T01:17:42Z WARN [DISCOVERY] Unable to reconcile resources. error:[
ERROR REPORT:
Original Error: trace.aggregate failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist
Stack Trace:
        github.com/gravitational/teleport/lib/services/reconciler.go:131 github.com/gravitational/teleport/lib/services.(*Reconciler[...]).Reconcile
        github.com/gravitational/teleport/lib/srv/discovery/kube_watcher.go:99 github.com/gravitational/teleport/lib/srv/discovery.(*Server).startKubeWatchers.func4
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist] pid:112890.1 discovery/kube_watcher.go:100

bothra90 commented 8 months ago

Even if we have multiple discovery servers running, shouldn't the "discovery_group" setting lead to resources getting deduplicated?
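
(A sketch of the relevant stanza, assuming deduplication only applies when every discovery agent polling the same account uses the identical group name, as in the config above:)

# Both discovery agents should carry the same discovery_group so their
# discovered EKS clusters are reconciled as one shared set, instead of
# one agent creating the resource and the other deleting it.
discovery_service:
  enabled: "yes"
  discovery_group: "aws-prod"   # must be identical on every discovery agent
  aws:
  - types: ["eks"]
    regions: ["%%REGION%%"]
    tags:
      "managed-by": "fennel.ai"
      "fennel.ai/cluster-id": "%%FENNEL_CLUSTER_ID%%"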

bothra90 commented 8 months ago

@AntonAM: I got some debug logs from the Teleport agent. There's not much new information here, but sharing them anyway.

2024-02-17T06:16:08Z DEBU [DISCOVERY] EKS cluster status is valid: ACTIVE cluster_name:eks-cluster-eksCluster-85868e8 pid:6577.1 fetchers/eks.go:228
2024-02-17T06:16:08Z DEBU [DISCOVERY] Reconciling 0 current resources with 1 new resources. kind:kube_cluster pid:6577.1 services/reconciler.go:112
2024-02-17T06:16:08Z INFO [DISCOVERY] kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832 matches, creating. kind:kube_cluster pid:6577.1 services/reconciler.go:162
2024-02-17T06:16:08Z DEBU [DISCOVERY] Creating kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832. pid:6577.1 discovery/kube_watcher.go:112
2024-02-17T06:16:08Z DEBU [DISCOVERY] Updating kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832. pid:6577.1 discovery/kube_watcher.go:141
2024-02-17T06:16:08Z WARN [DISCOVERY] Unable to reconcile resources. error:[
ERROR REPORT:
Original Error: trace.aggregate failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist
Stack Trace:
        github.com/gravitational/teleport/lib/services/reconciler.go:131 github.com/gravitational/teleport/lib/services.(*Reconciler[...]).Reconcile
        github.com/gravitational/teleport/lib/srv/discovery/kube_watcher.go:99 github.com/gravitational/teleport/lib/srv/discovery.(*Server).startKubeWatchers.func4
        runtime/asm_arm64.s:1197 runtime.goexit
User Message: failed to create kube_cluster eks-cluster-eksCluster-85868e8-us-west-1-824489454832
        kubernetes cluster "eks-cluster-eksCluster-85868e8-us-west-1-824489454832" doesn't exist] pid:6577.1 discovery/kube_watcher.go:100

AntonAM commented 8 months ago

@bothra90 Yes, it should deduplicate, or rather it should not try to change identical resources. But it looks like one of the discovery services didn't actually see the EKS cluster for some reason, so one service kept creating it and the other kept deleting it. Regarding the remaining errors, could you run tctl get kube_clusters and post its output here (with a user that has sufficient permissions to read this data)?
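
(A minimal invocation of the above, assuming tctl is run with access to the auth service and a role allowed to read kube_cluster resources:)

# Dump all dynamically registered kube_cluster resources; the discovered
# EKS cluster should appear here with its labels and origin.
tctl get kube_clusters

# Or fetch just the cluster from the logs above:
tctl get kube_cluster/eks-cluster-eksCluster-85868e8-us-west-1-824489454832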

zmb3 commented 7 months ago

Closing due to inactivity.