tilt-dev / tilt

Define your dev environment as code. For microservice apps on Kubernetes.
https://tilt.dev/
Apache License 2.0

Cluster status error since 0.33.9 with eks cluster #6391

Open brian-bk opened 3 weeks ago

brian-bk commented 3 weeks ago

Expected Behavior

Tilt should be able to connect to the cluster on tilt up etc.

Current Behavior

Tilt is unable to connect to the cluster directly. Tilt still manages local_resources, and our Tiltfile runs some kubectl commands manually via local or local_resource, but the managed k8s resources behind a helm_resource do not work. In addition, after Tiltfile processing finishes, a failure is reported on the (Tiltfile) resource:

Successfully loaded Tiltfile (1m14.658932792s)
Cluster status error: Tilt encountered an error connecting to your Kubernetes cluster:
    Get "https://<redacted>.gr7.us-east-1.eks.amazonaws.com/version?timeout=32s": context deadline exceeded
You will need to restart Tilt after resolving the issue.

We have tested 0.33.8 and it works without this issue. I also tested 0.33.15, and the issue, present since 0.33.9, still persists.

Steps to Reproduce

  1. Configure an eks cluster and authenticate against it
  2. Run tilt up
  3. Wait for resources to load; Tilt then cannot connect to the cluster, even though kubectl commands run from a local or local_resource call work

Context

tilt doctor Output

$ tilt doctor
Tilt: v0.33.15, built 2024-05-31
System: darwin-arm64
---
Docker
- Host: unix:///Users/<me>/.docker/run/docker.sock
- Server Version: 26.1.1
- API Version: 1.45
- Builder: 2
- Compose Version: v2.27.0-desktop.2
---
Kubernetes
- Env: eks
- Context: kubernetes-eks-dev
- Cluster Name: arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev
- Namespace: default
- Container Runtime: containerd
- Version: v1.27.13-eks-3af4770
- Cluster Local Registry: none
---
Thanks for seeing the Tilt Doctor!
Please send the info above when filing bug reports. 💗

The info below helps us understand how you're using Tilt so we can improve,
but is not required to ask for help.
---
Analytics Settings
--> (These results reflect your personal opt in/out status and may be overridden by an `analytics_settings` call in your Tiltfile)
- User Mode: opt-in
- Machine: b8542883618c2effbdb7c7ceed78623b
- Repo: dqZ55OF3HaxcqT2x/Y9LwQ==

# relevant .kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <redacted>
    server: https://<redacted>.gr7.us-east-1.eks.amazonaws.com
  name: arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev
contexts:
- context:
    cluster: arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev
    user: arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev
  name: kubernetes-eks-dev
current-context: kubernetes-eks-dev
kind: Config
preferences: {}
users:
- name: arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - us-east-1
      - eks
      - get-token
      - --cluster-name
      - kubernetes-eks-dev
      - --output
      - json
      command: aws
      env:
      - name: AWS_PROFILE
        value: <my-profile-name>
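
For readers less familiar with exec credential plugins, the `exec` stanza above boils down to a single command that the Kubernetes client runs to obtain a token. The Python sketch below renders that stanza as a shell command line so it can be run and timed by hand. The helper name `exec_plugin_command` and the hard-coded `user` dictionary are illustrative assumptions; the values are copied from the config above, and this is not how Tilt or kubectl invoke the plugin internally.

```python
import shlex


def exec_plugin_command(user_entry):
    """Render a kubeconfig `exec` stanza as an equivalent shell command line.

    Hypothetical helper for illustration: `user_entry` is one item of the
    kubeconfig `users` list, already parsed into plain dicts and lists.
    """
    spec = user_entry["user"]["exec"]
    # Entries under `env` are extra environment variables for the plugin process.
    env = [f"{e['name']}={shlex.quote(e['value'])}" for e in spec.get("env") or []]
    # `command` followed by `args` is the argv of the plugin process.
    argv = [spec["command"], *(spec.get("args") or [])]
    return " ".join(env + [shlex.quote(a) for a in argv])


# Values copied from the kubeconfig in this report (placeholders left as-is).
user = {
    "name": "arn:aws:eks:us-east-1:<redacted-eks-arn-id>:cluster/kubernetes-eks-dev",
    "user": {
        "exec": {
            "apiVersion": "client.authentication.k8s.io/v1beta1",
            "args": ["--region", "us-east-1", "eks", "get-token",
                     "--cluster-name", "kubernetes-eks-dev", "--output", "json"],
            "command": "aws",
            "env": [{"name": "AWS_PROFILE", "value": "<my-profile-name>"}],
        }
    },
}

print(exec_plugin_command(user))
```

Running the printed command directly, with the real profile name substituted, is one way to check whether the token helper itself responds quickly outside of Tilt.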

About Your Use Case

This has been happening since 0.33.9, and I forgot to report it right away. It still happens on 0.33.15. For now we've added a check in our Tiltfile to force people onto <=0.33.8 until this is resolved. Maybe it's specific to Amazon EKS's authentication, but I'm not sure.
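
The actual Tiltfile check is not shown in this report. As a rough illustration of the comparison such a check has to make, here is a minimal Python sketch. The names `MAX_WORKING`, `parse_version`, and `is_allowed` are hypothetical. Versions are compared as integer tuples because a plain string comparison would rank "0.33.15" below "0.33.8".

```python
import re

# Last release reported as working in this issue (assumed ceiling for the pin).
MAX_WORKING = (0, 33, 8)


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from text as a tuple of ints."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_allowed(version_text, ceiling=MAX_WORKING):
    """Return True when the version is at or below the pinned ceiling."""
    return parse_version(version_text) <= ceiling


# The first line of the `tilt doctor` output above can be fed in directly.
for candidate in ["Tilt: v0.33.15, built 2024-05-31", "v0.33.9", "v0.33.8"]:
    print(candidate, "->", "ok" if is_allowed(candidate) else "too new")
```

Tuple comparison goes element by element, so (0, 33, 15) is correctly treated as newer than (0, 33, 8).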

nicks commented 3 weeks ago

Hmmm...I tried this with my own EKS cluster, and was not able to repro.

I went through all the changes between 0.33.8 and 0.33.9 and didn't see any changes that would affect how tilt computes cluster status.

nicks commented 3 weeks ago

can you post the output of:

kubectl get -v=6 --raw /version

?

brian-bk commented 3 weeks ago

Sure thing

$ kubectl get -v=6 --raw /version
I0607 11:02:35.245274   24199 loader.go:374] Config loaded from file:  /Users/briankleszyk/.kube/config
I0607 11:02:35.953061   24199 round_trippers.go:553] GET https://<redacted>.gr7.us-east-1.eks.amazonaws.com/version 200 OK in 706 milliseconds
{
  "major": "1",
  "minor": "27+",
  "gitVersion": "v1.27.13-eks-3af4770",
  "gitCommit": "4873544ec1ec7d3713084677caa6cf51f3b1ca6f",
  "gitTreeState": "clean",
  "buildDate": "2024-04-30T03:31:44Z",
  "goVersion": "go1.21.9",
  "compiler": "gc",
  "platform": "linux/amd64"
}

🤷 I don't know if it's relevant or not, but I (and most of our engineers) are using arm64 machines with an amd64 cluster.
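
To make the contrast with Tilt's error concrete, the `round_trippers` line in the kubectl output above can be parsed and compared against the 32s timeout from the `?timeout=32s` query in Tilt's error message. This is a small Python sketch; the regular expression is an assumption based only on the single log line shown here, not on a documented kubectl log format.

```python
import re

# The request line from the kubectl -v=6 output in this thread.
LINE = (
    "I0607 11:02:35.953061   24199 round_trippers.go:553] "
    "GET https://<redacted>.gr7.us-east-1.eks.amazonaws.com/version "
    "200 OK in 706 milliseconds"
)

# Assumed shape: "] METHOD URL STATUS <reason> in N milliseconds".
PATTERN = re.compile(
    r"\] (?P<method>[A-Z]+) (?P<url>\S+) (?P<status>\d{3}) .* in (?P<ms>\d+) milliseconds"
)

# Timeout taken from "?timeout=32s" in Tilt's error, expressed in milliseconds.
TILT_TIMEOUT_MS = 32_000


def parse_round_trip(line):
    """Return method, url, status and latency in ms, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "method": match["method"],
        "url": match["url"],
        "status": int(match["status"]),
        "millis": int(match["ms"]),
    }


result = parse_round_trip(LINE)
print(result)
print("within Tilt's timeout:", result["millis"] < TILT_TIMEOUT_MS)
```

Here kubectl reaches the same `/version` endpoint in well under a second, so the endpoint itself is reachable from this machine with these credentials.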