tilt-dev / tilt

Define your dev environment as code. For microservice apps on Kubernetes.
https://tilt.dev/
Apache License 2.0

cluster liveness check fails #5976

Open nicks opened 1 year ago

nicks commented 1 year ago

Current Behavior

ahy in the slack channel reports that when they connect tilt to their remote cluster, it fails with:

Cluster status error: cluster did not pass liveness check

If they try to run the liveness check manually, they get

$ kubectl get --raw='/livez?verbose'
Error from server (NotFound): the server could not find the requested resource

It appears that livez was added in Kubernetes 1.16 and is not supported on their Rancher distro.

They confirm that the /healthz check works, though.

Possible Solutions

Maybe we should only use /healthz? I'm not sure what the additional benefit of using /livez is.

Alternatively, if we get a 404 from /livez, we could ignore it.

nicks commented 1 year ago

@milas any chance you remember what the reasoning was behind the different health checks?

alternatively, maybe we just skip the health checks on older versions of kubernetes... https://kubernetes.io/docs/reference/using-api/health-checks/
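A version gate along those lines could look like the minimal sketch below. It assumes the server reports a version string of the form `v<major>.<minor>...` (for example `v1.20.15+k3s1`). The helper name `supportsLivez` is hypothetical and is not part of Tilt or client-go. A distribution may still route or block these endpoints differently, so the reported version alone may not tell the whole story.

```go
package main

import "fmt"

// supportsLivez is a hypothetical helper: it reports whether a Kubernetes
// version string such as "v1.20.15+k3s1" is at least v1.16, the release in
// which /livez and /readyz were introduced.
func supportsLivez(gitVersion string) (bool, error) {
	var major, minor int
	// Sscanf stops once the format is consumed, so trailing text such as
	// ".15+k3s1" is ignored.
	if _, err := fmt.Sscanf(gitVersion, "v%d.%d", &major, &minor); err != nil {
		return false, fmt.Errorf("parse version %q: %w", gitVersion, err)
	}
	return major > 1 || (major == 1 && minor >= 16), nil
}

func main() {
	for _, v := range []string{"v1.11.0", "v1.20.15+k3s1"} {
		ok, err := supportsLivez(v)
		fmt.Println(v, ok, err)
	}
}
```

A caller would skip the /livez probe when the helper returns false, and treat a parse error as "unknown" rather than as a failed health check.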

milas commented 1 year ago

Used /livez because of this note from the health checks doc:

The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead.


Alternatively, if we get a 404 from /livez, we could ignore it.

This seems reasonable. We could also try falling back to /readyz in this case.
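The fallback idea can be sketched as follows. This is an illustration under stated assumptions, not Tilt's actual implementation: `probeFunc` and `checkClusterHealth` are hypothetical names, and the probe is abstracted as a function returning an HTTP status code so the logic can be shown without a real cluster. Only a 404 triggers the fallback. Any other non-200 status (for example 401 or 500) is reported as a failure.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// probeFunc is a hypothetical abstraction over "GET <path> on the API server".
// It returns the HTTP status code, or an error if the request itself failed.
type probeFunc func(path string) (int, error)

var errNoHealthEndpoint = errors.New("no health endpoint found on server")

// checkClusterHealth tries /livez first, then falls back to /readyz and
// /healthz when an endpoint is missing (404). It returns the path that
// answered 200 OK.
func checkClusterHealth(probe probeFunc) (string, error) {
	for _, path := range []string{"/livez", "/readyz", "/healthz"} {
		code, err := probe(path)
		if err != nil {
			return path, fmt.Errorf("probe %s: %w", path, err)
		}
		switch code {
		case http.StatusOK:
			return path, nil
		case http.StatusNotFound:
			continue // endpoint not served here; try the next one
		default:
			return path, fmt.Errorf("probe %s: unexpected status %d", path, code)
		}
	}
	return "", errNoHealthEndpoint
}

func main() {
	// Simulate a server that only serves /healthz.
	legacy := func(path string) (int, error) {
		if path == "/healthz" {
			return http.StatusOK, nil
		}
		return http.StatusNotFound, nil
	}
	path, err := checkClusterHealth(legacy)
	fmt.Println(path, err)
}
```

Keeping /livez first preserves the current behavior on clusters that serve it, while servers that only expose /healthz still pass.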

atsai1220 commented 1 year ago

Running into this issue as well in our Rancher environment, with:

❯ tilt version
v0.31.2, built 2023-02-10

We downgraded to the following version to continue to use tilt.

v0.28.1, built 2022-05-01

Browsing through the codebase, I believe a check for a 404 response could be implemented here. Additionally, can we fall back to /healthz as well? https://github.com/tilt-dev/tilt/blob/95a35874112c38057685a3342c4924c83e9d1b7b/internal/k8s/client.go#L765-L786

On the other hand, Rancher could be updated to serve /livez or /readyz, because the Kubernetes documentation says:

Machines that check the healthz/livez/readyz of the API server should rely on the HTTP status code.

I believe this is where Rancher generates the listener for /healthz. https://github.com/rancher/rancher/blob/e2410e02494a5b4bd43c50d8d45ed7df5a3ad0a8/pkg/api/steve/health/health.go#L10-L19

lewis-kori commented 1 year ago

@atsai1220 how do you downgrade tilt? currently facing the same issue

atsai1220 commented 1 year ago

@atsai1220 how do you downgrade tilt? currently facing the same issue

Navigate to the Release page of this repository and download from the Assets menu of your desired version.

Copy the URL for your operating system and retrieve the package:

wget https://github.com/tilt-dev/tilt/releases/download/v0.28.1/tilt.0.28.1.linux.x86_64.tar.gz

MatanAmoyal1 commented 9 months ago

Any plan to add "/healthz" to the cluster API health checks? I'm working on k8s 1.20.15 via Rancher and am currently blocked from using the latest tilt version :(

nicks commented 9 months ago

@MatanAmoyal1 hmmm... /livez should work fine in k8s 1.20, are you sure you're not hitting some other issue / blocking it some other way?

MatanAmoyal1 commented 9 months ago

@nicks it looks like the same issue (k8s 1.20 via Rancher): healthz works, but livez does not.

➜ ~ kubectl proxy &
[1] 37020
Starting to serve on 127.0.0.1:8001

➜ ~ curl 127.0.0.1:8001/healthz
ok
➜ ~ curl 127.0.0.1:8001/livez
404 page not found

MatanAmoyal1 commented 9 months ago

@nicks any plan to merge this PR https://github.com/tilt-dev/tilt/pull/6065 ?

nicks commented 8 months ago

fwiw, i have been unable to reproduce this problem:

k3d cluster create -i rancher/k3s:v1.20.15-k3s1
kubectl get --raw='/livez?verbose'

seems to produce a valid healthcheck for me. is it possible that your devops team is blocking the kubernetes healthcheck routes?

samuellvicente commented 8 months ago

Unfortunately, I'm stuck on a version of OpenShift 3 (k8s v1.11), so I'm unable to use the current version of Tilt because the livez endpoint is not present. Are there any plans to fix this issue? So far I've been using v0.28.1, and it works.

Richie24 commented 6 months ago

We are using Rancher's bundled k3s Kubernetes, and its livez check is behind authentication: https://github.com/k3s-io/k3s/issues/3576#issuecomment-875041119

So, we are sadly also forced to fall back to an older tilt version...

nicks commented 6 months ago

@Richie24 the issue you pointed to is a 401 rather than the 404 reported in other comments, so it sounds like you're hitting a different problem. fwiw, tilt uses your kubectl credentials, so auth shouldn't affect things.