Machine ID: Add healthz/readyz endpoints to `tbot`

webvictim commented 1 year ago

What would you like `tbot` to do?

Expose /healthz and /readyz endpoints like Teleport does (https://goteleport.com/docs/reference/metrics/)

What problem does this solve?

Monitoring usability and performance of tbot

If a workaround exists, please include it.

Manually scrape logs and check process via system tools.

jBouyoud commented 1 year ago

As also with availability to log in json 🙏

strideynet commented 1 year ago

Going to rename this ticket to just refer to healthz and readyz since we have metrics now.

strideynet commented 1 year ago

> As also with availability to log in json 🙏

Coming soon: https://github.com/gravitational/teleport/pull/30755

strideynet commented 11 months ago

Ticket raised by @programmerq with additional details

Expected behavior:

When running tbot --diag-addr=0.0.0.0:3000, tbot should provide /healthz and /readyz endpoints for use in configuring liveness and readiness probes in Kubernetes Deployments or StatefulSets.

The health endpoint(s) should reflect whether the bot has successfully provided the credentials it is supposed to. That way Kubernetes can restart the pod when the bot is stuck in a failed state.

Current behavior:

Currently, running tbot --diag-addr=0.0.0.0:3000 only sets up the /metrics and /pprof endpoints. It does not provide /healthz or /readyz endpoints, which are necessary for effectively managing the health and readiness of the containerized tbot process within a Kubernetes cluster.

Bug details:

Teleport version: 14.1.5
Recreation steps:
- Run tbot with the diagnostic address flag: --diag-addr=0.0.0.0:3000.
- /healthz and /readyz are not present.

See: https://github.com/gravitational/teleport/blob/v14.1.5/lib/tbot/tbot.go#L168-L188

strideynet commented 11 months ago

We'll implement this as two endpoints:

- Liveness: Begins returning 200 as soon as tbot starts.
- Readiness: Begins returning 200 as soon as tbot has successfully joined the cluster and can start generating outputs and offering services. This should return an error when tbot is shutting down.

At this time, I'd rather avoid using the concept of "healthiness" since it doesn't seem to map to a tangible state of tbot. This may change in the future if we have the concept of outputs which run separately.
