littlemanco / the-golden-path.net

A template for writing a new tool or service.

Healthchecks #4

Open andrewhowdencom opened 4 years ago

andrewhowdencom commented 4 years ago

Something like:

There's an abstract health handler, and components register themselves with it as either healthy or not. They periodically check in with their health status, and a check-in that doesn't arrive within a timeout marks that component unhealthy.
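
A minimal sketch of that in Go (all the names here are made up for illustration, not from an existing library):

```go
package health

import (
	"sync"
	"time"
)

// status is one component's last reported health and when it was reported.
type status struct {
	healthy bool
	at      time.Time
}

// Registry is the abstract health handler: components register by checking
// in, and a stale check-in counts as unhealthy (the "it times out" part).
type Registry struct {
	mu       sync.Mutex
	timeout  time.Duration
	statuses map[string]status
}

func NewRegistry(timeout time.Duration) *Registry {
	return &Registry{timeout: timeout, statuses: make(map[string]status)}
}

// CheckIn records the named component's current health.
func (r *Registry) CheckIn(component string, healthy bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.statuses[component] = status{healthy: healthy, at: time.Now()}
}

// Healthy reports whether every registered component recently checked in
// as healthy.
func (r *Registry) Healthy() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, s := range r.statuses {
		if !s.healthy || time.Since(s.at) > r.timeout {
			return false
		}
	}
	return true
}
```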

Two interesting ideas here:

SLO Based health check

Essentially, whenever a given instance of an application is unable to meet its SLOs (as measured internally, with a reset period), it marks itself as unhealthy. This way, if it's next to a noisy neighbor or is otherwise degraded, it can be automatically rescheduled.
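
A rough sketch, assuming a simple availability SLO measured as a success ratio over a reset window (the names and the 99.9% example are illustrative):

```go
package health

import (
	"sync"
	"time"
)

// SLOCheck tracks request outcomes and reports unhealthy while the success
// ratio over the current window is below the SLO target.
type SLOCheck struct {
	mu            sync.Mutex
	target        float64       // e.g. 0.999 for a 99.9% availability SLO
	window        time.Duration // the reset period
	windowStart   time.Time
	total, errors int
}

func NewSLOCheck(target float64, window time.Duration) *SLOCheck {
	return &SLOCheck{target: target, window: window, windowStart: time.Now()}
}

// Observe records the outcome of a single request.
func (c *SLOCheck) Observe(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.maybeReset()
	c.total++
	if err != nil {
		c.errors++
	}
}

// Healthy reports whether this instance is currently meeting its SLO.
func (c *SLOCheck) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.maybeReset()
	if c.total == 0 {
		return true // no traffic yet; err on the side of healthy
	}
	return float64(c.total-c.errors)/float64(c.total) >= c.target
}

// maybeReset clears the counters once the reset period has elapsed.
func (c *SLOCheck) maybeReset() {
	if time.Since(c.windowStart) > c.window {
		c.total, c.errors = 0, 0
		c.windowStart = time.Now()
	}
}
```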

Fail open

While it's currently unclear how the override would be triggered, it should be possible to artificially mark all service instances "healthy", instantly and fleet-wide, regardless of their internal state.

This allows mitigating upstream failures and recovering faster. It would be even better with some sort of timeout, so the override expires on its own.

Propagating this via the "standard configuration" (practically envvar → file → cli) should be fine, as deployment tooling should allow updating the file in place at runtime.
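
A sketch of that override, assuming the envvar → file → cli order means "first source that is set wins"; the variable, path, and flag names are all placeholders:

```go
package health

import (
	"flag"
	"os"
	"strings"
)

var failOpenFlag = flag.Bool("health.fail-open", false, "mark all health checks healthy")

// FailOpen reports whether the operator has forced all checks to "healthy".
// It assumes flag.Parse() has already been called by the caller.
func FailOpen() bool {
	// 1. Environment variable: settable without touching the binary.
	if v, ok := os.LookupEnv("HEALTH_FAIL_OPEN"); ok {
		return strings.EqualFold(v, "true")
	}
	// 2. A file that deployment tooling can update in place at runtime.
	if b, err := os.ReadFile("/etc/app/health.fail-open"); err == nil {
		return strings.EqualFold(strings.TrimSpace(string(b)), "true")
	}
	// 3. Command-line flag.
	return *failOpenFlag
}
```

A health handler would then check FailOpen() first and short-circuit to "healthy" (ideally with the timeout above, so the override can't be forgotten).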


depends:

andrewhowdencom commented 4 years ago

It would be interesting if we did something like a gRPC stream for interested control planes, and pushed a new event whenever our health status changes (i.e. healthy, degraded, and failed).

Should failed just terminate? :thinking: How much of this do we even want in the app?
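
Worth noting: the standard gRPC health checking protocol already has a server-streaming Watch RPC, and the stock Go implementation pushes an update to every open watcher whenever the status changes. It only models SERVING / NOT_SERVING (no "degraded"), so the three-state version would need a custom API, but the wiring is minimal:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// Call this whenever our health changes; every interested control plane
	// with an open Watch stream receives the new status immediately.
	hs.SetServingStatus("my.service.Name", healthpb.HealthCheckResponse_SERVING)

	log.Fatal(srv.Serve(lis))
}
```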

andrewhowdencom commented 4 years ago

Checks:

  1. PKI material expiry. If it's expired, that service is unhealthy.
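
A sketch of that check using the standard library (the certificate path would be a parameter; everything else is stock crypto/x509):

```go
package health

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// CertificateHealthy reports whether the PEM-encoded certificate at path is
// still within its validity period; past NotAfter, the service is unhealthy.
func CertificateHealthy(path string) (bool, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	block, _ := pem.Decode(b)
	if block == nil {
		return false, fmt.Errorf("%s: no PEM data found", path)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	return time.Now().Before(cert.NotAfter), nil
}
```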
andrewhowdencom commented 4 years ago

liveness → whether or not to restart the pod (i.e. "I'm dead")
readiness → whether or not to include the pod in the pool to serve requests
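
In HTTP terms that's two endpoints, following the common /healthz (liveness) and /readyz (readiness) convention; the ready() helper here is a placeholder:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Liveness: failing this tells the kubelet "I'm dead; restart the pod".
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: failing this removes the pod from the pool serving
	// requests, without restarting it.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

// ready is a placeholder for e.g. "dependencies connected, caches warmed".
func ready() bool { return true }
```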

andrewhowdencom commented 4 years ago

Anything that needs credentials should be probed periodically... or should it? :thinking:

At least it's optional.

andrewhowdencom commented 4 years ago

Should healthchecks be renderable into a basic status page? Or aggregated into one? :thinking:

Current thinking is that applications probably shouldn't have a notion of "service health" in and of themselves, but control plane services might. Maybe the service registers itself against that control plane? Maybe there's a "service level" configuration bundled with an application that defines its SLOs and so on? (Overloaded API definitions are also fine.)

Maybe another binary that's designed as a controller that reads the API definitions? Or even have the application generate another endpoint which lists the operations and their SLOs, and just let auto-discovery do its thing? It would have to know the SLO and the metric through which it's measured, then it'd have to look them up. It's the sort of thing that probably makes the most sense exported to a Kubernetes CRD (or so) which is updated via deployment.
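
A sketch of that endpoint; the document shape (operation, target, and the metric it's measured by) is a guess at what a controller would need:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// SLO describes one operation's objective and how it is measured.
type SLO struct {
	Operation string  `json:"operation"` // e.g. an RPC or HTTP route
	Target    float64 `json:"target"`    // e.g. 0.999
	Metric    string  `json:"metric"`    // the metric the SLO is measured by
}

func main() {
	slos := []SLO{
		{Operation: "GET /users/{id}", Target: 0.999, Metric: "http_request_duration_seconds"},
	}

	// A controller binary (or whatever writes the Kubernetes CRD) could
	// auto-discover and read this at deployment time.
	http.HandleFunc("/slos", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(slos)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```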

andrewhowdencom commented 4 years ago

Should applications express things that are anomalous but not necessarily bad? For example, a connection pool might be exhausted, but the application is one replica among many — it doesn't matter. It only matters when you're doing the analysis.

In a sense, you can "hoist" the "unexpected conditions" an application could be in for easy visibility; perhaps as application/problem+json.
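
Something like the following, borrowing the problem-details shape from RFC 7807 (the endpoint path and the example condition are illustrative; the RFC strictly describes one problem per response, so returning a list is a small liberty):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Problem follows the RFC 7807 problem-details structure.
type Problem struct {
	Type   string `json:"type"`
	Title  string `json:"title"`
	Detail string `json:"detail,omitempty"`
}

func main() {
	// Hoist "unexpected conditions" to one endpoint for easy visibility.
	http.HandleFunc("/conditions", func(w http.ResponseWriter, r *http.Request) {
		problems := []Problem{{
			Type:   "https://example.com/problems/connection-pool-exhausted",
			Title:  "Connection pool exhausted",
			Detail: "All 50 connections are in use; requests are queueing.",
		}}
		w.Header().Set("Content-Type", "application/problem+json")
		json.NewEncoder(w).Encode(problems)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```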

andrewhowdencom commented 3 years ago

Healthchecks should be excluded from statistics
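
One way to do that, sketched as HTTP middleware that skips instrumentation for probe endpoints (the record() helper is hypothetical):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", withMetrics(mux)))
}

// withMetrics records request statistics for everything except probes, which
// fire every few seconds and would otherwise dominate (and skew) the stats.
func withMetrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/healthz" && r.URL.Path != "/readyz" {
			record(r) // hypothetical: increment counters, observe latency
		}
		next.ServeHTTP(w, r)
	})
}

// record is a stand-in for real instrumentation, e.g. a Prometheus counter.
func record(r *http.Request) {}
```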