Provide a health endpoint for loadbalancer health check (e.g. Kubernetes readiness probes)

projectcaluma / caluma

A collaborative form editing service

https://caluma.io/

GNU General Public License v3.0


Provide a health endpoint for loadbalancer health check (e.g. Kubernetes readiness probes) #1137

Closed tongpu closed 3 years ago

tongpu commented 4 years ago

For monitoring and readiness checks in Kubernetes, it would be beneficial if a dedicated API endpoint (e.g. /healthz) were available to check the health status of the backend. Possible Django plugins that seem to be well maintained are django-health-check and django-watchman. What we're going to consider healthy in the context of caluma (DB connection possible, ...) needs to be defined.

hairmare commented 4 years ago

The Django apps you linked both seem to do a bit more than just report a simple 200/non-200 response on /healthz. At least some of the things they do might be better covered by a dedicated metrics integration à la django-prometheus.

sbor23 commented 4 years ago

Apollo (Node.js reference implementation) uses a .well-known endpoint for this: https://www.apollographql.com/docs/apollo-server/monitoring/health-checks/

yoan-adfinis commented 4 years ago

When doing so, consider the use cases and the options offered by, let's say, k8s: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

A failing liveness probe will restart the container.
A failing readiness probe will cut the traffic to the application.

Be wary to not overthink those as it's easy to build a domino cascade.

tongpu commented 4 years ago

I had the following implementation in mind: a /healthz URL endpoint that takes application status (correct startup) and the DB connection (maybe other required services too) into consideration and would be used for the readiness probe. It would return HTTP status 200 when everything is fine and probably 503 if the service is not ready. The health check we're discussing here is primarily for readiness probes. For liveness probes I would stick to either a TCP check or none at all, because I would expect the process to crash when a serious error occurs.
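To make the proposal concrete, here is a minimal, framework-agnostic sketch of the readiness logic described above. It is not caluma's implementation: the names `readiness` and `check_database` are hypothetical, and an in-memory SQLite connection stands in for the real database so the example is self-contained. In a Django project the same logic would presumably sit behind a view mapped to /healthz.

```python
import json
import sqlite3
from http import HTTPStatus
from typing import Callable, Dict, Tuple

# A check is any callable that returns normally on success and raises on failure.
Check = Callable[[], None]


def readiness(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every check; return (HTTP status code, JSON body).

    200 if all checks pass, 503 if at least one fails.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the service as not ready
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = HTTPStatus.OK if healthy else HTTPStatus.SERVICE_UNAVAILABLE
    return int(status), json.dumps(results, sort_keys=True)


def check_database() -> None:
    """Stand-in DB check (assumption: SQLite replaces the real database here)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    print(readiness({"database": check_database}))
```

Keeping each check a plain callable means further required services could be added to the dictionary later without touching the status-code logic.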

> Be wary to not overthink those as it's easy to build a domino cascade.

Absolutely. Primary goal of this issue should be readiness.

sbor23 commented 3 years ago

For readiness we would need to check the postgres-db and minio as well as caches. django-watchman seems to be the way to go and could be extended, minio is missing. I will give it a go.

tongpu commented 3 years ago

> For readiness we would need to check the postgres-db and minio as well as caches. django-watchman seems to be the way to go and could be extended, minio is missing. I will give it a go.

I guess we might need different sets of health endpoints. One would be for the readiness check and would only check the services required for a working API (postgres-db, to my understanding). What we need to keep in mind if we do that is that all pods in a Kubernetes deployment would become unready and stop serving traffic if the database were down for any reason. Another endpoint would be aimed at humans: it could cover more services and provide additional metadata (e.g. why certain downstream services are not available).
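The split between the two endpoints could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names (`run_checks`, `readiness_endpoint`, `status_endpoint`), the grouping into "required" and "optional" checks, and the choice to always answer 200 on the human-facing endpoint. It uses only the standard library and does not reflect django-watchman's API.

```python
import json
from typing import Callable, Dict, Tuple

# A check returns normally on success and raises on failure.
Check = Callable[[], None]


def run_checks(checks: Dict[str, Check]) -> Dict[str, dict]:
    """Run each check and record whether it passed, plus a failure reason."""
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = {"ok": True, "detail": ""}
        except Exception as exc:
            report[name] = {"ok": False, "detail": f"{type(exc).__name__}: {exc}"}
    return report


def readiness_endpoint(required: Dict[str, Check]) -> int:
    """Readiness probe: only required services decide between 200 and 503."""
    report = run_checks(required)
    return 200 if all(entry["ok"] for entry in report.values()) else 503


def status_endpoint(
    required: Dict[str, Check], optional: Dict[str, Check]
) -> Tuple[int, str]:
    """Human-facing report covering all services, with failure details.

    Assumption: it always answers 200 so the report stays readable
    even when a downstream service is unavailable.
    """
    report = {"required": run_checks(required), "optional": run_checks(optional)}
    return 200, json.dumps(report, indent=2, sort_keys=True)
```

With this shape, an unavailable optional service (minio, say) would show up in the status report without making the pods unready, while a failing required check would still take them out of rotation.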