netbox-community / netbox-healthcheck-plugin

Apache License 2.0

Simple plaintext / status code url #6

Closed nemith closed 8 months ago

nemith commented 8 months ago

NetBox HealthCheck Plugin version

Latest?

NetBox version

3.0.12

Feature type

Change to existing functionality

Proposed functionality

I was excited to see this project after seeing https://github.com/netbox-community/netbox/issues/8831. However, I was a bit surprised that the health check view/URL returns a full webpage intended for humans (which does not seem to be the intention of the original issue).

I think the human-readable status page is a great addition, but for automated health checking by systems such as load balancers, a much simpler health check should be offered.

I propose adding an additional endpoint, /_health (or similar), that returns a simple text response of OK (with a status code of 200) or NOT OK (with a status code of 500). Additional statuses such as DEGRADED could be added if desired (however, I fail to see where this would be useful).

The simplest implementation is to just return OK as a string. Later enhancements can add DB and Redis connection stats, or even look at counters of failed responses, etc., but I recommend starting easy unless there is something easy to implement that gives a better signal.
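To make the proposal concrete, here is a rough, framework-agnostic sketch of what I have in mind. It is not the plugin's actual code: `health_status`, `health_app`, and the `checks` registry are hypothetical names, and it uses a bare WSGI callable from the standard library rather than a real NetBox/Django view. Each check is assumed to be a callable that raises on failure.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry: check name -> callable that raises if the dependency is down.
Check = Callable[[], None]


def health_status(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Return (200, "OK") if every check passes, otherwise (500, "NOT OK")."""
    for check in checks.values():
        try:
            check()
        except Exception:
            # Any failing check marks the whole node as unhealthy.
            return 500, "NOT OK"
    return 200, "OK"


def health_app(environ, start_response, checks: Optional[Dict[str, Check]] = None):
    """Minimal WSGI callable that serves the plaintext result."""
    code, body = health_status(checks or {})
    reason = "OK" if code == 200 else "Internal Server Error"
    payload = body.encode("ascii")
    start_response(
        f"{code} {reason}",
        [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(payload))),
        ],
    )
    return [payload]
```

With an empty registry this always answers 200 and OK, which matches the "start easy" option; DB or Redis checks could be registered later without changing the response format.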

Use case

The intention is to be able to return a small payload to determine the health of a system. Multi-kilobyte payloads (like one in a fully templated page) are not great for automated systems looking for the health of a system.

  1. The longer it takes to return a response, the more delayed and less useful the health check is for determining whether a node is healthy.
  2. A lot of these systems store the response and don't scale well when the response is large (e.g. templated HTML).

This is used for the same scenarios listed in https://github.com/netbox-community/netbox/issues/8831.

  1. Multi-instance NetBox deployments behind a load balancer can route traffic away from unhealthy nodes.
  2. During a deployment with k8s or other systems, the health check endpoint is used to determine whether the deployment was successful. If it was not, the deployment is automatically rolled back. This means deployments can be automated.
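For illustration, this is roughly how an automated consumer would use such an endpoint. It is only a sketch under assumptions: the /_health URL and the exact OK body are the ones proposed above (they do not exist in the plugin today), and `is_healthy` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """True only if the endpoint answers 200 with a body of exactly OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Read only a few bytes: the whole point is a tiny payload.
            return resp.status == 200 and resp.read(16).strip() == b"OK"
    except (urllib.error.URLError, OSError):
        # Covers non-2xx responses (HTTPError), refused connections, and timeouts.
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll the probe up to `attempts` times; False means 'roll back'."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A deploy script could call something like `wait_until_healthy(lambda: is_healthy("http://netbox.example/_health"))` (hostname made up) and trigger a rollback when it returns False.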

Although the existing "pretty" page at /healthcheck works for this use-case, it will not work for my environment where health check response size is important.

External dependencies

No response

llamafilm commented 6 months ago

@nemith did you find an answer for this? I have the same question.