Background
I'm trying to improve the dashboards used by Digital Earth Australia for monitoring our Datacube OWS deployment, as used for DEA Maps.
We have Datacube OWS deployed into Kubernetes using the ODC Helm Chart, with metrics being recorded by Prometheus, and a dashboard created within Grafana.
The problem I'm having is dealing with noise in the metrics from the automated HTTP requests Kubernetes makes to monitor the health of all the OWS instances. Kubernetes provides three probe types: startup, readiness and liveness. We have the startup and readiness probes set up to make a WMS GetMap request for a rarely used layer, and the liveness probe set up to hit the /ping endpoint.
Noise Problem
However, this creates a constant background stream of WMS requests that makes the recorded metrics hard to use, particularly while the deployment is automatically scaling up or down to handle load, but in practice all the time. For example, see the following log captures for the two types of requests.
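One way to measure how much of the traffic is probe noise is to split the access logs by client. Kubelet HTTP probes send a User-Agent header of the form kube-probe/<version>, so a minimal Python sketch can bucket requests on that. It assumes the access log uses the common "combined" format (request line first quoted field, user agent last); the regex, the function names and the sample lines are illustrative only and are not taken from our actual logs.

```python
import re
from collections import Counter

# Assumption: access-log lines are in the "combined" log format, e.g.
# 10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /ping HTTP/1.1" 200 2 "-" "kube-probe/1.27"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def is_probe(user_agent: str) -> bool:
    """Kubelet HTTP probes send a User-Agent of the form kube-probe/<version>."""
    return user_agent.startswith("kube-probe/")


def split_traffic(lines):
    """Count requests per path, separating kubelet probes from all other traffic."""
    probes, others = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # ignore lines that do not match the assumed format
        parts = m.group("request").split()
        # Drop the query string so every WMS request groups under one path.
        path = parts[1].split("?", 1)[0] if len(parts) >= 2 else "-"
        bucket = probes if is_probe(m.group("agent")) else others
        bucket[path] += 1
    return probes, others


if __name__ == "__main__":
    # Hypothetical sample lines, not real OWS log output.
    sample = [
        '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /ping HTTP/1.1" 200 2 "-" "kube-probe/1.27"',
        '10.0.0.1 - - [01/Jan/2024:00:00:05 +0000] "GET /wms?service=WMS&request=GetMap HTTP/1.1" 200 512 "-" "kube-probe/1.27"',
        '203.0.113.7 - - [01/Jan/2024:00:00:06 +0000] "GET /wms?service=WMS&request=GetMap HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
    ]
    probes, others = split_traffic(sample)
    print("probe requests:", dict(probes))
    print("other requests:", dict(others))
```

Because both the GetMap probe requests and the /ping requests carry the kube-probe user agent, they land in the same bucket, which makes it easy to see how much of the /wms volume is real user traffic.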
Questions/(Partial?) Solution
I've heard that making requests to /ping for the probes should be sufficient to check that OWS is operating, including checking connectivity to the DB. For this to work, /ping needs to respond with a status code between 200 and 399 inclusive when healthy, because Kubernetes treats any other code as a probe failure.
Is this the behaviour of /ping?
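The success rule above is the one Kubernetes applies to HTTP probes. A minimal Python sketch of that rule, for checking by hand what /ping returns before switching the startup and readiness probes over to it, is below. The function names and the port 8000 are hypothetical placeholders; the sketch only applies the status-code rule and cannot by itself confirm whether /ping exercises the database connection.

```python
import http.client


def probe_succeeded(status: int) -> bool:
    """The Kubernetes HTTP probe rule: 200 <= status < 400 succeeds, anything else fails."""
    return 200 <= status < 400


def check_endpoint(host: str, port: int, path: str = "/ping", timeout: float = 5.0) -> bool:
    """Send one GET request and apply the probe rule.

    Connection errors and timeouts count as failures, as they do for a real probe.
    Redirects are not followed: a 3xx status is judged by the rule above.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        print(f"GET {path} -> {response.status}")
        return probe_succeeded(response.status)
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical: run inside an OWS pod (e.g. via kubectl exec) using the
    # port the OWS container actually listens on; 8000 is only a placeholder.
    ok = check_endpoint("localhost", 8000)
    print("probe would pass" if ok else "probe would fail")
```

Running this against /ping while the database is deliberately unreachable would show whether the endpoint's status code changes, which is the behaviour the probes depend on.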