minvws / nl-kat-coordination

European Union Public License 1.2

Expose telemetry information for various parts of KAT. #3053

Open r3boot opened 2 months ago

r3boot commented 2 months ago

We would like to see some telemetry to determine the state of (parts of) OpenKAT. There are a couple of different approaches for this, depending on the type of web API used. All of them eventually lead to data being fed into a tool that works with OpenTelemetry. Based on the data provided via OpenTelemetry, we can create dashboards and checks that allow us to follow the state of KAT.

For gunicorn this would mean using the StatsD instrumentation built into gunicorn itself (https://docs.gunicorn.org/en/stable/instrumentation.html). This information can be picked up and translated to OpenTelemetry.
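As a rough sketch of what the gunicorn side could look like: gunicorn reads its settings from a Python config file, and `statsd_host` and `statsd_prefix` are the settings that switch the StatsD output on. The address and the prefix below are placeholder values, not anything OpenKAT uses today.

```python
# gunicorn.conf.py: a sketch; host, port and prefix are placeholder values.

# Address (host:port) of the StatsD listener that receives the metrics,
# for example a StatsD-to-Prometheus bridge running next to the app.
statsd_host = "localhost:8125"

# Prefix prepended to every metric name gunicorn emits, so that several
# services can share one listener without their names colliding.
statsd_prefix = "openkat.example"
```

The same settings can also be passed on the command line as `--statsd-host` and `--statsd-prefix`.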

For FastAPI/gunicorn, Prometheus itself provides a client that can be included. This gives full control over the metrics exposed. Which metrics to export is up for discussion, but it starts with adding the exporter (https://prometheus.github.io/client_python/exporting/http/fastapi-gunicorn/).
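One detail worth keeping in mind for the gunicorn case: every worker is a separate process with its own counters, so whatever serves the metrics endpoint has to combine them. The client library covers this with its multiprocess mode. The snippet below is only a stdlib illustration of that idea, not the `prometheus_client` API, and the metric names are made up.

```python
from collections import Counter


def aggregate(worker_samples):
    """Sum counter samples reported by several worker processes."""
    total = Counter()
    for samples in worker_samples:
        # Counter.update adds the values instead of overwriting them.
        total.update(samples)
    return dict(total)


# Hypothetical per-worker request counters, keyed by metric name.
worker_a = {"http_requests_total": 12, "http_errors_total": 1}
worker_b = {"http_requests_total": 30}

combined = aggregate([worker_a, worker_b])
```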

For Django, a middleware exists that can expose telemetry (https://pypi.org/project/django-prometheus/).
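For reference, wiring in django-prometheus is mostly a settings change. A minimal sketch, assuming a stock Django project; the entries between the two Prometheus middlewares stand in for whatever the project already has.

```python
# settings.py (fragment): a sketch, not the actual OpenKAT settings.

INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django_prometheus",
]

MIDDLEWARE = [
    # Must come first so it sees the request before anything else does.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    "django.middleware.common.CommonMiddleware",
    # Must come last so it sees the response after everything else ran.
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]
```

The package also ships a URL config (`django_prometheus.urls`) that can be included in the project's `urls.py` to serve the metrics endpoint.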

Ideally, OpenKAT would come with its own instance of Prometheus to which all the telemetry is logged, combined with one or more views within OpenKAT that work on this data. This would give KAT an industry-standard platform for telemetry, which can be used both by KAT itself and by the monitoring tools available in the networks where KAT is deployed.

underdarknl commented 2 months ago

Our containers do not run Gunicorn. We can either wrap them with Gunicorn and add the StatsD envvar, or we could implement a piece of middleware inside our FastAPI apps: https://github.com/tiangolo/fastapi/issues/526#issuecomment-531048973

The Gunicorn option would make this compatible with our Deb packages and is a quick and easy fix; the latter would allow us to add more intelligent counters inside our apps and allow the app dev to add even more metrics.
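To make the middleware option a bit more concrete, here is a minimal sketch of a pure-ASGI middleware that counts responses per path and status code and sums the time spent handling them. It uses only the standard library and is not the code from the linked comment; the class name, the `/health` path and the demo app are all made up for illustration.

```python
import asyncio
import time
from collections import Counter


class RequestMetricsMiddleware:
    """Pure-ASGI middleware that counts responses and sums their duration."""

    def __init__(self, app):
        self.app = app
        self.responses = Counter()  # keyed by (path, status code)
        self.seconds_total = 0.0

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        # Assumed status if the app raises before it starts a response.
        status = {"code": 500}

        async def send_wrapper(message):
            # The status code travels in the first response message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.seconds_total += time.perf_counter() - start
            self.responses[(scope["path"], status["code"])] += 1


async def demo_app(scope, receive, send):
    """Stand-in ASGI app that always answers 200 with an empty body."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def _receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _discard(message):
    pass


# Drive one fake request through the middleware without a real server.
metrics = RequestMetricsMiddleware(demo_app)
asyncio.run(metrics({"type": "http", "path": "/health"}, _receive, _discard))
```

Because the middleware is itself an ASGI app, a FastAPI app could be wrapped the same way as `demo_app` here, and the wrapped object handed to the server. App developers could then add their own counters next to the generic ones.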

N.B. Adding a direct OpenMetrics endpoint on our API servers introduces a few problems for Prometheus's observing strategy, namely container/endpoint discovery and metric namespacing issues, where multiple instances of a single container will return conflicting counters.
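One common way to keep counters from several instances apart is to give every series a distinguishing label. Prometheus normally attaches an `instance` label per scrape target itself, so this mainly matters when several containers sit behind a single address. The sketch below only shows how labelled samples look in the text exposition format; the metric name, label values and helper function are hypothetical.

```python
def render_counter(name, value, labels):
    """Format one sample in the Prometheus text exposition format."""
    label_text = ",".join(
        f'{key}="{val}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Two instances of the same container report the same metric name, but the
# instance label keeps their series distinct.
samples = [
    render_counter(
        "openkat_tasks_total", 7, {"service": "worker", "instance": "worker-1"}
    ),
    render_counter(
        "openkat_tasks_total", 3, {"service": "worker", "instance": "worker-2"}
    ),
]
```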