Add monitoring via prometheus and cht-watchdog

kennsippell commented 5 months ago

Page views { by METHOD and path and status code }
Users uploaded per instance over time
Logins { success count, failure count, p2 failure reason }
Count of created users { success, failure, retries, p2 time to upload }
Alerts for outages
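
One way the "by METHOD and path and status code" style breakdowns above could map onto Prometheus is as counters with labels, one series per label combination. A minimal TypeScript sketch follows. The `chtusers_` metric names, the label names and the `LabeledCounters` class are all hypothetical and are not part of the tool:

```typescript
// Hypothetical names throughout; this is an illustration, not the tool's API.
type Labels = Record<string, string>;

// Escape a label value for the Prometheus text format:
// backslash, double quote and newline must be escaped.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Build a stable series key such as: name{method="GET",status="200"}
// Label names are sorted so the same labels always give the same key.
function seriesKey(metric: string, labels: Labels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${escapeLabelValue(labels[k])}"`);
  return pairs.length > 0 ? `${metric}{${pairs.join(",")}}` : metric;
}

class LabeledCounters {
  private readonly series = new Map<string, number>();

  // Increment the series identified by metric name plus labels.
  inc(metric: string, labels: Labels = {}, by = 1): void {
    const key = seriesKey(metric, labels);
    this.series.set(key, (this.series.get(key) ?? 0) + by);
  }

  // One "series value" line per label combination, in insertion order.
  render(): string {
    const lines = Array.from(this.series.entries()).map(([k, v]) => `${k} ${v}`);
    return lines.join("\n") + "\n";
  }
}

const counters = new LabeledCounters();
counters.inc("chtusers_page_views_total", { method: "GET", path: "/login", status: "200" });
counters.inc("chtusers_page_views_total", { method: "GET", path: "/login", status: "200" });
counters.inc("chtusers_logins_total", { result: "failure", reason: "bad_password" });
const rendered = counters.render();
// rendered is:
// chtusers_page_views_total{method="GET",path="/login",status="200"} 2
// chtusers_logins_total{reason="bad_password",result="failure"} 1
```

If raw request paths were used as a label, every distinct URL would become its own series, so a route template is probably a safer label value than the full path.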

mrjones-plip commented 3 months ago

I recommend splitting this ticket into three steps/tasks:

1. Expose a native Prometheus endpoint in the user man tool. I strongly suggest that no password be needed to hit the Prometheus endpoint. This is a best practice! I believe this article covers publishing custom metrics from a Node.js-based app. Work with a generic Prometheus/Grafana instance to test that scraping works as expected (this could be a local dev Watchdog instance).
2. Publish the new user man tool with the Prometheus endpoint enabled and push it to production for KE, TG and UG.
3. Work on adding the new metrics to Medic's Watchdog instance and alerting on them as needed.

mrjones-plip commented 2 months ago

For item 1 above, let's break it down for just the "Users uploaded per instance over time" metric. Let's further assume this is an int that is ever incrementing. I'd solve it like this:

1. Find a way to store long-lived metrics. Looking at the compose file, you don't have any bind mounts, and I suspect you'll need to add one so the data persists across container reboots. (As an aside, SQLite might be a fun way to keep metrics, but it doesn't matter. It just needs to support the current and future metrics.)
2. Have the app read from and write to the metric storage every time a user is created. It needs to read from it to know the current count, increment that in memory, and then write it back out to disk. You should ensure this is generic enough that other metrics can be read and written using the same code path.
3. Add a new route or service, or whatever other code is needed, to expose a new /metrics endpoint accessible via HTTP.
4. Ensure the new /metrics endpoint uses the correct format. This docs page with examples likely has what you need.
5. Add a scrape job to your local Watchdog dev instance and make sure the metric shows up!
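
The storage, increment and formatting steps above could be sketched roughly like this in TypeScript. It is only a sketch under assumptions: `MetricStore`, `InMemoryStore`, `CounterRegistry` and the metric name `chtusers_users_uploaded_total` are hypothetical, and the in-memory store stands in for whatever file- or SQLite-backed storage ends up being used:

```typescript
// Hypothetical storage interface. A file- or SQLite-backed version could
// implement the same two methods; an in-memory one is shown here.
interface MetricStore {
  load(): Record<string, number>;
  save(values: Record<string, number>): void;
}

class InMemoryStore implements MetricStore {
  private values: Record<string, number> = {};
  load(): Record<string, number> {
    return { ...this.values };
  }
  save(values: Record<string, number>): void {
    this.values = { ...values };
  }
}

// Generic counter registry: read, increment in memory, write back.
// Any counter registered here goes through the same code path.
class CounterRegistry {
  private readonly store: MetricStore;
  private readonly help: Record<string, string> = {};

  constructor(store: MetricStore) {
    this.store = store;
  }

  register(metric: string, help: string): void {
    this.help[metric] = help;
    const values = this.store.load();
    if (!(metric in values)) {
      values[metric] = 0;
      this.store.save(values);
    }
  }

  inc(metric: string, by = 1): number {
    const values = this.store.load();
    const next = (values[metric] ?? 0) + by;
    values[metric] = next;
    this.store.save(values);
    return next;
  }

  // Prometheus text exposition format: HELP line, TYPE line, then the sample.
  render(): string {
    const values = this.store.load();
    const lines: string[] = [];
    for (const metric of Object.keys(this.help)) {
      lines.push(`# HELP ${metric} ${this.help[metric]}`);
      lines.push(`# TYPE ${metric} counter`);
      lines.push(`${metric} ${values[metric] ?? 0}`);
    }
    return lines.join("\n") + "\n";
  }
}

const registry = new CounterRegistry(new InMemoryStore());
registry.register("chtusers_users_uploaded_total", "Users uploaded.");
registry.inc("chtusers_users_uploaded_total"); // call wherever a user is created
registry.inc("chtusers_users_uploaded_total");
const metricsBody = registry.render();
// metricsBody is:
// # HELP chtusers_users_uploaded_total Users uploaded.
// # TYPE chtusers_users_uploaded_total counter
// chtusers_users_uploaded_total 2
```

The /metrics route would then return `registry.render()` as the response body. Keeping storage behind an interface means the choice of persistence can change without touching the code that increments or renders.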

I'm happy to help with any of these steps! As well, let me know if this isn't the question you were asking - I'm happy to try to answer again ;)

ernestoteo commented 2 months ago

@mrjones-plip thank you.

I am exploring the options you provided above.

Is there a way to send these metrics directly to the Prometheus time series database? If so, are there APIs we can use to send the data?

mrjones-plip commented 2 months ago

@ernestoteo - Prometheus works in a pull model, not a push model. So the target (user man tool) sets up an HTTP endpoint for anyone to use. Then you tell Prometheus to scrape that target.
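
To make the pull model concrete, here is a minimal TypeScript sketch of what the target side could look like. The `handleRequest` function, the `SimpleResponse` shape and the metric name are hypothetical and framework-agnostic; a real app would wire the same logic into its own router:

```typescript
// Hypothetical response shape; a real app would map this onto its web framework.
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// In-memory value that the user-creation code would increment (hypothetical).
let usersUploaded = 0;
function recordUserUploaded(): void {
  usersUploaded += 1;
}

// Prometheus pulls: on each scrape interval it sends GET /metrics and stores
// whatever values the target reports at that moment. The app never pushes.
function handleRequest(method: string, path: string): SimpleResponse {
  if (path !== "/metrics") {
    return { status: 404, headers: {}, body: "Not Found\n" };
  }
  if (method !== "GET") {
    return { status: 405, headers: { Allow: "GET" }, body: "Method Not Allowed\n" };
  }
  const body =
    "# TYPE chtusers_users_uploaded_total counter\n" +
    `chtusers_users_uploaded_total ${usersUploaded}\n`;
  return {
    status: 200,
    headers: { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" },
    body,
  };
}

// Two scrapes with one user upload in between: each scrape just reads the
// current value, and Prometheus builds the time series from those readings.
recordUserUploaded();
const firstScrape = handleRequest("GET", "/metrics");
recordUserUploaded();
const secondScrape = handleRequest("GET", "/metrics");
```

Here `firstScrape.body` reports a value of 1 and `secondScrape.body` reports 2. The only thing Prometheus needs to know is the target's host, port and path.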

I'd be happy to pair with you to give a demo of how this works with Watchdog. You could then easily use the watchdog dev environment to add a new scrape target of your local user man tool that has the new HTTP endpoint.

Also wanted to note that @kennsippell suggested we not use persistent/long-lived metrics (step 1 above) and instead just keep the metrics in memory. They'll reset on every reboot of the app, which is fine for now.
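
Part of why in-memory counters are workable: Prometheus' `rate()` and `increase()` functions treat any drop in a counter's value as a reset and adjust for it. The TypeScript below is a simplified illustration of that idea only, not Prometheus' actual implementation (which also extrapolates to the edges of the time window). The function name and sample values are made up:

```typescript
// Simplified, hypothetical illustration of reset-aware counting.
// Input: successive scraped values of one counter, oldest first.
function increaseWithResets(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    // A drop means the process restarted and the counter began again at zero,
    // so the whole current value counts as new increase.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// Scraped values of an in-memory counter; the app restarts after the third scrape.
const scraped = [5, 8, 12, 2, 6];
const uploadedDuringWindow = increaseWithResets(scraped); // 3 + 4 + 2 + 4 = 13
```

The one thing that is lost is any increment that happened between the last scrape before a restart and the restart itself.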

mrjones-plip commented 2 months ago

I'm doing some Prometheus-related development over in the data ingest repo, which might interest you @ernestoteo!

medic / cht-user-management

Add monitoring via prometheus and cht-watchdog #68