medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT
GNU Affero General Public License v3.0

Expand base set of alert metrics and determine priority levels #98

Open eljhkrr opened 7 months ago

eljhkrr commented 7 months ago

Existing instance alert rules have been compiled here: https://docs.google.com/spreadsheets/d/1-sq1Bfz-8i3TyVn9rNcYzJKxcy4wreySUC3YAQH3nkA/edit#gid=0
These rules were built from audit data analysis in #35: https://docs.google.com/spreadsheets/d/1ZAHqPidHckvfQUoGdcE2AlPjiduC3A0kyUDNCBYTltw/edit#gid=0
To make the alert system more usable, alert rules need to be reviewed against current deployment needs.

mrjones-plip commented 6 months ago

@eljhkrr - we're now adding the couch2pg backlog. Is this a good place to capture the need to alert on that value? Current thinking is that we should alert if it increases over a 24-hour period. Couch2pg runs every 6 hours on most Medic prod instances.
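To make the proposed condition concrete, here is a minimal Python sketch of "alert if the backlog increased over 24 hours". It is only an illustration of the logic, not the rule used in watchdog: the function name `backlog_increased`, the `(timestamp, backlog)` sample format, and the choice to compare the newest sample against the oldest one inside the window are all assumptions. Comparing across the whole window rather than between consecutive samples means a temporary bump between two 6-hourly couch2pg runs would not trigger the alert on its own.

```python
from datetime import datetime, timedelta


def backlog_increased(samples, window=timedelta(hours=24)):
    """Return True if the newest backlog value is higher than the oldest
    value recorded within `window` of the newest sample.

    `samples` is a list of (timestamp, backlog) tuples (hypothetical format).
    """
    if not samples:
        return False
    ordered = sorted(samples)
    latest_time, latest_value = ordered[-1]
    # Keep only samples that fall inside the look-back window.
    in_window = [(t, v) for t, v in ordered if latest_time - t <= window]
    if len(in_window) < 2:
        # Not enough data to say anything about a trend.
        return False
    oldest_value = in_window[0][1]
    return latest_value > oldest_value


# Example: one sample every 6 hours, matching the couch2pg schedule above.
start = datetime(2024, 1, 1)
growing = [(start + timedelta(hours=6 * i), v)
           for i, v in enumerate([100, 120, 90, 110, 130])]
recovering = [(start + timedelta(hours=6 * i), v)
              for i, v in enumerate([100, 150, 80, 60, 100])]
```

With these sample series, `backlog_increased(growing)` is true because the backlog ends higher than it started, while `backlog_increased(recovering)` is false because the backlog returned to its starting value within the window.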

eljhkrr commented 6 months ago

Thanks @mrjones-plip, this is the right place for alert metric proposals. I've added it to the document to make it easier to review as a batch.

eljhkrr commented 6 months ago

More alert proposals for p0-p3 metrics have been added to the doc for consideration.

mrjones-plip commented 6 months ago

Thanks for the update @eljhkrr !

Any alerts that depend on content from the ingest data repo should be added there instead of in watchdog. We can do another bind mount of the alert config, like we do for watchdog.

eljhkrr commented 6 months ago

Thanks @mrjones-plip, will keep that in mind when making updates.