Open eljhkrr opened 7 months ago
@eljhkrr - we're now adding couch2pg backlog as a metric. Is this a good place to capture the need to alert on that value? Current thinking is that we should alert if it increases over a 24-hour period. Couch2pg runs every 6 hours on most medic prod instances.
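For reviewers, here is a minimal sketch of the proposed condition. It is not the actual watchdog rule. It assumes the backlog is sampled as (timestamp, value) pairs, and it reads "increases over a 24 hour period" as "the latest value is higher than the oldest value within the last 24 hours". The name `backlog_increased` and the sample format are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical sample format: (time the backlog was read, backlog size).
Sample = Tuple[datetime, int]


def backlog_increased(samples: List[Sample],
                      window: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the backlog grew across the trailing window.

    Assumption: "increase" means the newest value is greater than the
    oldest value that still falls inside the window.
    """
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s[0])
    latest_time, latest_value = ordered[-1]
    # Keep only samples no older than `window` relative to the newest one.
    in_window = [s for s in ordered if latest_time - s[0] <= window]
    if len(in_window) < 2:
        # A single data point cannot show a trend.
        return False
    oldest_value = in_window[0][1]
    return latest_value > oldest_value


# Usage: one reading every 6 hours, matching the couch2pg schedule above.
base = datetime(2024, 1, 1)
readings = [(base + timedelta(hours=6 * i), v)
            for i, v in enumerate([100, 120, 90, 130, 150])]
print(backlog_increased(readings))  # True: 150 now vs 100 a day earlier
```

Since couch2pg runs every 6 hours, a 24 hour window only holds about five readings, so comparing the two ends of the window is less noisy than reacting to any single rise between runs. A stricter reading (every run higher than the previous one) would need a different check.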
Thanks @mrjones-plip, this is the right place for alert metric proposals. I've added it to the document to make it easier to review as a batch.
More alert proposals for p0-p3 metrics have been added to the doc for consideration.
Thanks for the update @eljhkrr !
Any alerts that depend on content from the ingest data repo should be added there instead of in watchdog. We can do another bind mount of the alert config like we do for watchdog.
Thanks @mrjones-plip, will keep in mind when making updates
Existing instance alert rules have been compiled here: https://docs.google.com/spreadsheets/d/1-sq1Bfz-8i3TyVn9rNcYzJKxcy4wreySUC3YAQH3nkA/edit#gid=0
These rules were built from audit data analysis in #35: https://docs.google.com/spreadsheets/d/1ZAHqPidHckvfQUoGdcE2AlPjiduC3A0kyUDNCBYTltw/edit#gid=0
To make the alert system more usable, alert rules need to be reviewed against current deployment needs.