We use OnCall to run Grafana Cloud Logs. Our infrastructure comprises many k8s clusters and Loki installations. We try to keep all of our alerts actionable, and these alerts largely fall into 3 categories right now:
critical: wake up the on-caller to address this immediately (even outside of working hours)
notify: ping the on-caller on Slack only, so that they're aware of the alert and they can address it during working hours
warn: don't draw the on-caller's attention to an alert directly, but if the on-caller has time and is watching out for these alerts, they may choose to address them
We have to keep critical pages, but the other two could both be lumped together into the same category if we had a way of tracking the alerts that have fired; this is where a ticketing system might come in.
This is how I imagine it working:
1. an alert fires but is not urgent enough to warrant paging an engineer
2. an entry gets added to a backlog of alerts to address
   2.1. if the alert self-resolves, mark the entry as auto-resolved
   2.2. if further alerts fire with the same name (or the same grouping), reparent them under the original entry
3. the on-caller can address the entries in their own time, and mark them as done / unactionable / etc.
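To make the flow above a bit more concrete, here is a rough Python sketch of the backlog. It is not an OnCall API: `Backlog`, `Entry`, `Status` and all method names are hypothetical, and two details are my own assumptions rather than part of the proposal: entries are keyed by a single grouping string, and an auto-resolved entry goes back to open if the same alert fires again.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    OPEN = "open"
    AUTO_RESOLVED = "auto-resolved"
    DONE = "done"
    UNACTIONABLE = "unactionable"


@dataclass
class Entry:
    group_key: str        # alert name or grouping key (hypothetical field)
    first_alert_id: str   # the alert that created this entry
    status: Status = Status.OPEN
    children: List[str] = field(default_factory=list)  # reparented alert ids


class Backlog:
    """In-memory backlog of non-paging alerts, keyed by grouping."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def fire(self, alert_id: str, group_key: str) -> Entry:
        """Step 2: add an entry, or reparent under the original (2.2)."""
        entry = self._entries.get(group_key)
        if entry is None:
            entry = Entry(group_key, alert_id)
            self._entries[group_key] = entry
        else:
            entry.children.append(alert_id)
            # Assumption: a re-fire reopens an auto-resolved entry.
            entry.status = Status.OPEN
        return entry

    def auto_resolve(self, group_key: str) -> None:
        """Step 2.1: the alert resolved itself; keep the entry visible."""
        entry = self._entries.get(group_key)
        if entry is not None:
            entry.status = Status.AUTO_RESOLVED

    def close(self, group_key: str, status: Status) -> Entry:
        """Step 3: the on-caller marks the entry done or unactionable."""
        if status not in (Status.DONE, Status.UNACTIONABLE):
            raise ValueError("close() expects DONE or UNACTIONABLE")
        entry = self._entries.pop(group_key)
        entry.status = status
        return entry

    def pending(self) -> List[Entry]:
        """Entries the on-caller has not closed yet."""
        return list(self._entries.values())
```

Keeping auto-resolved entries in `pending()` until someone closes them is deliberate in this sketch: the point of the backlog is that an alert which fired does not disappear just because it stopped firing.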
There's obviously quite a lot to this so I'll stop there, but I'm keen to get the conversation started. I think OnCall is the natural place for this (as opposed to GitHub Issues, although that could be an alternative target?), and it would provide the following benefits:
visibility into fired alerts, both those not yet addressed and those addressed previously
the ability to downgrade all but the most urgent alerts to non-paging, because we won't forget to address them if they're in the queue (wishful thinking of course, but visibility is step 0, tracking is step 1)
metrics on the noisiest alerts, which we could export and use to determine which alerts need to be reconfigured or automated away
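As a rough illustration of the last point, ranking the noisiest alerts could be as simple as counting backlog entries by alert name. This is only a sketch: `noisiest_alerts` and the alert names in the usage example are made up, and a real export would presumably go through whatever metrics pipeline OnCall already has.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisiest_alerts(fired_names: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n alert names by number of times they fired."""
    return Counter(fired_names).most_common(top_n)


# Hypothetical history of fired alert names.
history = ["DiskFull", "HighLatency", "DiskFull", "PodRestart", "DiskFull", "HighLatency"]
print(noisiest_alerts(history, top_n=2))
# prints [('DiskFull', 3), ('HighLatency', 2)]
```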