grafana / oncall


Alert grouping shouldn't take care of routes #4421

Open bmalynovytch opened 4 months ago

bmalynovytch commented 4 months ago

What went wrong?

Config:

What happened:

What did you expect to happen:

How do we reproduce it?

Config:

Reproduce:

=> duplicate alert (one is still "firing", the other is "resolved")

Grafana OnCall Version

v1.4.7

Product Area

Alert Flow & Configuration

Grafana OnCall Platform?

Kubernetes

User's Browser?

No response

Anything else to add?

If one thinks that routing should indeed create duplicate alert groups, then consider leaving the choice to users: it is possible to change the grouping strategy so that it includes the same rules as the routing, while the opposite is impossible.

Example: if you wish to group based on the state of the incoming alert, you could group with "$server--$service--$state".
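
As a rough illustration of that point (plain Python, not OnCall's templating; field names are assumptions): if grouping were evaluated first, a user who wants separate alert groups per state could opt into it simply by including the state in the grouping key, while leaving it out keeps everything in one group.

# Hypothetical sketch, not OnCall code: the "$server--$service--$state" idea above.
def grouping_key(payload, include_state=False):
    key = f"{payload['server']}--{payload['service']}"
    if include_state:
        key += f"--{payload['state']}"
    return key

firing = {"server": "A", "service": "cpu_temp", "state": "firing"}
resolved = {"server": "A", "service": "cpu_temp", "state": "ok"}

print(grouping_key(firing) == grouping_key(resolved))              # True: a single group
print(grouping_key(firing, True) == grouping_key(resolved, True))  # False: split, but by choice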

bmalynovytch commented 4 months ago

Related to #4074, #4276, #3129 and many others where users complain about grouping failures.

mderynck commented 4 months ago

Generally, the way we see people accomplishing this is by keeping grouping and routing separate from state or from fields whose values change (thresholds, occurrence counts, etc.). If you can separate your severity from your state, you would group by server name, route by severity (warning/critical), and keep the state (firing/ok) independent and evaluated only for resolution. That way a state change would not result in the resolution payload creating a new alert group. Whether you can configure it that way depends on the alert payloads, though.

bmalynovytch commented 4 months ago

Hi @mderynck !

Thank you for your comment. Could you then explain how to achieve the following?

Let's say "server A" has it's "CPU temperature" probe triggering the "warning" level. This means the server starts being too hot which remains acceptable (not dangerous). In such a case, the on-call agent doesn't need to be notified while asleep nor during the weekend, that's why we use a routing based on the "warning" level (not really urgent, nobody in the "non urgent" schedule).

For the purpose of the discussion, let's say the T° probe fires at warning ⚠️ level at 9 pm on Saturday. No on-call agent needs to be notified. The alert group is "firing" with no notification, which is perfect. 👌

Then, a couple of hours later, the same server reaches the "critical" 🚨 level. At that T°, the server starts to be in danger: the CPU might either be damaged or produce errors. This time, the routing MUST be different, as the on-call agent absolutely needs to be informed about the incident, even though they are asleep.

In OnCall, this will duplicate the alert group (because of the different routing rules), which means double notifications and no auto-resolution / auto-ACK on at least one of the alert groups. 😭

mderynck commented 3 months ago

Routing happens before grouping. In OnCall, alerts that are going to take different escalation paths become different alert groups, which means that if you want the notification behavior you describe, there will be two alert groups: one for the warning and one for the critical. These are not duplicates; they are different things by definition, because they have different escalation behaviors. The temperature warning alert cannot "become" the temperature critical alert. The way we see this being handled is that they are independent events.

Let's say I have a webhook integration whose Grouping template is {{- payload.alert_type + "-" + payload.server_name }} and whose Autoresolution template is the default {{ payload.get("state", "").upper() == "OK" }}. Now I also set up two routes: the first one calls my phone and has the routing template {{- payload.severity == "critical" }}; the second one does nothing and has the routing template {{- payload.severity == "warning" }}.

First alert arrives:

{
  "server_name": "server A",
  "alert_type": "temperature",
  "severity": "warning",
  "state": "firing"
}

An alert group is created for route 2 and no one is contacted, as it is a warning; it shows as firing in the UI.

Second alert arrives:

{
  "server_name": "server A",
  "alert_type": "temperature",
  "severity": "critical",
  "state": "firing"
}

Another alert group is created, this time for route 1; the user is called because it is critical, and the user acknowledges it. In the UI, two alert groups are displayed: the critical one is acknowledged and the warning one is firing.

Third alert arrives:

{
  "server_name": "server A",
  "alert_type": "temperature",
  "severity": "critical",
  "state": "firing"
}

Goes to route 1 and is added to the open critical alert group we already have, based on the grouping criteria. No one is contacted, since the critical group has already escalated. In the UI, two alert groups are displayed: the acknowledged critical one now shows 2 alerts, and the warning one is still firing.

Fourth alert arrives:

{
  "server_name": "server A",
  "alert_type": "temperature",
  "severity": "critical",
  "state": "ok"
}

Goes to route 1 and is matched with the existing critical alert group by the grouping criteria. This event says that server A no longer has a critical temperature. In the UI we now see 1 resolved critical alert group with 3 alerts and 1 firing warning alert group with 1 alert.

Fifth alert arrives:

{
  "server_name": "server A",
  "alert_type": "temperature",
  "severity": "warning",
  "state": "ok"
}

Goes to route 2 and is matched with the existing warning alert group by the grouping criteria. This event says that server A no longer has a warning temperature. In the UI we now see 1 resolved critical alert group with 3 alerts and 1 resolved warning alert group with 2 alerts.
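
For readability, here is the walkthrough above condensed into a small sketch (plain Python, not OnCall's implementation) that applies the same templates in order: the route is chosen first, grouping only merges alerts that landed on the same route, and a group auto-resolves when it receives an alert with state "OK".

# Minimal sketch of the five-alert walkthrough above (not OnCall's actual code).
alerts = [
    {"server_name": "server A", "alert_type": "temperature", "severity": "warning",  "state": "firing"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "critical", "state": "firing"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "critical", "state": "firing"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "critical", "state": "ok"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "warning",  "state": "ok"},
]

groups = {}  # (route, grouping key) -> {"alerts": count, "resolved": bool}

for p in alerts:
    route = "route 1 (call phone)" if p["severity"] == "critical" else "route 2 (do nothing)"
    key = p["alert_type"] + "-" + p["server_name"]          # Grouping template
    g = groups.setdefault((route, key), {"alerts": 0, "resolved": False})
    g["alerts"] += 1
    if p.get("state", "").upper() == "OK":                  # Autoresolution template
        g["resolved"] = True

for (route, key), g in groups.items():
    print(route, key, g)
# -> route 2 (do nothing) temperature-server A {'alerts': 2, 'resolved': True}
# -> route 1 (call phone) temperature-server A {'alerts': 3, 'resolved': True}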

bmalynovytch commented 3 months ago

While I understand the point of view, monitoring systems don't split events based on severity. Their grouping is based on hosts and services, which means that when the severity changes, it is just an update to the same object. The consequence is that they won't send an "oh, the warning is now over, end of event, but wait, it's now critical, you should open a new event".

I just feel like, because routing was coded to run before grouping, usage is expected to comply with that. I definitely think the choice should be left to users: if grouping is done first, users are able to reproduce both use cases, which is impossible the other way round.
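
To make that concrete, here is a sketch of the behavior being asked for (plain Python, hypothetical, not what OnCall does today): with grouping evaluated before routing, the warning → critical → ok sequence from earlier in the thread becomes updates to a single alert group, which the final "ok" event resolves.

# Hypothetical group-before-route behavior; field names follow the payloads above.
alerts = [
    {"server_name": "server A", "alert_type": "temperature", "severity": "warning",  "state": "firing"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "critical", "state": "firing"},
    {"server_name": "server A", "alert_type": "temperature", "severity": "critical", "state": "ok"},
]

groups = {}  # grouping key only; the escalation path is re-evaluated per incoming alert

for p in alerts:
    key = p["alert_type"] + "-" + p["server_name"]
    g = groups.setdefault(key, {"alerts": 0, "route": None, "resolved": False})
    g["alerts"] += 1
    g["route"] = "call phone" if p["severity"] == "critical" else "do nothing"
    if p.get("state", "").upper() == "OK":
        g["resolved"] = True

print(groups)
# -> {'temperature-server A': {'alerts': 3, 'route': 'call phone', 'resolved': True}}
# One alert group: it escalates when the severity becomes critical and is resolved by the "ok".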

mderynck commented 3 months ago

For reference, can you link the monitoring system you are using that is sending these?

More flexibility in grouping and routing order would be nice to have. There are no immediate plans to make changes in this area.

bmalynovytch commented 3 months ago

We use derivatives of Nagios. Tools like Nagios, Icinga, Shinken, CheckMK, and Cacti all work the same way. You can also add SaaS tools like Uptime Robot to the list.

tyrken commented 3 months ago

We have a number of custom systems feeding events (AWS, Cloudflare, GitHub, etc.) which translate those events into OnCall alerts, and we expect the "OK" message to group with the "Alerting" message that came before it. Unfortunately, we can only decide the severity of an alert from data that came with that previous alerting message, so we can't always match the same severity we used before, and we are getting unresolved alerts all the time.
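
A minimal sketch of that failure mode under the route-before-group behavior discussed above (plain Python; the payload fields and values are made up for illustration): the "OK" event cannot carry the original severity, so it matches a different route, lands in a new alert group, and the original group never resolves.

# Hypothetical payloads from a custom event source.
events = [
    {"source": "github", "check": "deploy", "severity": "critical", "state": "alerting"},
    # At resolve time the original severity is no longer known, so a default is sent.
    {"source": "github", "check": "deploy", "severity": "warning",  "state": "ok"},
]

groups = {}
for p in events:
    route = p["severity"]                  # routes keyed on severity
    key = p["source"] + "-" + p["check"]   # grouping key
    g = groups.setdefault((route, key), {"alerts": 0, "resolved": False})
    g["alerts"] += 1
    if p["state"] == "ok":
        g["resolved"] = True

print(groups)
# -> ('critical', 'github-deploy') stays firing forever,
#    ('warning',  'github-deploy') is created already resolved.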

The current behaviour is unintuitive and, AFAIK, undocumented apart from this GH issue. Please fix this ASAP, or we'll have to force severity to a fixed value and wake up engineers for trivial problems, which will generate bad feedback about Grafana OnCall as a system.