grafana / oncall

Developer-friendly incident response with brilliant Slack integration

Alerts not being grouped properly #4074

Open CatalinBrz98 opened 7 months ago

CatalinBrz98 commented 7 months ago

What went wrong?

What happened:

Alerts sent with different payload formats but the same alert_uid were not grouped into the same alert group.

What did you expect to happen:

Alerts whose grouping template renders to the same value to be grouped together, regardless of payload format.

How do we reproduce it?

  1. Have an integration that groups based on the grouping template "{{ payload.alert_uid }}"
  2. Send a few different alerts with different payload data, all having the same alert_uid but different formats (a reproduction script is sketched after this list)
  3. Send them up to a few minutes apart
  4. Check whether they are grouped together
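
For reference, a minimal reproduction sketch using Python's requests library; the integration URL is a placeholder for your own formatted-webhook endpoint, and the payloads are abbreviated versions of the ones posted later in this thread:

```python
import requests

# Placeholder: replace with your own OnCall integration's webhook URL.
ONCALL_URL = "https://oncall.example.com/integrations/v1/formatted_webhook/<token>/"

# Two alerts with different payload formats but the same alert_uid.
alerting = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"title": "TestAlert: The whole system is down", "status": "alerting"},
}
resolving = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"message": "The problem has been solved c:", "status": "OK"},
}

for payload in (alerting, resolving):
    # With the grouping template "{{ payload.alert_uid }}", both alerts
    # should land in the same alert group.
    response = requests.post(ONCALL_URL, json=payload)
    response.raise_for_status()
```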

Grafana OnCall Version

v1.3.112

Product Area

Alert Flow & Configuration

Grafana OnCall Platform?

Docker

User's Browser?

Microsoft Edge

Anything else to add?

The problem does not always occur, but it almost always happens when a second, differently formatted payload is sent with the same alert_uid.

CatalinBrz98 commented 7 months ago

Other details that may be useful: I use the following two different formats:

```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "service_id": "151",
    "cluster_id": "3",
    "title": "TestAlert: The whole system is down",
    "priority": 1,
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/e/ee/Grumpy_Cat_by_Gage_Skidmore.jpg",
    "status": "alerting",
    "link_to_upstream_details": "https://en.wikipedia.org/wiki/Downtime",
    "message": "This alert was sent by user for demonstration purposes\nSmth happened. Oh no!"
  }
}
```

```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "message": "The problem has been solved c:",
    "remediation": true,
    "status": "OK"
  }
}
```

I use the following grouping template: `{{ payload.get("alert_uid") }}`
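
As a sanity check, rendering that template locally with jinja2 against trimmed versions of both payloads produces the same grouping key, so the grouping template alone should not separate these alerts; a sketch:

```python
from jinja2 import Template

template = Template('{{ payload.get("alert_uid") }}')

alerting = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"status": "alerting"},
}
resolving = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"status": "OK", "remediation": True},
}

# Both payloads render to the identical grouping key.
assert template.render(payload=alerting) == template.render(payload=resolving)
```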

The alerts are sent through a POST request from a Python script or a manual curl command.

Also, if I send the same alerts again, they usually get grouped together when the payloads are identical, and sometimes they still group together when some values are partially changed. I can't tell when grouping does and doesn't happen.

CatalinBrz98 commented 7 months ago

I think I've found the issue. The two alerts were triggered through two different routes inside the same integration endpoint (the default route and a remediation route). It seems that two alerts that match different routes, and therefore different escalation chains, can never be grouped together. This is a problem: cross-route grouping would allow much more complex behavior, such as dynamically updating the steps of an alert group while keeping the full history of the issue in a single place, but that is not possible with the way things work right now.
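
For illustration, a remediation routing template along these lines (hypothetical, not necessarily the exact one in use) would match only the second payload, splitting the two alerts across routes even though their grouping keys are identical:

```python
from jinja2 import Template

# Hypothetical routing template attached to the remediation route.
route_template = Template('{{ payload.get("data", {}).get("remediation", False) }}')

alerting = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057", "data": {"status": "alerting"}}
resolving = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057", "data": {"remediation": True, "status": "OK"}}

print(route_template.render(payload=alerting))   # "False" -> falls through to the default route
print(route_template.render(payload=resolving))  # "True"  -> matches the remediation route
```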

mderynck commented 7 months ago

One note: as you may have already discovered, since you are using a Python script and curl, the "Send demo alert" button ignores grouping because it is primarily meant for quickly testing the notification flow.

Routing is evaluated first to determine the escalation chain, and grouping is evaluated afterwards. Alert groups are 1:1 with an escalation chain, which is why alerts are never grouped across routes. Can you describe in more detail the use case in which you would need an alert to be grouped after it has matched a different route?
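
A conceptual sketch of that evaluation order (simplified pseudologic, not OnCall's actual implementation): the route, and with it the escalation chain, is fixed before the grouping key is ever compared, so identical grouping keys on different routes yield distinct alert groups.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Route:
    name: str

@dataclass
class AlertGroup:
    route: Route
    grouping_key: str
    alerts: list = field(default_factory=list)

# Alert groups are keyed by (route, grouping_key), mirroring the 1:1
# relationship between an alert group and its escalation chain.
groups: dict[tuple[Route, str], AlertGroup] = {}

def handle_alert(payload, routes, grouping_key_fn):
    # 1. Routing runs first: the first matching route fixes the escalation chain.
    route = next(r for r, matches in routes if matches(payload))
    # 2. Grouping runs after routing, scoped to the chosen route.
    key = grouping_key_fn(payload)
    group = groups.setdefault((route, key), AlertGroup(route, key))
    group.alerts.append(payload)
    return group

default, remediation = Route("default"), Route("remediation")
routes = [
    (remediation, lambda p: p.get("data", {}).get("remediation", False)),
    (default, lambda p: True),  # catch-all default route
]

g1 = handle_alert({"alert_uid": "abc", "data": {"status": "alerting"}}, routes, lambda p: p["alert_uid"])
g2 = handle_alert({"alert_uid": "abc", "data": {"remediation": True}}, routes, lambda p: p["alert_uid"])
assert g1 is not g2  # same grouping key, different routes -> different alert groups
```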

CatalinBrz98 commented 7 months ago

I can make the following alert flow:

During all of these steps, the same Grafana OnCall alert group holds all the relevant data for the current issue, providing a full history of the issue and chaining the steps of the flow together. At every step, the relevant users are notified about the alert, and further notifications are suppressed if the alert is resolved before the escalation chain reaches that point.