grafana / oncall

Developer-friendly incident response with brilliant Slack integration

Alerts not being grouped properly #4074

Open CatalinBrz98 opened 7 months ago

CatalinBrz98 commented 7 months ago

What went wrong?

What happened:

Alerts sent with different payload formats but the same alert_uid were not grouped into the same alert group.

What did you expect to happen:

Alerts whose grouping template renders to the same value to be grouped together, regardless of payload format.

How do we reproduce it?

  1. Have an integration that groups based on the grouping template "{{ payload.alert_uid }}"
  2. Send a few different alerts with different payload data, all having the same alert_uid but different formats (a reproduction script is sketched after this list)
  3. Send them up to a few minutes apart
  4. Check whether they are grouped together
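
For reference, a minimal reproduction sketch using Python's requests library; the integration URL is a placeholder for your own formatted-webhook endpoint, and the payloads are abbreviated versions of the ones posted later in this thread:

```python
import requests

# Placeholder: replace with your own OnCall integration's webhook URL.
ONCALL_URL = "https://oncall.example.com/integrations/v1/formatted_webhook/<token>/"

# Two alerts with different payload formats but the same alert_uid.
alerting = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"title": "TestAlert: The whole system is down", "status": "alerting"},
}
resolving = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"message": "The problem has been solved c:", "status": "OK"},
}

for payload in (alerting, resolving):
    # With the grouping template "{{ payload.alert_uid }}", both alerts
    # should land in the same alert group.
    response = requests.post(ONCALL_URL, json=payload)
    response.raise_for_status()
```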

Grafana OnCall Version

v1.3.112

Product Area

Alert Flow & Configuration

Grafana OnCall Platform?

Docker

User's Browser?

Microsoft Edge

Anything else to add?

The problem does not always occur, but it almost always happens when a second, differently formatted payload is sent with the same alert_uid.

CatalinBrz98 commented 7 months ago

Other details that may be useful: I use the following two different formats:

```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "service_id": "151",
    "cluster_id": "3",
    "title": "TestAlert: The whole system is down",
    "priority": 1,
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/e/ee/Grumpy_Cat_by_Gage_Skidmore.jpg",
    "status": "alerting",
    "link_to_upstream_details": "https://en.wikipedia.org/wiki/Downtime",
    "message": "This alert was sent by user for demonstration purposes\nSmth happened. Oh no!"
  }
}
```

```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "message": "The problem has been solved c:",
    "remediation": true,
    "status": "OK"
  }
}
```

I use the following grouping template: `{{ payload.get("alert_uid") }}`
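
As a sanity check, rendering that template locally with jinja2 against trimmed versions of both payloads produces the same grouping key, so the grouping template alone should not separate these alerts; a sketch:

```python
from jinja2 import Template

template = Template('{{ payload.get("alert_uid") }}')

alerting = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"status": "alerting"},
}
resolving = {
    "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    "data": {"status": "OK", "remediation": True},
}

# Both payloads render to the identical grouping key.
assert template.render(payload=alerting) == template.render(payload=resolving)
```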

The alerts are sent through a POST request from a Python script or a manual curl command.

Also, if I send the same alerts again, they usually get grouped together when the payloads are identical, and sometimes they still group together when some values are partially changed. I can't tell when grouping does and doesn't happen.

CatalinBrz98 commented 7 months ago

I think I've found the issue. The two alerts were triggered through two different routes inside the same integration endpoint (the default route and a remediation route). It seems that two alerts that match different routes, and therefore different escalation chains, can never be grouped together. This is a problem: cross-route grouping would allow much more complex behavior, such as dynamically updating the steps of an alert group while keeping the full history of the issue in a single place, but that is not possible with the way things work right now.
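
For illustration, a remediation routing template along these lines (hypothetical, not necessarily the exact one in use) would match only the second payload, splitting the two alerts across routes even though their grouping keys are identical:

```python
from jinja2 import Template

# Hypothetical routing template attached to the remediation route.
route_template = Template('{{ payload.get("data", {}).get("remediation", False) }}')

alerting = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057", "data": {"status": "alerting"}}
resolving = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057", "data": {"remediation": True, "status": "OK"}}

print(route_template.render(payload=alerting))   # "False" -> falls through to the default route
print(route_template.render(payload=resolving))  # "True"  -> matches the remediation route
```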

mderynck commented 7 months ago

One note: as you may have already discovered, since you are using a Python script and curl, the "Send demo alert" button ignores grouping because it is primarily meant for quickly testing the notification flow.

Routing is evaluated first to determine the escalation chain, and grouping is evaluated afterwards. Alert groups are 1:1 with an escalation chain, which is why alerts are never grouped across routes. Can you describe in more detail the use case in which you would need an alert to be grouped after it has matched a different route?
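
A conceptual sketch of that evaluation order (simplified pseudologic, not OnCall's actual implementation): the route, and with it the escalation chain, is fixed before the grouping key is ever compared, so identical grouping keys on different routes yield distinct alert groups.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Route:
    name: str

@dataclass
class AlertGroup:
    route: Route
    grouping_key: str
    alerts: list = field(default_factory=list)

# Alert groups are keyed by (route, grouping_key), mirroring the 1:1
# relationship between an alert group and its escalation chain.
groups: dict[tuple[Route, str], AlertGroup] = {}

def handle_alert(payload, routes, grouping_key_fn):
    # 1. Routing runs first: the first matching route fixes the escalation chain.
    route = next(r for r, matches in routes if matches(payload))
    # 2. Grouping runs after routing, scoped to the chosen route.
    key = grouping_key_fn(payload)
    group = groups.setdefault((route, key), AlertGroup(route, key))
    group.alerts.append(payload)
    return group

default, remediation = Route("default"), Route("remediation")
routes = [
    (remediation, lambda p: p.get("data", {}).get("remediation", False)),
    (default, lambda p: True),  # catch-all default route
]

g1 = handle_alert({"alert_uid": "abc", "data": {"status": "alerting"}}, routes, lambda p: p["alert_uid"])
g2 = handle_alert({"alert_uid": "abc", "data": {"remediation": True}}, routes, lambda p: p["alert_uid"])
assert g1 is not g2  # same grouping key, different routes -> different alert groups
```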

CatalinBrz98 commented 7 months ago

I can make the following alert flow:

During all of these steps, the same Grafana OnCall alert group holds all the relevant data for the current issue, providing a full history of the issue and chaining the steps of the flow together. At every step, the relevant users are notified about the alert, and further notifications are suppressed if the alert is resolved before the escalation chain reaches that point.