CatalinBrz98 opened this issue 7 months ago (status: Open)
Other details that may be useful:
I use the following two different payload formats:
```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "service_id": "151",
    "cluster_id": "3",
    "title": "TestAlert: The whole system is down",
    "priority": 1,
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/e/ee/Grumpy_Cat_by_Gage_Skidmore.jpg",
    "status": "alerting",
    "link_to_upstream_details": "https://en.wikipedia.org/wiki/Downtime",
    "message": "This alert was sent by user for demonstration purposes\nSmth happened. Oh no!"
  }
}
```
```json
{
  "alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
  "data": {
    "grouping_key": "08d6891a-835c-e661-39fa-96b6a9e26552",
    "message": "The problem has been solved c:",
    "remediation": true,
    "status": "OK"
  }
}
```
I use the following grouping template:

```
{{ payload.get("alert_uid") }}
```
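For illustration, here is a plain-Python equivalent of that Jinja grouping template (this is not OnCall's internal code; `payload` stands for the incoming alert body):

```python
def grouping_key(payload):
    # {{ payload.get("alert_uid") }} just calls dict.get on the payload,
    # so both payload formats above resolve to the same grouping key.
    return payload.get("alert_uid")

firing = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
          "data": {"status": "alerting"}}
resolved = {"alert_uid": "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
            "data": {"status": "OK", "remediation": True}}

# Identical keys, so in principle these two alerts are candidates for grouping.
assert grouping_key(firing) == grouping_key(resolved)
```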
The alerts are sent via POST, either from a Python script or from a manual cURL request.
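A minimal sketch of such a Python sender, using only the standard library; the webhook URL below is a placeholder, not a real OnCall endpoint:

```python
import json
import urllib.request

# Placeholder: substitute your integration's actual webhook URL.
ONCALL_WEBHOOK = "https://oncall.example.com/integrations/v1/formatted_webhook/TOKEN/"

def build_alert(alert_uid, **data):
    """Assemble the payload shape shown above: alert_uid plus a data object."""
    return {"alert_uid": alert_uid, "data": data}

def send_alert(payload, url=ONCALL_WEBHOOK):
    """POST the payload as JSON, as the script or cURL call would."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

firing = build_alert(
    "e52c2ede-2232-4ea3-a4c5-21ef37e41057",
    grouping_key="08d6891a-835c-e661-39fa-96b6a9e26552",
    status="alerting",
    title="TestAlert: The whole system is down",
)
# send_alert(firing)  # uncomment once ONCALL_WEBHOOK points at a real integration
```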
Also, if I send the same alerts again, they get grouped together most of the time when they have the same payload, and sometimes they still group even when some values are partially changed. I don't understand when grouping does and doesn't happen.
I think I've found the issue. The two alerts were triggered on two different routes within the same endpoint (the default route and a remediation route). It seems that two alerts on different routes, with different escalation chains, can't be grouped together. This is a problem, though: cross-route grouping would permit much more complex behavior, such as updating the steps of an alert group dynamically while keeping a full history of the issue in one place. That isn't possible with the way things work right now.
One note: as you've probably already discovered, since you're using a Python script and cURL, the Send demo alert button ignores grouping; it exists primarily to quickly test the notification flow.
Routing is evaluated first to determine the escalation chain, and grouping is evaluated afterwards. Alert groups are 1:1 with an escalation chain, which is why alerts are never grouped across routes. Can you describe in more detail the use case in which you would need to group an alert after it has selected a different route?
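The order of evaluation described above can be sketched as follows. This is an illustration of the stated behavior, not OnCall's actual implementation; the routing rule (split on the `remediation` field) is a hypothetical example:

```python
def select_route(payload):
    # Hypothetical routing rule: remediation payloads take a separate route
    # with its own escalation chain.
    return "remediation" if payload.get("data", {}).get("remediation") else "default"

def group_identity(payload):
    # Routing happens first, so the effective group identity is
    # (route, grouping key): identical grouping keys on different routes
    # still produce distinct alert groups.
    return (select_route(payload), payload.get("alert_uid"))

firing = {"alert_uid": "abc", "data": {"status": "alerting"}}
resolved = {"alert_uid": "abc", "data": {"remediation": True, "status": "OK"}}

# Same grouping key, different routes, therefore different alert groups.
assert group_identity(firing) != group_identity(resolved)
```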
I can make the following alert flow:
During all of these steps, the same Grafana OnCall alert group holds all the relevant data for the current issue, providing a history of the issue and chaining the steps of the flow together. At every step the relevant users are notified about the alert, and future notifications are suppressed if the alert is resolved before the escalation chain reaches that point.
Grafana OnCall Version
v1.3.112
Product Area
Alert Flow & Configuration
Grafana OnCall Platform?
Docker
User's Browser?
Microsoft Edge
Anything else to add?
This problem doesn't always occur, but it almost always does when a second, different payload format is sent with the same alert_uid.