grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

Documentation feedback: /docs/sources/alerting/fundamentals/notifications/group-alert-notifications.md #92562

Open alphneo opened 1 month ago

alphneo commented 1 month ago

Hi,

I cannot fully grasp the exact behavior of group intervals and repeat intervals, and I think the documentation has room for improvement. I have the following questions and hope you can clarify them.

I believe that an alert instance entering a group remains in the group after the wait period is over, provided the alert keeps firing at each evaluation until a notification is sent. If that is correct, please consider adding a case that involves the evaluation interval: if a rule fires once and then does not fire again just before the group wait or group interval elapses, is it in the group or not during that group wait or group interval?

At 00:30, after the group wait elapsed for the frontend notification policy group, 2 alerts were notified. Then, during the 5-minute group interval, only 2 alerts fired for the same group. So how were 4 alerts notified after the group interval elapsed? Did you also count the first 2, which were sent during the group wait? I guess they should not be counted: for the backend notification policy group, at 05:50, after the group interval elapsed, nothing was sent, not even the 2 alerts triggered during the group wait. So after the group interval elapses, shouldn't it be 2 alerts?

Image

After the repeat interval is met, 4 alerts were counted for each of the two notification policy groups, and I don't see a clear explanation for this. Does an alert have to fire continuously for the entire repeat interval, without entering the Normal state even once? Or is it enough for an alert to fire twice, once before the repeat interval starts and once after it elapses, possibly returning to the Normal state in between? In that case it sounds less like a reminder and more like a new alert. And in another case, if a new alert appears right before the repeat interval elapses, is it counted?

Image

Please consider updating the documentation. It would save a lot of time, because it is hard to tell what is happening; without it, one has to spend time experimenting with notifications and flooding the inbox to learn the deterministic behavior.

Documentation source: https://grafana.com/docs/grafana/latest/alerting/fundamentals/notifications/group-alert-notifications/

Thank you

brendamuir commented 1 week ago

Hi @ppcano I think Alert Grouping was recently updated by you - can you please take a look at this issue? Thanks!

https://grafana.com/docs/grafana/latest/alerting/fundamentals/notifications/group-alert-notifications/#group-wait

ppcano commented 1 week ago

Hi @alphneo ,

Thank you for sharing your feedback. We know the behavior of the timers can be somewhat confusing (to put it lightly). We provided an example in the documentation to help clarify this, but as you pointed out, there's still room for improvement.

The best way to grasp how they function is to experiment with dummy alerts, view them in the Grafana Alerting UI, and receive their notifications. Also, for additional reference, these timers operate similarly to the Prometheus Alertmanager settings: group_interval, repeat_interval, and group_wait. You can read more about these here: https://prometheus.io/docs/alerting/latest/configuration/#route
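As a rough illustration of how the three timers interact, here is a minimal Python sketch. It is a simplified model, not Grafana's or Alertmanager's actual implementation. It assumes that a group is first flushed group_wait after it is created, is then flushed every group_interval, and that a flush sends a notification only if the group changed since the last notification or the repeat_interval has elapsed. The function name `notification_times` and the sample timings are hypothetical.

```python
from datetime import timedelta


def notification_times(change_times, group_wait, group_interval,
                       repeat_interval, horizon):
    """Return the times at which a single group sends a notification.

    All arguments are timedeltas measured from an arbitrary time zero.
    change_times lists when the group changes (an alert joins or resolves);
    the earliest one creates the group.
    """
    changes = sorted(change_times)
    sent = []
    last_sent = None
    # Assumption: the first flush happens group_wait after the group is created.
    flush = changes[0] + group_wait
    while flush <= horizon:
        # Changes that arrived since the previous notification.
        pending = [c for c in changes
                   if c <= flush and (last_sent is None or c > last_sent)]
        # A reminder is due once repeat_interval has passed since the last send.
        repeat_due = (last_sent is not None
                      and flush - last_sent >= repeat_interval)
        if pending or repeat_due:
            sent.append(flush)
            last_sent = flush
        # Assumption: later flushes happen every group_interval.
        flush += group_interval
    return sent


# Hypothetical scenario: two alerts fire early, two more fire a bit later.
times = notification_times(
    change_times=[timedelta(seconds=0), timedelta(seconds=10),
                  timedelta(minutes=1), timedelta(minutes=2)],
    group_wait=timedelta(seconds=30),
    group_interval=timedelta(minutes=5),
    repeat_interval=timedelta(hours=4),
    horizon=timedelta(hours=4, minutes=10),
)
for t in times:
    print(t)
```

Under these assumptions the group notifies at 00:30 (group wait), at 05:30 (group interval, because new alerts joined), and again at 4:05:30 (repeat interval, with no changes in between).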

Let me address your questions and comments. Please feel free to correct me if I misunderstood anything.

So, how were 4 alerts notified after the group interval elapsed? Have you also considered the first 2, which were sent during group wait?

Yes, the first 2 alerts are still part of the frontend group. Please note that the "Number of instances" column reflects the current number of alerts in the group at any particular "Time".

The first 2 alerts remained in the group because they are still firing. According to the documentation:

"An alert instance exits the group after being resolved and notified of its state change." (we should probably highlight this more)

I guess they don’t need to be, if the backend notification policy group is seen at 05:50?

Why not? The backend and frontend alerts belong to different groups, and these groups are entirely independent - they are not related to each other.

After the repeat interval is met, 4 alerts were considered for both notification policy groups. I don’t see a clear explanation for this.

This is due to the same behavior: "An alert instance exits the group after being resolved". In this case, the 8 alerts (4 frontend alerts and 4 backend alerts) have not yet been resolved, so they remain in their respective groups.
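The membership rule quoted above can be sketched as a toy model in Python. This is only one reading of the documentation sentence, not Grafana's code: the `AlertGroup` class, its methods, and the label strings are hypothetical. It assumes that an instance stays in its group while firing, and that a resolved instance leaves only after a notification has reported its resolution.

```python
class AlertGroup:
    """Toy model of one notification group, keyed by instance label set."""

    def __init__(self):
        # label set (as a string) -> "firing" or "resolved"
        self.instances = {}

    def fire(self, labels):
        self.instances[labels] = "firing"

    def resolve(self, labels):
        if labels in self.instances:
            self.instances[labels] = "resolved"

    def notify(self):
        """Send one notification; return how many instances it covered."""
        covered = len(self.instances)
        # Resolved instances leave only after their resolution was notified.
        self.instances = {k: v for k, v in self.instances.items()
                          if v == "firing"}
        return covered


frontend = AlertGroup()
frontend.fire("instance=a")
frontend.fire("instance=b")
first = frontend.notify()       # covers the 2 firing instances

frontend.fire("instance=c")
frontend.fire("instance=d")
second = frontend.notify()      # covers all 4: the first 2 never left

frontend.resolve("instance=a")
third = frontend.notify()       # still 4: the resolution is being reported
fourth = frontend.notify()      # 3: the resolved instance has now exited
print(first, second, third, fourth)
```

In this model the count at each notification reflects everything currently in the group, which is why alerts that were already notified once are counted again as long as they keep firing.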

I hope this clarifies your question. Please feel free to follow up for further questions or explanation.

For details on how alert evaluation works, see also Alert rule evaluation: the alert rule is evaluated continuously at its own interval and keeps generating the same alert instance (identified by its label set).