prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Feature request: Add group creation time to group_by hash #3959

Open ccope opened 4 weeks ago

ccope commented 4 weeks ago

What did you do?

What did you expect to see?

What did you see instead? Under which circumstances?

Environment

grobinson-grafana commented 4 weeks ago

Hi! :wave:

I think the main issue here is that Alertmanager cannot close incidents if all alerts in a group are silenced.

When silencing alerts for an active incident, make sure the incident is also closed in your IRM (Opsgenie). If you leave the incident open and Alertmanager later sends new alerts to the same incident, you may or may not get paged for them.

I also recommend checking your Opsgenie configuration, since it sounds like the incident might have been left open by mistake. This shouldn't happen, because you should be paged at regular intervals for active incidents until they are resolved.

To answer some of your questions:

> Add group creation time to group_by hash

This won't work, I'm afraid. Consider the case where the system clocks on two Alertmanager servers are out of sync by 1ns. Each server would record a different group creation time, so each would compute a different group key, creating duplicate incidents in your IRM.
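To make the clock-skew problem concrete, here is a minimal sketch. It is not Alertmanager's actual hashing code: `group_key`, the `"default"` route name, the labels, and the timestamps are all hypothetical, chosen only to show how a 1ns difference changes a hash that includes the creation time.

```python
import hashlib
from typing import Dict, Optional


def group_key(route: str, group_labels: Dict[str, str],
              created_ns: Optional[int] = None) -> str:
    """Hypothetical group key: a hash over the route and the sorted group labels.

    If created_ns is given, it is mixed into the hash as well, which is the
    behaviour proposed in the feature request (an assumption for illustration).
    """
    parts = [route] + [f"{k}={v}" for k, v in sorted(group_labels.items())]
    if created_ns is not None:
        parts.append(f"created={created_ns}")
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()


labels = {"alertname": "HighLatency", "cluster": "prod"}
t = 1_700_000_000_000_000_000  # made-up creation time in nanoseconds

# Two replicas whose clocks differ by 1ns compute different keys.
key_a = group_key("default", labels, created_ns=t)
key_b = group_key("default", labels, created_ns=t + 1)
keys_with_time_match = key_a == key_b

# Without the timestamp, both replicas agree on the key.
keys_without_time_match = group_key("default", labels) == group_key("default", labels)
```

In this sketch the two replicas only agree when the key depends on nothing but the route and labels, which is why a per-server timestamp would show up as two separate incidents downstream.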

> A new batch of alerts should not be grouped into an already resolved group

Given it had been a week since the last alert was resolved, and I assume there were no other active alerts in the group during that time, Alertmanager would have created a new group for these new alerts. However, group keys are deterministic, and if a group is "re-opened" it will re-use the same group key. This is intentional.
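The "re-opened group, same key" behaviour can be sketched with a toy model. This is an illustration under stated assumptions, not Alertmanager's dispatcher: `ToyDispatcher`, its methods, and the key format are hypothetical. The group is dropped once it has no alerts left, and a later alert creates a fresh group whose key is identical because it is derived only from the route and group labels.

```python
import hashlib
from typing import Dict, Set


class ToyDispatcher:
    """Toy model of alert grouping (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.groups: Dict[str, Set[str]] = {}  # group key -> active alert ids
        self.groups_created = 0

    @staticmethod
    def key(route: str, group_labels: Dict[str, str]) -> str:
        # Deterministic: depends only on the route and the sorted labels.
        raw = route + "\x00" + ",".join(
            f"{k}={v}" for k, v in sorted(group_labels.items()))
        return hashlib.sha256(raw.encode()).hexdigest()

    def fire(self, route: str, group_labels: Dict[str, str], alert_id: str) -> str:
        k = self.key(route, group_labels)
        if k not in self.groups:
            self.groups[k] = set()
            self.groups_created += 1  # a new group object is created
        self.groups[k].add(alert_id)
        return k

    def resolve(self, route: str, group_labels: Dict[str, str], alert_id: str) -> None:
        k = self.key(route, group_labels)
        alerts = self.groups.get(k, set())
        alerts.discard(alert_id)
        if not alerts:
            self.groups.pop(k, None)  # an empty group is removed


d = ToyDispatcher()
labels = {"alertname": "DiskFull", "cluster": "prod"}

first_key = d.fire("default", labels, "alert-1")
d.resolve("default", labels, "alert-1")          # group is now gone
group_gone = first_key not in d.groups

second_key = d.fire("default", labels, "alert-2")  # a week later, say
```

Here `groups_created` ends at 2 while `first_key == second_key`: the group is new, but any downstream system that deduplicates on the key sees the same identifier as before.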