prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Feature request: graceful configuration reload #2146

Open goran-rumin opened 4 years ago

goran-rumin commented 4 years ago

What did you do? I have two Alertmanagers running in HA mode with the Opsgenie integration. Sometimes the configuration is updated and reloaded automatically on both instances at around the same time.

What did you expect to see? No alert notifications to integrations are dropped.

What did you see instead? Under which circumstances? Sometimes an alert would remain open in Opsgenie even though it was resolved in Prometheus. After some investigation I traced the issue to config reloads. If one AM reloads its configuration slightly before the other, there is no problem (which is what happens most of the time). But sometimes, when the reloads on both instances align just right, some notifications are dropped because the top-level context is canceled on both of them. (Logs attached) I worked around this by scheduling the config reloads at different times, so that at least one AM is "active" at any moment, but I was wondering whether some kind of graceful shutdown of integrations would be a good idea.
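The workaround described above (reloading the instances at different moments) could be scripted roughly as follows. This is a minimal sketch, not something posted in the thread: it assumes each instance accepts an HTTP POST to /-/reload, and the function names, the 30-second gap, and the injectable `post` and `sleep` parameters are all hypothetical choices made for illustration.

```python
import time
import urllib.request


def post_reload(base_url, timeout=10.0):
    """Send POST <base_url>/-/reload and return the HTTP status code."""
    url = base_url.rstrip("/") + "/-/reload"
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def reload_staggered(base_urls, gap_seconds=30.0, post=post_reload, sleep=time.sleep):
    """Reload instances one at a time, pausing between them.

    The pause is meant to keep at least one instance outside its reload
    window at any moment. gap_seconds is an assumed value, not a
    recommendation from the thread.
    """
    statuses = []
    for index, base_url in enumerate(base_urls):
        if index > 0:
            sleep(gap_seconds)  # wait before touching the next instance
        statuses.append(post(base_url))
    return statuses
```

It would be called with the base URLs of the instances, for example reload_staggered(["http://am-0:9093", "http://am-1:9093"]), where both hostnames are placeholders. Passing `post` and `sleep` as parameters keeps the ordering logic testable without a running cluster.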

Environment

simonpasquier commented 4 years ago

Can you confirm that you're reloading Alertmanager and not restarting it? Also, would you mind sharing the full logs? I did a quick test (firing alert > Alertmanagers reload > resolved alert) and it seems that even in the case of simultaneous reloads, Alertmanager sends the resolved notification.

goran-rumin commented 4 years ago

I can confirm that I am reloading (with the /-/reload endpoint), but it seems I jumped to a conclusion too early. I created a minimal 2-node cluster that resembles our setup, with a mocked Opsgenie endpoint, and tried to reproduce the issue without success. The reason I thought this was an issue in the first place was the alertmanager_notifications_failed_total metric together with the debug logs. I would kindly suggest documenting that retry logic exists and that the metric doesn't track the final failure of a notification.
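The point about the metric can be illustrated with a toy model. The sketch below is not Alertmanager's code and makes no claim about how any particular version implements retries; it only assumes, in line with the observation above, a counter that is incremented on every failed attempt. All names are hypothetical, and backoff between attempts is omitted for brevity.

```python
def notify_with_retry(send, max_attempts=5):
    """Call send() until it succeeds or the attempts run out.

    Returns (delivered, attempts, failed_attempts). failed_attempts stands in
    for a per-attempt failure counter: it grows on every failed try, even
    when a later retry delivers the notification.
    """
    failed_attempts = 0
    for attempt in range(1, max_attempts + 1):
        try:
            send()
        except Exception:
            failed_attempts += 1  # counted per attempt, not per notification
            continue
        return True, attempt, failed_attempts
    return False, max_attempts, failed_attempts


def flaky_sender(failures_before_success):
    """Build a send() that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("simulated failure, e.g. a canceled context")

    return send


# Two failed attempts, then success: delivered, yet the counter reads 2.
print(notify_with_retry(flaky_sender(2)))  # (True, 3, 2)
```

Under this model a non-zero failure count says only that some attempts failed; whether the notification was ultimately lost is a separate question, which is what made the metric misleading here.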

RichardWarburton commented 1 year ago

This also seems related to issues #3407 (which has a lot more detail and proposes a fix), #3410, #2492 and #3037.