Feature request: graceful configuration reload

goran-rumin commented 4 years ago

What did you do? I have 2 Alertmanagers installed in HA and am using Opsgenie integration. Sometimes the configuration is updated and reloaded automatically on both instances at around same time.

What did you expect to see? No alerts notifications towards integrations are dropped.

What did you see instead? Under which circumstances? Sometimes the alert would remain open on Opsgenie, even though it's resolved on Prometheus. After some investigation I pinpointed the issue on config reload. If one AM reloads configuration slightly before the other, no problems (which happens most of the time). But sometimes, when the configuration reloads align just right, some notifications are dropped due to top level context being canceled on both of them. (Logs attached) I fixed this for me by setting up different moments in time for config reload, so that at least one AM is "active" at a time, but was wondering if some kind of graceful shutdown of integrations wound be a good idea.

Environment

Alertmanager version:

alertmanager, version 0.17.0 (branch: HEAD, revision: c7551cd75c414dc81df027f691e2eb21d4fd85b2)
build user:       root@932a86a52b76
build date:       20190503-09:10:07
go version:       go1.12.4

Logs:

level=debug ts=2019-12-17T08:05:00.481869442Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"
level=debug ts=2019-12-17T08:05:01.504148372Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"
level=debug ts=2019-12-17T08:05:01.650700099Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"

simonpasquier commented 4 years ago

Can you confirm that you're reloading Alertmanager and not restarting it? Also would you mind sharing the full logs? I did a quick test (firing alert > Alertmanagers reload > resolved alert) and it seems that even in case of simultaneous reloads, Alertmanager sends the resolved notification.

goran-rumin commented 4 years ago

I can confirm that I am reloading (with /-/reload endpoint) but it seems that I jumped to conclusion too early. I created a minimal 2 node cluster which resembles our setup, with mocked Opsgenie endpoint, and tried to reproduce the issue without success. Reason why I thought this was an issue in the first place was alertmanager_notifications_failed_total metric alongside with debug logs. I would kindly suggest to document that retry logic exists and that metric doesn't track final failure of notification.

RichardWarburton commented 1 year ago

This also seems related to issues #3407 (which has a lot more detail and proposes a fix), #3410 , #2492 and #3037 .

prometheus / alertmanager

Feature request: graceful configuration reload #2146