prometheus / alertmanager

Reload of Alertmanager resends notifications for all existing alerts #627

Closed natalia-k closed 7 years ago

natalia-k commented 7 years ago

Hi,

I've found that reloading Alertmanager (curl -XPOST http://localhost:9093/-/reload) without making any changes to alertmanager.yml resends notifications for all existing alerts. I'm running version 0.5.1.
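
(For reference, the same reload can also be triggered by sending SIGHUP to the Alertmanager process; both methods are equivalent. A minimal sketch, assuming the same binary name as in the journal output below:)

# reload via the HTTP lifecycle endpoint
curl -XPOST http://localhost:9093/-/reload
# or reload via a signal to the running process
kill -HUP "$(pidof alertmanager)"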

For example: journalctl -u alertmanager.service | grep -e " alert=NOC_MemoryUsage_per_container[bbd0ed6]" -e "Loading configuration file"

Feb 16 00:47:09 mon1 alertmanager[4452]: time="2017-02-16T00:47:09-05:00" level=info msg="Loading configuration file" file="/etc/prometheus/alertmanager.yml" source="main.go:195"
Feb 16 00:47:37 mon1 alertmanager[4452]: time="2017-02-16T00:47:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:48:37 mon1 alertmanager[4452]: time="2017-02-16T00:48:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:49:37 mon1 alertmanager[4452]: time="2017-02-16T00:49:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:50:37 mon1 alertmanager[4452]: time="2017-02-16T00:50:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:51:37 mon1 alertmanager[4452]: time="2017-02-16T00:51:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:52:37 mon1 alertmanager[4452]: time="2017-02-16T00:52:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:53:37 mon1 alertmanager[4452]: time="2017-02-16T00:53:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:54:37 mon1 alertmanager[4452]: time="2017-02-16T00:54:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:55:31 mon1 alertmanager[4452]: time="2017-02-16T00:55:31-05:00" level=info msg="Loading configuration file" file="/etc/prometheus/alertmanager.yml" source="main.go:195"
Feb 16 00:55:31 mon1 alertmanager[4452]: time="2017-02-16T00:55:31-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:55:37 mon1 alertmanager[4452]: time="2017-02-16T00:55:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:56:37 mon1 alertmanager[4452]: time="2017-02-16T00:56:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:57:37 mon1 alertmanager[4452]: time="2017-02-16T00:57:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:58:37 mon1 alertmanager[4452]: time="2017-02-16T00:58:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 00:59:37 mon1 alertmanager[4452]: time="2017-02-16T00:59:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:00:37 mon1 alertmanager[4452]: time="2017-02-16T01:00:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:01:37 mon1 alertmanager[4452]: time="2017-02-16T01:01:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:02:37 mon1 alertmanager[4452]: time="2017-02-16T01:02:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:03:09 mon1 alertmanager[4452]: time="2017-02-16T01:03:09-05:00" level=info msg="Loading configuration file" file="/etc/prometheus/alertmanager.yml" source="main.go:195"
Feb 16 01:03:09 mon1 alertmanager[4452]: time="2017-02-16T01:03:09-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:03:37 mon1 alertmanager[4452]: time="2017-02-16T01:03:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:04:37 mon1 alertmanager[4452]: time="2017-02-16T01:04:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:05:37 mon1 alertmanager[4452]: time="2017-02-16T01:05:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:06:37 mon1 alertmanager[4452]: time="2017-02-16T01:06:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"
Feb 16 01:07:37 mon1 alertmanager[4452]: time="2017-02-16T01:07:37-05:00" level=debug msg="Received alert" alert=NOC_MemoryUsage_per_container[bbd0ed6][active] component=dispatcher source="dispatch.go:168"

The alert rule in Prometheus:

NOC_MemoryUsage_per_container (10 active)
ALERT NOC_MemoryUsage_per_container
  IF (container_memory_usage_bytes{container_name=~".+"} / (container_spec_memory_limit_bytes{container_name=~".+"} > 0)) * 100 > 80
  FOR 1m
  LABELS {severity="minor"}
  ANNOTATIONS {description="Container {{ $labels.container_name }} memory use is over 80% in {{ $labels.datacenter }} ({{ $labels.k8scluster }}) (current value: {{ $value }})", summary="Container {{ $labels.container_name }} memory use is over 80% in {{ $labels.datacenter }} ({{ $labels.k8scluster }})"}
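
(For reference, the same rule in the YAML rule-file format used by Prometheus 2.x would look roughly like this; the group name is arbitrary:)

groups:
  - name: noc
    rules:
      - alert: NOC_MemoryUsage_per_container
        expr: (container_memory_usage_bytes{container_name=~".+"} / (container_spec_memory_limit_bytes{container_name=~".+"} > 0)) * 100 > 80
        for: 1m
        labels:
          severity: minor
        annotations:
          summary: "Container {{ $labels.container_name }} memory use is over 80% in {{ $labels.datacenter }} ({{ $labels.k8scluster }})"
          description: "Container {{ $labels.container_name }} memory use is over 80% in {{ $labels.datacenter }} ({{ $labels.k8scluster }}) (current value: {{ $value }})"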

I also see these errors in the log; could they be the cause?

Feb 16 01:56:37 mon1 alertmanager[4452]: time="2017-02-16T01:56:37-05:00" level=debug msg="Notify attempt 1 failed: unexpected status code 404" source="notify.go:546"
Feb 16 01:56:37 mon1 alertmanager[4452]: time="2017-02-16T01:56:37-05:00" level=error msg="Error on notify: Cancelling notify retry due to unrecoverable error: unexpected status code 404" source="no
Feb 16 01:56:37 mon1 alertmanager[4452]: time="2017-02-16T01:56:37-05:00" level=error msg="Notify for 12 alerts failed: Cancelling notify retry due to unrecoverable error: unexpected status code 404
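
(Side note: the 404 here means the receiver endpoint itself rejected the notification request, which is why Alertmanager cancels the retry. A quick manual check, with <webhook-url> standing in for whatever is configured under webhook_configs in alertmanager.yml:)

curl -i -XPOST -H "Content-Type: application/json" -d '{}' <webhook-url>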

Could you help me fix this? Thanks! Natalia

brian-brazil commented 7 years ago

Do you have the log lines showing the notifications being sent?

natalia-k commented 7 years ago

No, I can't find them in the log, but the notifications were resent to the webhook on every reload.

rtreffer commented 7 years ago

This seems to happen for us, too. I've added several predictive MySQL capacity alerts with very long timing settings:

ALERT ... FOR 1h
repeat_interval: 108h
group_wait: 5m
group_interval: 12h
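
(For context, repeat_interval, group_wait and group_interval are route settings in alertmanager.yml, while the FOR clause belongs to the alerting rule itself; roughly, with a hypothetical receiver name:)

route:
  receiver: mysql-capacity   # hypothetical receiver name
  group_wait: 5m
  group_interval: 12h
  repeat_interval: 108h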

I still get quite a few complaints about 'alert storms': all alerts firing in the same second. It looks like we are hitting the issue described here.

From the log:

2017-04-04_07:23:33.61307 time="2017-04-04T07:23:33Z" level=info msg="Loading configuration file" file="/srv/prometheus/alertmanager/alertmanager.yml" source="main.go:200"
2017-04-04_07:23:33.71081 time="2017-04-04T07:23:33Z" level=debug msg="Received alert" alert=MySQLTableSizeWarning[ace3d54][active] component=dispatcher source="dispatch.go:187"
2017-04-04_07:23:33.71119 time="2017-04-04T07:23:33Z" level=debug msg="Received alert" alert=MySQLTableSizeWarning[35c738e][active] component=dispatcher source="dispatch.go:187"
2017-04-04_07:23:33.71135 time="2017-04-04T07:23:33Z" level=debug msg="flushing [MySQLTableSizeWarning[ace3d54][active]]" aggrGroup=c372a46868630640 source="dispatch.go:428"
2017-04-04_07:23:33.71143 time="2017-04-04T07:23:33Z" level=debug msg="flushing [MySQLTableSizeWarning[35c738e][active]]" aggrGroup=8197482606ef9270 source="dispatch.go:428"

The flush happens less than a second after the config reload.

stuartnelson3 commented 7 years ago

The issue seems to be here: https://github.com/prometheus/alertmanager/blob/master/cmd/alertmanager/main.go#L227-L240

Reloading tears down and completely recreates the dispatcher.

brancz commented 7 years ago

@stuartnelson3 that shouldn't be an issue as the notification log should still be populated from previous notifications regardless of the dispatcher being recreated. Maybe loading the notification log from disk races with the notification queue accepting ingestion though.

fabxc commented 7 years ago

This should not be racy. The disk snapshot is fully loaded before the constructor of the notification log returns. That only happens on startup.

Reloading of Alertmanager only rebuilds the pipelines according to the new configuration.

I added a basic test in #716, which works as expected. We need some more information to reproduce this condition.

mxinden commented 7 years ago

I will close this as there has been no further progress. @natalia-k feel free to reopen with more information to reproduce the issue, as suggested by @fabxc. Thanks for the bug report!

bharathpantala commented 4 years ago

Can anyone please help with remediation steps for when we get an alertmanagerconfigreloadfailed alert?
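
(A reasonable first step, assuming Alertmanager runs under systemd as in the logs above and that amtool is installed, is to validate the configuration file and then retry the reload; adjust the paths for your setup:)

# check the configuration file for syntax errors
amtool check-config /etc/prometheus/alertmanager.yml
# see why the last reload failed
journalctl -u alertmanager.service | grep -i "configuration"
# retry the reload once the file is fixed
curl -XPOST http://localhost:9093/-/reload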

szediktam commented 4 years ago

I've hit this issue as well.

I set repeat_interval to 2000h. After reloading Alertmanager, it resends the alert immediately instead of waiting for the repeat interval, though not every time. This is Alertmanager 0.17.

@fabxc @mxinden @brancz any idea?

aclowkey commented 4 years ago

Same here. Any help?