artemlive opened this issue 7 years ago
I forgot to specify the software versions:
~]# prometheus -version
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
build user: root@ca1b452514b7
build date: 20170419-13:26:24
go version: go1.8.1
~]# alertmanager -version
alertmanager, version 0.8.0 (branch: HEAD, revision: 74e7e48d24bddd2e2a80c7840af9b2de271cc74c)
build user: root@439065dc2905
build date: 20170720-14:14:06
go version: go1.8.3
@artemlive First of all thank you very much for this very detailed bug report!
Before digging further into this, two questions: what is the repeat_interval in your Alertmanager configuration?
Hello!
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
Everything began at Thu Aug 17 20:09:29 UTC 2017 (date -d @1503000569.746) and ended at Thu Aug 17 20:48:29 UTC 2017 (date -d @1503002909.746).
Many thanks for your response!
This is quite confusing and sounds like a bug. Are you able to reproduce this with a minimal config? It is difficult for me to reconstruct your exact environment for debugging.
I'll try to reproduce it today or tomorrow. Thanks!
I have this problem, too. My Alertmanager version is 0.5.1 and my config is:
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  hipchat_auth_token: '1234556789'
  hipchat_url: 'https://hipchat.foobar.org/'
templates:
- '/etc/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname','cluster','service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1m
  receiver: webhook
  routes:
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: webhook
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname','cluster','service']
receivers:
- name: 'webhook'
  webhook_configs:
  - url: http://alert-collector:17130/alert/v1.0/alerts
We are currently on v0.11.0. I recommend you update.
We have upgraded to v0.11.0 but still have this problem. @stuartnelson3
This problem is reproduced if you create several alerts with the same name. I added another debug receiver in Alertmanager (https://github.com/prometheus/alertmanager/blob/master/examples/webhook/echo.go) and it seems I have figured out what the problem is. In my case it looked like this:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "TCP_CHECK", "instance": "127.0.0.1:9115", "job": "tcp_service_rmi", "severity": "sms" }, "annotations": { "description": "RMI port 8080 on myserver is unreachable", "summary": "RMI port 8080 on myserver is unreachable " }, "startsAt": "2018-01-12T13:10:37.803Z", "endsAt": "2018-01-12T13:42:37.804Z", "generatorURL": "http://prometheus:9090/graph?g0.expr=probe_myserver2_8080_success%7Bjob%3D%22tcp_service_rmi%22%7D+%21%3D+1\u0026g0.tab=0" } ],
The generatorURL data comes from another service with the same alert name. I gave each alert a unique name, and now the problem no longer reproduces.
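For reference, a minimal sketch of how such a debug receiver could be wired in, assuming the linked echo.go example is running locally (the URL and the catch-all route are illustrative, not the poster's actual config):

route:
  routes:
  # Catch-all child route placed first: copy every alert to the debug
  # receiver, then keep evaluating the remaining routes as usual.
  - receiver: 'debug'
    continue: true

receivers:
- name: 'debug'
  webhook_configs:
  # Hypothetical address; point this at wherever the echo.go receiver listens.
  - url: 'http://127.0.0.1:5001/'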
Some kind of magic... Here is an alert defined in Prometheus:
- alert: SS_CONFIGURATIONS_ADS
  expr: probe_ss_config{job=~".*_ui_health"} != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    description: SS_CONFIGURATIONS service {{ $labels.name }} on {{ $labels.network }}) on server {{ $labels.server }}
    summary: configs {{ $labels.name }}{{ $labels.network }}
There is a metric named probe_ss_config. Here is the result of a query that sums all checks over 2h; it equals 0, which means the service was down (every check over the 2h returned 0). Vector of data for 2h (all checks equal to 0): http://joxi.net/n2YB707ijzlYEA, and the sum of all elements is equal to 0: http://joxi.net/8AnWDNDTqJpqor. But sometimes Prometheus sends a "resolved" status for the alert:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "annotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "startsAt": "2018-01-23T14:32:05.524923261Z", "endsAt": "2018-01-23T14:33:05.524853742Z", "generatorURL": "http://prometheus-high-cpu:9090/graph?g0.expr=probe_ss_config%7Bjob%3D~%22.%2A_ui_health%22%7D+%21%3D+1\u0026g0.tab=1" } ], "groupLabels": { "alertname": "SS_CONFIGURATIONS_ADS" }, "commonLabels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "commonAnnotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "externalURL": "http://prometheus.myserver.com:9095", "version": "4", "groupKey": "{}/{severity=~\"^(?:critical|sms)$\"}:{alertname=\"SS_CONFIGURATIONS_ADS\"}" }
But probe_ss_config is equal to 0, as I showed in the screenshot below.
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
alertmanager, version 0.13.0 (branch: HEAD, revision: fb713f6d8239b57c646cae30f78e8b4b8861a1aa)
build user: root@d83981af1d3d
build date: 20180112-10:32:46
go version: go1.9.2
@artemlive do you still see the issue with the latest AlertManager release?
@simonpasquier I can confirm this is not resolved with the latest release.
@justinjk007 you would have to open another issue to provide your specific details.
@artemlive I've read your description again and it doesn't seem related to AlertManager, since it acts properly on the information received from Prometheus. If anything, it is a problem with the alerting engine, though I'm suspicious that Prometheus computes the alert's status wrongly (it might be that you've got a failed scrape which resets the alert). To alleviate flapping, you could use an aggregation-over-time function like this:
min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1
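For illustration, a sketch of how that expression might be dropped into the SS_CONFIGURATIONS_ADS rule shown earlier (the for duration and labels are carried over from that example; the 5m window is an assumption to adjust to the scrape interval):

- alert: SS_CONFIGURATIONS_ADS
  # min_over_time returns the lowest value seen in the last 5 minutes, so a
  # single missed or briefly successful scrape does not flip the expression
  # back to healthy and resolve the alert right away.
  expr: min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1
  for: 5m
  labels:
    severity: critical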
@simonpasquier opened #1546
I, too, have what I think may be the same issue. Prometheus continues to report the alert as active to AlertManager; however, AlertManager reports the alert as resolved and then re-creates it minutes later.
The Alertmanager logs at debug level (ts and aggrGroup removed to reduce noise):
Feb 25 19:55:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:57:29 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:57 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:03:26 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
The alert state goes from active to resolved and back again while Prometheus itself continues to report the alert as active. There is a corresponding gap in the 'Received alert' messages from Prometheus, caused by the Prometheus configuration being reloaded, which happens frequently in our environment.
The gap in messages is less than the resolve_timeout.
AlertManager Details:
GoVersion: go1.10.3
Revision: 462c969d85cf1a473587754d55e4a3c4a2abc63c
Version: 0.15.0
Config Summary:
global:
  resolve_timeout: 10m
  http_config: {}
route:
  receiver: ops_team
  group_by:
  - alertname
  - site_code
  - cluster
  - instance
  - instance_id
  - resource_id
  - backend
  - service
  - interface_id
  routes:
  - receiver: ops
    match:
      team: ops_team
    continue: true
  group_wait: 0s
  group_interval: 3m
  repeat_interval: 2h
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
Further investigation shows that, for us, the issue was introduced in Prometheus 2.4 (#4550).
Prometheus now sends the endsAt field to Alertmanager with a very short timeout, after which Alertmanager marks the alert as resolved. This overrides the resolve_timeout setting in Alertmanager and creates the firing->resolved->firing behavior if Prometheus does not resend the alert before that short timeout.
We will try increasing the 'resend_delay' on Prometheus to 2m and see if there are any side effects.
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
@steve-exley Yes, please try anything above and including 0.15.3 as those include https://github.com/prometheus/alertmanager/pull/1611.
We still had some of these issues on 0.16.1. The underlying cause is that the endsAt time period generated by Prometheus does not leave much margin for error, and AlertManager resolves the alert if it doesn't get the next update from Prometheus in time.
I think changing the Prometheus 'rules.alert.resend-delay' to 2m will resolve this in our scenario, because the endsAt value is calculated from it. It's been less than 24 hours since we made the change, so it's too early to be sure.
I also raised a proposal on Prometheus (#5277) to discuss whether the endsAt time period is too short.
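For reference, rules.alert.resend-delay is a Prometheus command-line flag rather than a prometheus.yml setting, so the change described above might look roughly like this (the config path is illustrative):

prometheus --config.file=/etc/prometheus/prometheus.yml \
  --rules.alert.resend-delay=2m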
The generatorURL data comes from another service with the same alert name. I gave each alert a unique name, and now the problem no longer reproduces.
I can confirm the same behavior on:
prometheus, version 2.3.2+ds (branch: debian/sid, revision: 2.3.2+ds-1~bpo9+1)
alertmanager, version 0.15.3+ds (branch: debian/sid, revision: 0.15.3+ds-2~bpo9+1)
Unique alert names fix the issue.
+1
mark
Same problem:
alertmanager version: v0.18.0
prometheus version: v2.17.1
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[67399ab][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[1aa03ad][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[cadfe5a][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[61b70c9][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][active]
level=debug ts=2020-07-29T09:11:33.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[18f376c][active]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[fbd7908][resolved]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][resolved]
level=debug ts=2020-07-29T09:11:43.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[faf133d][active]
level=debug ts=2020-07-29T09:11:43.322Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9d7d714][resolved]
Hello! I have been struggling with one problem since I started using Prometheus. I use blackbox_exporter to monitor the availability of a domain over HTTP, and Alertmanager to send messages to Slack, Telegram and SMS. Sometimes the domain is unavailable over HTTP and Prometheus correctly sends notifications that the server is unavailable, but at some point a message with the status "resolved" arrives while the server is still unavailable, and some time later a message with the status "firing" arrives again. The configuration is given below. Prometheus (part of config):
blackbox.yml
alertmanager.yml
alert.rules
Part of the Alertmanager journal (alert HTTPProblemOnUI[4f4a71c][active]):
Here it went into the "resolved" state (but the query shows that the check at this point still returns probe_success 0; I'll show it below):
Then it became "active" again:
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
Everything began at:
and ended at:
Here is the data that Prometheus sent to Alertmanager during this interval:
Then, for some reason, an alert with "endsAt" set arrived, and accordingly the "resolved" notification came:
And then a notification with the status "firing" came again:
If any additional information is necessary, I am ready to provide it. One nuance: we use many alerts with an identical name (for grouping of messages); I have shown only one type of alert (HTTPProblemOnUI). Such problems happen not only with alerts of this group. Thank you! P.S. Sorry for my poor English :)