artemlive opened this issue 7 years ago
I forgot to specify the software versions:
~]# prometheus -version
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
build user: root@ca1b452514b7
build date: 20170419-13:26:24
go version: go1.8.1
~]# alertmanager -version
alertmanager, version 0.8.0 (branch: HEAD, revision: 74e7e48d24bddd2e2a80c7840af9b2de271cc74c)
build user: root@439065dc2905
build date: 20170720-14:14:06
go version: go1.8.3
@artemlive First of all thank you very much for this very detailed bug report!
Before digging further into this, two questions: what is the repeat_interval in your Alertmanager configuration?
Hello!
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
Everything began at Thu Aug 17 20:09:29 UTC 2017 (date -d @1503000569.746) and ended at Thu Aug 17 20:48:29 UTC 2017 (date -d @1503002909.746).
Many thanks for your response!
This is quite confusing and sounds like a bug. Are you able to reproduce this with a minimal config? It is difficult for me to reconstruct your exact environment for debugging.
I'll try to reproduce it today or tomorrow. Thanks!
I have this problem, too. My Alertmanager version is 0.5.1 and my config is:
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  hipchat_auth_token: '1234556789'
  hipchat_url: 'https://hipchat.foobar.org/'
templates:
- '/etc/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname','cluster','service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1m
  receiver: webhook
  routes:
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: webhook
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname','cluster','service']
receivers:
- name: 'webhook'
  webhook_configs:
  - url: http://alert-collector:17130/alert/v1.0/alerts
We are currently on v0.11.0. I recommend you update.
We have upgraded to v0.11.0 but still have this problem. @stuartnelson3
This problem is reproduced if you create several alerts with the same name. I added another debug receiver in Alertmanager (https://github.com/prometheus/alertmanager/blob/master/examples/webhook/echo.go) and it seems I have figured out what the problem is. In my case it looked like this:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "TCP_CHECK", "instance": "127.0.0.1:9115", "job": "tcp_service_rmi", "severity": "sms" }, "annotations": { "description": "RMI port 8080 on myserver is unreachable", "summary": "RMI port 8080 on myserver is unreachable " }, "startsAt": "2018-01-12T13:10:37.803Z", "endsAt": "2018-01-12T13:42:37.804Z", "generatorURL": "http://prometheus:9090/graph?g0.expr=probe_myserver2_8080_success%7Bjob%3D%22tcp_service_rmi%22%7D+%21%3D+1\u0026g0.tab=0" } ],
The generatorURL data comes from another service with the same alert name. I gave each alert a unique name, and now the problem no longer reproduces.
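For reference, a minimal sketch of how such a debug receiver could be wired in, assuming the linked echo.go example is running locally (the URL and the catch-all route are illustrative, not the poster's actual config):

route:
  routes:
  # Catch-all child route placed first: copy every alert to the debug
  # receiver, then keep evaluating the remaining routes as usual.
  - receiver: 'debug'
    continue: true

receivers:
- name: 'debug'
  webhook_configs:
  # Hypothetical address; point this at wherever the echo.go receiver listens.
  - url: 'http://127.0.0.1:5001/'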
Some kind of magic... Here is an alert defined in Prometheus:
- alert: SS_CONFIGURATIONS_ADS
  expr: probe_ss_config{job=~".*_ui_health"} != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    description: SS_CONFIGURATIONS service {{ $labels.name }} on {{ $labels.network }}) on server {{ $labels.server }}
    summary: configs {{ $labels.name }}{{ $labels.network }}
There is a metric named probe_ss_config. Here is the result of a query that sums all checks over 2h; it equals 0, which means the service was down (every check over the 2h returned 0). Vector of data for 2h (all checks equal to 0): http://joxi.net/n2YB707ijzlYEA, and the sum of all elements is equal to 0: http://joxi.net/8AnWDNDTqJpqor. But sometimes Prometheus sends a "resolved" status for the alert:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "annotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "startsAt": "2018-01-23T14:32:05.524923261Z", "endsAt": "2018-01-23T14:33:05.524853742Z", "generatorURL": "http://prometheus-high-cpu:9090/graph?g0.expr=probe_ss_config%7Bjob%3D~%22.%2A_ui_health%22%7D+%21%3D+1\u0026g0.tab=1" } ], "groupLabels": { "alertname": "SS_CONFIGURATIONS_ADS" }, "commonLabels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "commonAnnotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "externalURL": "http://prometheus.myserver.com:9095", "version": "4", "groupKey": "{}/{severity=~\"^(?:critical|sms)$\"}:{alertname=\"SS_CONFIGURATIONS_ADS\"}" }
But probe_ss_config is equal to 0, as I showed in the screenshot below.
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
alertmanager, version 0.13.0 (branch: HEAD, revision: fb713f6d8239b57c646cae30f78e8b4b8861a1aa)
build user: root@d83981af1d3d
build date: 20180112-10:32:46
go version: go1.9.2
@artemlive do you still see the issue with the latest AlertManager release?
@simonpasquier I can confirm this is not resolved with the latest release.
@justinjk007 you would have to open another issue to provide your specific details.
@artemlive I've read your description again and it doesn't seem related to AlertManager, since it acts properly on the information received from Prometheus. If anything, it is a problem with the alerting engine, though I'm suspicious that Prometheus computes the alert's status wrongly (it might be that you've got a failed scrape which resets the alert). To alleviate flapping, you could use an aggregation-over-time function like this:
min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1
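For illustration, a sketch of how that expression might be dropped into the SS_CONFIGURATIONS_ADS rule shown earlier (the for duration and labels are carried over from that example; the 5m window is an assumption to adjust to the scrape interval):

- alert: SS_CONFIGURATIONS_ADS
  # min_over_time returns the lowest value seen in the last 5 minutes, so a
  # single missed or briefly successful scrape does not flip the expression
  # back to healthy and resolve the alert right away.
  expr: min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1
  for: 5m
  labels:
    severity: critical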
@simonpasquier opened #1546
I, too, have what I think may be the same issue. Prometheus continues to report the alert as active to AlertManager; however, AlertManager reports the alert as resolved and then re-creates it minutes later.
The Alertmanager logs at debug level (ts and aggrGroup removed to reduce noise):
Feb 25 19:55:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:57:29 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:57 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:03:26 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
The alert state goes from active to resolved and back again while Prometheus itself continues to report the alert as active. There is a corresponding gap in the 'Received alert' messages from Prometheus, caused by the Prometheus configuration being reloaded, which happens frequently in our environment.
The gap in messages is less than the resolve_timeout.
AlertManager Details:
GoVersion: go1.10.3
Revision: 462c969d85cf1a473587754d55e4a3c4a2abc63c
Version: 0.15.0
Config Summary:
global:
  resolve_timeout: 10m
  http_config: {}
route:
  receiver: ops_team
  group_by:
  - alertname
  - site_code
  - cluster
  - instance
  - instance_id
  - resource_id
  - backend
  - service
  - interface_id
  routes:
  - receiver: ops
    match:
      team: ops_team
    continue: true
  group_wait: 0s
  group_interval: 3m
  repeat_interval: 2h
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
Further investigation shows that, for us, the issue was introduced in Prometheus 2.4 (#4550).
Prometheus now sends the endsAt field to Alertmanager with a very short timeout, after which Alertmanager marks the alert as resolved. This overrides the resolve_timeout setting in Alertmanager and creates the firing->resolved->firing behavior if Prometheus does not resend the alert before that short timeout.
We will try increasing the 'resend_delay' on Prometheus to 2m and see if there are any side effects.
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
@steve-exley Yes, please try anything above and including 0.15.3 as those include https://github.com/prometheus/alertmanager/pull/1611.
We still had some of these issues on 0.16.1. The underlying cause is that the endsAt time period generated by Prometheus does not leave much margin for error, and AlertManager resolves the alert if it doesn't get the next update from Prometheus in time.
I think changing the Prometheus 'rules.alert.resend-delay' to 2m will resolve this in our scenario, because the endsAt value is calculated from it. It's been less than 24 hours since we made the change, so it's too early to be sure.
I also raised a proposal on Prometheus (#5277) to discuss whether the endsAt time period is too short.
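For reference, rules.alert.resend-delay is a Prometheus command-line flag rather than a prometheus.yml setting, so the change described above might look roughly like this (the config path is illustrative):

prometheus --config.file=/etc/prometheus/prometheus.yml \
  --rules.alert.resend-delay=2m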
The generatorURL data comes from another service with the same alert name. I gave each alert a unique name, and now the problem no longer reproduces.
I can confirm the same behavior on:
prometheus, version 2.3.2+ds (branch: debian/sid, revision: 2.3.2+ds-1~bpo9+1)
alertmanager, version 0.15.3+ds (branch: debian/sid, revision: 0.15.3+ds-2~bpo9+1)
Unique alert names fix the issue.
+1
mark
Same problem:
alertmanager version: v0.18.0
prometheus version: v2.17.1
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[67399ab][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[1aa03ad][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[cadfe5a][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[61b70c9][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][active]
level=debug ts=2020-07-29T09:11:33.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[18f376c][active]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[fbd7908][resolved]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][resolved]
level=debug ts=2020-07-29T09:11:43.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[faf133d][active]
level=debug ts=2020-07-29T09:11:43.322Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9d7d714][resolved]
Hello! I have been struggling with one problem since I started using Prometheus. I use blackbox_exporter to monitor the availability of a domain over HTTP, and Alertmanager to send messages to Slack, Telegram and SMS. Sometimes the domain is unavailable over HTTP and Prometheus correctly sends notifications that the server is unavailable, but at some point a message with the status "resolved" arrives while the server is still unavailable, and some time later a message with the status "firing" arrives again. The configuration is given below. Prometheus (part of config):
blackbox.yml
alertmanager.yml
alert.rules
Part of the Alertmanager journal (alert HTTPProblemOnUI[4f4a71c][active]):
Here it went into the "resolved" state (but the query shows that the check at this point still returns probe_success 0; I'll show it below):
Then it became "active" again:
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
Everything began at:
and ended at:
Here is the data that Prometheus sent to Alertmanager during this interval:
Then, for some reason, an alert with "endsAt" set arrived, and accordingly the "resolved" notification came:
And then a notification with the status "firing" came again:
If any additional information is necessary, I am ready to provide it. One nuance: we use many alerts with an identical name (for grouping of messages); I have shown only one type of alert (HTTPProblemOnUI). Such problems happen not only with alerts of this group. Thank you! P.S. Sorry for my poor English :)