prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Repeated notifications sooner than `repeat_interval` #2320

Open rbs4ba opened 4 years ago

rbs4ba commented 4 years ago

What did you do? Created an alert and a route to Slack with the following config:

group_wait: 10s
group_interval: 5m
repeat_interval: 24h

What did you expect to see? One notification every 24h for the same alert

What did you see instead? Under which circumstances? A repeated notification is sent after some multiple of group_interval has passed (5m, 10m, 15m, etc.). What I believe is happening is that a new data point appears after some time shorter than group_interval, but it has resolved itself by the time the full group_interval has elapsed. I have also confirmed that the alert's labels do not change: I have viewed the query that triggers the alert, and the labels stay consistent the whole time.

For example, we have an alert that is triggered when a kubernetes pod is in the state: Error:

I would expect to see either:

Environment

$ prometheus --version
prometheus, version 2.19.2 (branch: HEAD, revision: c448ada63d83002e9c1d2c9f84e09f55a61f0ff7)
  build user:       root@dd72efe1549d
  build date:       20200626-09:02:20
  go version:       go1.14.4

simonpasquier commented 4 years ago

At 10:55:10, the group of alerts changed because the pod: bar alert without the state label isn't the same alert as the one with the pod: bar, state: Error labels. Hence Alertmanager respects the group_interval timer.
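
To illustrate the point about label sets, here is a minimal Python sketch. It assumes, as a simplification, that an alert's identity is just its full set of label name/value pairs. The helper names alert_identity and group_changed are hypothetical and are not Alertmanager functions.

```python
def alert_identity(labels):
    """Identify an alert by its complete label set (simplifying assumption)."""
    return frozenset(labels.items())


def group_changed(previous, current):
    """True if the set of alert identities in a group differs between two flushes."""
    return {alert_identity(a) for a in previous} != {alert_identity(a) for a in current}


# Label values taken from the example discussed above.
with_state = {"pod": "bar", "state": "Error"}
without_state = {"pod": "bar"}

print(alert_identity(with_state) == alert_identity(without_state))  # False
print(group_changed([with_state], [without_state]))                 # True
print(group_changed([with_state], [with_state]))                    # False
```

Under this assumption, dropping or adding a single label produces a different alert, so the group counts as changed even though the pod is the same.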

wangfeiping commented 4 years ago

me too

Alerts are re-sent every 10s with the following configuration:

  group_by: ['summary','description']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 10m

wangfeiping commented 4 years ago

The latest version seems to have solved the problem.

alertmanager, version 0.21.0 (branch: master, revision: 41cd012c61e32ff56471c0393ca59e3d5ef7619c)
  build user:       wang@koynare
  build date:       20200729-03:31:47
  go version:       go1.14.2

luguohong commented 3 years ago

Me too. Here is my config:

global:
  resolve_timeout: 5m

route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 25m
  receiver: 'web-notificate'
receivers:
  - name: 'web-notificate'
    webhook_configs:
      - url: 'http://localhost:10101/test/alert'
        send_resolved: false

I have four rule files in Prometheus, and each file has two rules like these:

groups:
- name: gateway_alerts_mtp_mam-api_GW20200805120155391
  interval: 5s
  rules:
  - alert: gateway_alerts_mtp_mam-api_GW20200805120155391_3_alertType
    expr: (sum(kong_latency_sum{route="GW20200805120155391"}) - sum(kong_latency_sum{route="GW20200805120155391"} offset 10s)) / ((sum(kong_latency_count{route="GW20200805120155391"}) - sum(kong_latency_count{route="GW20200805120155391"} offset 10s)) / 3) > 20
    labels:
      route: 'GW20200805120155391'
      alertId: '3'
    annotations:
      summary: ''
      value: '{{$value}}'
  - alert: gateway_alerts_mtp_mam-api_GW20200805120155391_4_alertType
    expr: sum(kong_http_status{service="GW20200805120155391",code!~"2..|3.."}) - sum(kong_http_status{service="GW20200805120155391",code!~"2..|3.."} offset 10s) > 0
    labels:
      route: 'GW20200805120155391'
      alertId: '4'
    annotations:
      summary: ''
      value: '{{$value}}'

By the way, I use Prometheus to monitor Kong. Take the rule gateway_alerts_mtp_mam-api_GW20200805120155391_3_alertType as an example: I am sure that the expr value is greater than 20 at every 5-second evaluation. In this situation I initially received a firing message every 30 minutes (which is what I want), but after some days I started receiving a message every 8 minutes!

i am not sure how resolve_timeout, group_wait, group_interval, repeat_interval work.

alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4

prometheus version:

prometheus, version 2.23.0 (branch: HEAD, revision: 26d89b4b0776fe4cd5a3656dfa520f119a375273)
  build user:       root@37609b3a0a21
  build date:       20201126-10:56:17
  go version:       go1.15.5
  platform:         linux/amd64

Here is the Prometheus config:

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
        - localhost:9093
rule_files:
  - "rules/alerts_*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'kong-collector'
    static_configs:
      - targets: ['x.x.x.x:8001']

lzh-lab commented 3 years ago

i am not sure how resolve_timeout, group_wait, group_interval, repeat_interval work.

group_wait

How long to wait to buffer alerts of the same group before sending a notification initially.

group_interval

How long to wait before sending an alert that has been added to a group for which there has already been a notification.

repeat_interval

How long to wait before re-sending a given alert that has already been sent in a notification.

https://www.robustperception.io/whats-the-difference-between-group_interval-group_wait-and-repeat_interval
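
As a rough worked example of how these timers combine, the Python sketch below assumes a simplified model: after a notification the group is re-checked once every group_interval, and an unchanged group is re-notified only at the first check where strictly more than repeat_interval has passed. The function name effective_repeat_period is hypothetical, and the model is an assumption that happens to match the 30-minute and 3-minute periods reported in this thread, not a description of Alertmanager's code.

```python
def effective_repeat_period(group_interval, repeat_interval):
    """Seconds between repeats of an unchanged group under the assumed model."""
    # Checks happen at multiples of group_interval after the last notification;
    # pick the first multiple that is strictly greater than repeat_interval.
    checks = repeat_interval // group_interval + 1
    return checks * group_interval


MINUTE = 60
# group_interval: 5m, repeat_interval: 25m
print(effective_repeat_period(5 * MINUTE, 25 * MINUTE) // MINUTE)  # 30
# group_interval: 1m, repeat_interval: 2m
print(effective_repeat_period(1 * MINUTE, 2 * MINUTE) // MINUTE)   # 3
```

Under this model an unchanged group is never re-notified sooner than repeat_interval, so a shorter gap would suggest that the group itself changed between checks.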

damnever commented 2 years ago

Maybe we should sync the notifyAt state across the cluster.

zhaogaolong commented 2 years ago

I tested notifications with Alertmanager; the results are below.

alertmanager.yml

  group_wait: 5s
  group_interval: 1m
  repeat_interval: 2m

Algorithm: total_notif_interval = group_interval + repeat_interval, with the alert starting at 00:00:00

| Notify times | Time line | Notify algorithm | Note |
|---|---|---|---|
| 1 (first alert) | 00:00:05 | total = group_wait | alert start at 00:00:00 |
| 2 (no alert add group) | 00:03:05 | total = group_interval + repeat_interval | |
| 3 (no alert add group) | 00:06:05 | total = group_interval + repeat_interval | |
| 4 (new alert add group) | 00:07:05 | total = group_interval | |
| 5 (new alert add group) | 00:10:05 | total = group_interval + repeat_interval | |
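
The timeline above can be reproduced with a small simulation. This is only a sketch under assumptions: the group is flushed group_wait after the first alert and then every group_interval, a changed group is notified at the next flush, and an unchanged group is notified only when strictly more than repeat_interval has passed since the last notification. The function simulate and the arrival time of the second alert (00:06:30) are hypothetical choices made for illustration.

```python
def simulate(group_wait, group_interval, repeat_interval, arrivals, until):
    """Return notification times (seconds) for one alert group, assumed model."""
    arrivals = sorted(arrivals)
    sent = []
    last_size = 0  # number of alerts included in the last notification
    flush = arrivals[0] + group_wait
    while flush <= until:
        size = sum(1 for t in arrivals if t <= flush)
        changed = size != last_size
        overdue = bool(sent) and flush - sent[-1] > repeat_interval
        if changed or overdue:
            sent.append(flush)
            last_size = size
        flush += group_interval
    return sent


def hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    return "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)


# group_wait: 5s, group_interval: 1m, repeat_interval: 2m (from the config above).
# First alert at 00:00:00; a second alert joins the group at 00:06:30 (assumed).
times = simulate(5, 60, 120, arrivals=[0, 390], until=605)
print([hms(t) for t in times])
# ['00:00:05', '00:03:05', '00:06:05', '00:07:05', '00:10:05']
```

The strict comparison is what pushes an unchanged group's repeat from 2m to 3m in this model: the flush exactly 2m after a notification is not yet overdue, so the repeat goes out one group_interval later.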