prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

repeat_interval doesn't work if less than group_interval #3370

Closed grobinson-grafana closed 1 year ago

grobinson-grafana commented 1 year ago

What did you do?

It appears that repeat_interval doesn't work if it's less than group_interval. I'm not sure whether this is intentional, and I couldn't find it documented in https://prometheus.io/docs/alerting/latest/configuration/.

I created the following configuration file:

receivers:
- name: test
  webhook_configs:
  - url: http://127.0.0.1:8080
route:
  receiver: test
  group_wait: 5s
  group_interval: 1m
  repeat_interval: 15s

and sent an alert to Alertmanager using cURL. Alertmanager logged the following:

ts=2023-05-30T22:47:11.258Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-05-30T22:47:16.259Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-30T22:47:16.263Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1

The first notification was received:

2023/05/30 22:47:16 {"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"foo":"bar"},"annotations":{},"startsAt":"2023-05-30T22:47:11.258557+01:00","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"3fff2c2d7595e046"}],"groupLabels":{},"commonLabels":{"foo":"bar"},"commonAnnotations":{},"externalURL":"http://Georges-Air.fritz.box:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}

However, the repeat notification was not sent until 22:48:16, when it should have been sent at 22:47:31:

ts=2023-05-30T22:47:16.263Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
ts=2023-05-30T22:48:16.258Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-30T22:48:16.259Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
2023/05/30 22:48:16 {"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"foo":"bar"},"annotations":{},"startsAt":"2023-05-30T22:47:11.258557+01:00","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"3fff2c2d7595e046"}],"groupLabels":{},"commonLabels":{"foo":"bar"},"commonAnnotations":{},"externalURL":"http://Georges-Air.fritz.box:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}

The same happens again at 22:49:16:

ts=2023-05-30T22:48:16.259Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
ts=2023-05-30T22:49:16.256Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-30T22:49:16.257Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
2023/05/30 22:49:16 {"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"foo":"bar"},"annotations":{},"startsAt":"2023-05-30T22:47:11.258557+01:00","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"3fff2c2d7595e046"}],"groupLabels":{},"commonLabels":{"foo":"bar"},"commonAnnotations":{},"externalURL":"http://Georges-Air.fritz.box:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}

If I increase group_interval to 5m, then repeat notifications aren't received until 5 minutes after the first notification:

receivers:
- name: test
  webhook_configs:
  - url: http://127.0.0.1:8080
route:
  receiver: test
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 15s

ts=2023-05-30T22:50:09.391Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[6ad8c19][active]]
ts=2023-05-30T22:50:09.394Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
ts=2023-05-30T21:55:09.380Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[6ad8c19][resolved]]
ts=2023-05-30T22:55:09.381Z caller=notify.go:751 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1

I believe this happens because the aggregation group is flushed after group_interval, so if repeat_interval < group_interval, the earliest a notification can repeat is group_interval.
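To make that reasoning concrete, here is a minimal sketch of the timing in Python. It is a simplified model and not Alertmanager's actual code: it assumes a single firing alert that never changes, that the first flush happens after group_wait, that later flushes happen every group_interval, and that a repeat is only considered at a flush. The function name `notification_times` and its parameters are made up for illustration; all values are in seconds.

```python
def notification_times(group_wait, group_interval, repeat_interval, horizon):
    """Return offsets (seconds after the alert arrives) at which a
    notification is sent, under the simplified model described above."""
    sent = []
    flush = group_wait  # the first flush happens after group_wait
    while flush <= horizon:
        # A repeat can only be sent when the aggregation group is flushed,
        # and only if repeat_interval has elapsed since the last notification.
        if not sent or flush - sent[-1] >= repeat_interval:
            sent.append(flush)
        flush += group_interval  # the next flush is one group_interval later
    return sent


# First configuration: group_wait=5s, group_interval=1m, repeat_interval=15s
print(notification_times(5, 60, 15, horizon=130))
```

Under these assumptions the model yields offsets of 5, 65 and 125 seconds, which lines up with the notifications at 22:47:16, 22:48:16 and 22:49:16 for an alert received at 22:47:11. A repeat at 22:47:31 never happens because no flush occurs at that time.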

What did you expect to see?

Notifications repeated once per repeat_interval.

What did you see instead? Under which circumstances?

Notifications repeated once per group_interval (this happens when repeat_interval is less than group_interval).

Environment

I am testing on main, commit https://github.com/prometheus/alertmanager/commit/5adc7369c838c31fcbaa7d413951a2dc01ae87ae.

grobinson-grafana commented 1 year ago

Thinking about this again, I don't know if it makes sense for Alertmanager to allow configurations to contain a repeat_interval < group_interval.

For example, if repeat_interval is 15s and group_interval is 5m, would it make sense to send notifications every 15 seconds, until such time that the alerts in the aggregation group change, upon which Alertmanager will now start a timer for 5m?

gotjosh commented 1 year ago

Thinking about this again, I don't know if it makes sense for Alertmanager to allow configurations to contain a repeat_interval < group_interval.

I agree. I'm not sure how I feel about breaking people's setups on the first run, but at the very minimum we should print out a warning when Alertmanager starts.
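A rough sketch of what such a startup check could look like, written in Python for brevity rather than as Alertmanager code. Everything here is hypothetical: the helper names `parse_duration` and `check_route`, the warning text, and the assumption that the routing tree has already been loaded into nested dictionaries. Child routes are assumed to inherit intervals from their parent, and Alertmanager's built-in defaults are not applied.

```python
import re

# Seconds per unit for duration strings such as "15s", "5m" or "1h30m".
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
          "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Parse a duration string into seconds (hypothetical helper)."""
    parts = re.findall(r"(\d+)(ms|[smhdwy])", text)
    # Reject input with leftover characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def check_route(route, path="route", parent=None):
    """Return warnings for this route and, recursively, its child routes."""
    parent = parent or {}
    # A route that omits an interval uses its parent's value.
    effective = {
        key: route.get(key, parent.get(key))
        for key in ("group_interval", "repeat_interval")
    }
    gi, ri = effective["group_interval"], effective["repeat_interval"]
    warnings = []
    if gi is not None and ri is not None and parse_duration(ri) < parse_duration(gi):
        warnings.append(
            f"{path}: repeat_interval ({ri}) is less than group_interval ({gi})"
        )
    for i, child in enumerate(route.get("routes", [])):
        warnings.extend(check_route(child, f"{path}.routes[{i}]", effective))
    return warnings


# The route from the original report.
for message in check_route({"receiver": "test", "group_wait": "5s",
                            "group_interval": "1m", "repeat_interval": "15s"}):
    print("warning:", message)
```

Returning a list of messages instead of raising an error keeps existing configurations loadable, which matches the concern about not breaking people's setups.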

simonpasquier commented 1 year ago

Late to the party but setting repeat_interval < group_interval is an undocumented way to get repeated notifications at a predictable interval. For instance when using a watchdog/deadmansnitch alert, you can ensure that the notification will be emitted every group interval...

grobinson-grafana commented 1 year ago

Perhaps we should revert the warning I added then, opinions? @gotjosh @simonpasquier

itay-grudev commented 8 months ago

@grobinson-grafana Reverting it really makes sense. Below is the way Grafana OnCall recommends implementing a heartbeat alert that always triggers events every 50s, while the heartbeat check is evaluated every 1min.

config:
  route:
    routes:
      - match:
          alertname: heartbeat
        receiver: 'grafana-oncall-heartbeat'
        group_wait: 0s
        group_interval: 1m
        repeat_interval: 50s

gotjosh commented 8 months ago

Thanks for the feedback!

I don't think we should revert this. In @simonpasquier's own words:

is an undocumented way to get repeated notifications at a predictable interval

This signals that this is an exceptional use case and not relevant for most users.

itay-grudev commented 7 months ago

@gotjosh Technically, I was suggesting making it a documented way to get notifications at a predictable interval, and maybe even adding a test to preserve the behaviour in future versions, as it is widely used and recommended by the Grafana team themselves.