rbs4ba opened this issue 4 years ago (status: Open)
At 10:55.10, the group of alerts changed because the pod: bar alert without the state label isn't the same as the alert with the pod: bar, state: Error labels. Hence Alertmanager respects the group_interval timer.
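The key point is that an alert's identity is its full label set, so pod: bar and pod: bar, state: Error are two different alerts inside the same group. A rough sketch of that idea (a toy fingerprint over sorted labels, not Alertmanager's actual implementation):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes a label set into a single identity value.
// This is only a sketch of the idea, not Alertmanager's real code.
func fingerprint(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff})
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	withState := map[string]string{"alertname": "PodError", "pod": "bar", "state": "Error"}
	withoutState := map[string]string{"alertname": "PodError", "pod": "bar"}

	// The two label sets hash differently, so they are two distinct alerts.
	fmt.Println(fingerprint(withState) == fingerprint(withoutState)) // false
}

Because the set of alerts in the group changes when one of these replaces the other, the next flush after group_interval produces a notification even though repeat_interval has not passed.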
me too
Alerts are re-sent every 10s, configured as follows:
group_by: ['summary','description']
group_wait: 5s
group_interval: 5s
repeat_interval: 10m
The latest version seems to have solved the problem.
alertmanager, version 0.21.0 (branch: master, revision: 41cd012c61e32ff56471c0393ca59e3d5ef7619c)
build user: wang@koynare
build date: 20200729-03:31:47
go version: go1.14.2
Me too, and here is my config:
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 25m
  receiver: 'web-notificate'
receivers:
- name: 'web-notificate'
  webhook_configs:
  - url: 'http://localhost:10101/test/alert'
    send_resolved: false
I have four rule files in Prometheus, and every file has two rules like:
groups:
- name: gateway_alerts_mtp_mam-api_GW20200805120155391
  interval: 5s
  rules:
  - alert: gateway_alerts_mtp_mam-api_GW20200805120155391_3_alertType
    expr: (sum(kong_latency_sum{route="GW20200805120155391"}) - sum(kong_latency_sum{route="GW20200805120155391"} offset 10s)) / ((sum(kong_latency_count{route="GW20200805120155391"}) - sum(kong_latency_count{route="GW20200805120155391"} offset 10s)) / 3) > 20
    labels:
      route: 'GW20200805120155391'
      alertId: '3'
    annotations:
      summary: ''
      value: '{{$value}}'
  - alert: gateway_alerts_mtp_mam-api_GW20200805120155391_4_alertType
    expr: sum(kong_http_status{service="GW20200805120155391",code!~"2..|3.."}) - sum(kong_http_status{service="GW20200805120155391",code!~"2..|3.."} offset 10s) > 0
    labels:
      route: 'GW20200805120155391'
      alertId: '4'
    annotations:
      summary: ''
      value: '{{$value}}'
By the way, I use Prometheus for Kong. For example, for the rule [gateway_alerts_mtp_mam-api_GW20200805120155391_3_alertType], I am sure the expr value is greater than 20 every 5 seconds. In this situation, at first I receive a firing message every 30 minutes (which is what I want), but after a day I receive a message every 8 minutes!
I am not sure how resolve_timeout, group_wait, group_interval, and repeat_interval work.
alertmanager version:
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4
prometheus version:
prometheus, version 2.23.0 (branch: HEAD, revision: 26d89b4b0776fe4cd5a3656dfa520f119a375273)
build user: root@37609b3a0a21
build date: 20201126-10:56:17
go version: go1.15.5
platform: linux/amd64
Here is my Prometheus config:
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - "rules/alerts_*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'kong-collector'
    static_configs:
      - targets: ['x.x.x.x:8001']
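One way to check exactly when the web-notificate receiver is hit is to run a small listener on the configured URL and look at the timestamps of the log lines. A minimal sketch in Go (assuming the standard Alertmanager webhook JSON payload; only the status and the alert count are logged):

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Minimal view of the Alertmanager webhook payload; only the fields logged here.
type payload struct {
	Status string `json:"status"`
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

func main() {
	http.HandleFunc("/test/alert", func(w http.ResponseWriter, r *http.Request) {
		var p payload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// log prefixes each line with the arrival time, so the gaps between
		// lines can be compared against group_interval and repeat_interval.
		log.Printf("status=%s alerts=%d", p.Status, len(p.Alerts))
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":10101", nil))
}

Comparing the gaps between log lines with group_interval (5m) and repeat_interval (25m) should show whether notifications really start arriving early.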
I am not sure how resolve_timeout, group_wait, group_interval, and repeat_interval work.
group_wait
How long to wait to buffer alerts of the same group before sending a notification initially.
group_interval
How long to wait before sending an alert that has been added to a group for which there has already been a notification.
repeat_interval
How long to wait before re-sending a given alert that has already been sent in a notification.
Maybe we should sync the notifyAt state across the cluster.
I have been testing notifications with Alertmanager; here are the results.
alertmanager.yml
group_wait: 5s
group_interval: 1m
repeat_interval: 2m
Algorithm: total_notif_interval = group_interval + repeat_interval, if the alert starts at 00:00:00.
Notify # | Timeline | Notify algorithm | Notes |
---|---|---|---|
1 (first alert) | 00:00:05 | total = group_wait | alert starts at 00:00:00 |
2 (no alert added to group) | 00:03:05 | total = group_interval + repeat_interval | |
3 (no alert added to group) | 00:06:05 | total = group_interval + repeat_interval | |
4 (new alert added to group) | 00:07:05 | total = group_interval | |
5 (new alert added to group) | 00:10:05 | total = group_interval + repeat_interval | |
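That timeline is consistent with a simple model: the first notification goes out after group_wait, later flushes happen every group_interval, and a flush notifies either when the group's content changed or when strictly more than repeat_interval has passed since the last notification. The sketch below replays the values above under that assumption; it is an inference from the table, not Alertmanager's documented behaviour, and the 00:06:30 arrival time of the new alert is hypothetical, chosen to reproduce row 4.

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		groupWait      = 5 * time.Second
		groupInterval  = 1 * time.Minute
		repeatInterval = 2 * time.Minute
	)

	start := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC) // alert starts at 00:00:00
	// Hypothetical moment a new alert joins the group (to mimic row 4 above).
	newAlertAt := start.Add(6*time.Minute + 30*time.Second)

	flush := start.Add(groupWait) // first flush after group_wait
	lastNotify := time.Time{}
	n := 0

	for flush.Before(start.Add(11 * time.Minute)) {
		groupChanged := newAlertAt.After(lastNotify) && !newAlertAt.After(flush)
		// Notify on the first flush, when the group changed, or when strictly
		// more than repeat_interval has passed since the last notification.
		if lastNotify.IsZero() || groupChanged || flush.Sub(lastNotify) > repeatInterval {
			n++
			fmt.Printf("notify %d at %s\n", n, flush.Sub(start))
			lastNotify = flush
		}
		flush = flush.Add(groupInterval) // subsequent flushes every group_interval
	}
}

Running this prints notifications at 5s, 3m5s, 6m5s, 7m5s and 10m5s, matching the table above.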
What did you do?
Created an alert and a route to Slack with the following config:

What did you expect to see?
One alert every 24h for the same alert.
What did you see instead? Under which circumstances?
A repeated alert being sent after some multiple of group_interval has passed (5m, 10m, 15m, etc.). What I believe is happening is that a new data point appears after some amount of time < group_interval, but it has then resolved itself after the full duration of the group_interval. Additionally, I have confirmed that there are no changes in the labels for the alert; I have viewed the query that triggers the alert and the labels stay consistent the whole time.

For example, we have an alert that is triggered when a kubernetes pod is in state: Error:
- The alert pod: foo, state: Error fires and a notification is sent after group_wait.
- The alert pod: bar, state: Error is added to the group. A notification is not sent yet because the group_interval of 5m has not passed yet.
- pod: bar is no longer in state: Error (note, pod: foo is still in state: Error).
- A new notification is sent containing pod: foo, state: Error because the pod: bar alert is no longer in state: Error.
I would expect to see either no new notification for pod: foo, state: Error until the repeat_interval has passed, or a notification only about the change to pod: bar.
Environment
System information:
Alertmanager version:
Prometheus version: