prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

GroupWait and Interval 0 causes a lot of allocs and logs #1114

Closed gouthamve closed 6 years ago

gouthamve commented 7 years ago

What did you do? Ran a single AM instance with GroupWait and Interval 0 and debug logging.

What did you expect to see? Chatty but useful logs.

What did you see instead? Under which circumstances? A huge flood!

receivers:



* Logs:

level=debug ts=2017-11-20T17:12:00.225274251Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225427596Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225531945Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225607822Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225677089Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225736667Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225795896Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225871146Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.225968086Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226105244Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.22625187Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226384052Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226514799Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226650737Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226782719Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226873043Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.226945495Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227016107Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.22708717Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.22715533Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.22722366Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227304341Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227421041Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227556633Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227703011Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227852496Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.227989343Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.228120466Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]
level=debug ts=2017-11-20T17:12:00.228249764Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}/{severity=\"deadman\"}:{alertname=\"DeadManBoy\"}" msg=Flushing alerts=[DeadManBoy[8ba1e9d][active]]



This is a super small subset.

gouthamve commented 7 years ago

Sorry. Config issue on my end. I set both group_wait and group_interval to 0.

Not sure if that should cause a barrage of notifications though. We can at least protect against this case by adding a check.

Feel free to close if this doesn't make sense.

gouthamve commented 7 years ago

So notifications were only being sent every repeat_interval. But running this loop so tightly causes a lot of allocations: rate(go_memstats_alloc_bytes_total) = 60M

fabxc commented 6 years ago

People usually set intervals that are too low. But 0 is just 100% wrong semantically. High resource usage is very much to be expected in this case.

We should probably return a configuration error in this case.

simonpasquier commented 6 years ago

@fabxc IIUC this is similar to #583.

In the comments, another proposal was to fall back to a sane minimum value (e.g. 1s) when the parameter is zero. Personally I'd prefer a hard failure (explicit over implicit), but I wanted to check.

fabxc commented 6 years ago

Yes, a hard failure is probably better so that there are no wrong expectations.

simonpasquier commented 6 years ago

Thanks! I'll give it a shot then

puneets-ampere commented 9 months ago

If I am only interested in repeat_interval and do not need group_interval, how do I express this in the configuration? Maybe group_interval=0 could have helped in this scenario?