prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Split retention in 2 parts or multiple nflogs #2961

Open roidelapluie opened 2 years ago

roidelapluie commented 2 years ago

We have a default of 120h retention.

While this default seems fine for silences, it seems far too high for the nflog. Ideally, the nflog should be kept only for about 110% of x = max(group_wait, group_interval, repeat_interval). With a large number of alerts and a low x, Alertmanager unnecessarily uses a lot of memory, because the state is broadcast perpetually.

Here is a heap profile of such a case: https://share.polarsignals.com/73d955e/
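To make the "~110% of x" rule concrete, here is a minimal sketch in Go. The helper `nflogRetention` is hypothetical and not part of Alertmanager. It assumes a single set of timers, whereas a real configuration would need the maximum across all routes.

```go
package main

import "time"

// nflogRetention returns a suggested notification-log retention: the largest
// of the three routing timers plus a 10% margin (the "~110%" rule above).
// Hypothetical helper for illustration; not an Alertmanager API.
func nflogRetention(groupWait, groupInterval, repeatInterval time.Duration) time.Duration {
	x := groupWait
	if groupInterval > x {
		x = groupInterval
	}
	if repeatInterval > x {
		x = repeatInterval
	}
	// x + x/10 is 110% of x, computed with integer duration arithmetic.
	return x + x/10
}
```

With group_wait=30s, group_interval=5m and repeat_interval=4h, this gives 4h24m, far below the 120h default.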

I see multiple ways forward:

roidelapluie commented 2 years ago

Maybe we could also write a new nflog in which the duplicates are removed, as part of some garbage collection process.
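One possible reading of this idea, sketched in Go under stated assumptions: the `entry` type and `compact` function below are hypothetical simplifications, not the actual nflog data structures, and "duplicate" is assumed to mean several records sharing the same key.

```go
package main

import "time"

// entry is a hypothetical, simplified notification-log record.
type entry struct {
	Key       string    // assumed identity, e.g. group key plus receiver
	Timestamp time.Time // when the notification was logged
	ExpiresAt time.Time // when the record may be discarded
}

// compact builds a new log that drops expired records and keeps only the
// newest record for each key.
func compact(entries []entry, now time.Time) map[string]entry {
	out := make(map[string]entry)
	for _, e := range entries {
		if !e.ExpiresAt.After(now) {
			continue // expired: garbage-collect
		}
		if cur, ok := out[e.Key]; !ok || e.Timestamp.After(cur.Timestamp) {
			out[e.Key] = e // first record for this key, or newer than the kept one
		}
	}
	return out
}
```

Running such a pass periodically would bound the log to one live record per key, whatever the retention setting.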