
Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Making repeat_interval work irrespective of data.retention #3473

Open · grobinson-grafana opened 1 year ago

grobinson-grafana commented 1 year ago

I noticed in the docs for repeat_interval that this interval is effectively min(repeat_interval, data.retention), because notification log entries that have not been updated within data.retention are garbage collected.

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should not be less than `group_interval`.
[ repeat_interval: <duration> | default = 4h ]

However, I don't understand why the ExpiresAt field of a notification log entry is not "refreshed" each time the aggregation group is flushed. This would prevent active alerts from having their notification log entries deleted, and would ensure the retention period only garbage collects unused data rather than data that is still in use.

I'm happy to open a PR for this work, but before I do is there a good reason why it works this way and why it should not be changed?

simonpasquier commented 1 year ago

One downside would be the additional gossip traffic on every group interval? But I imagine that Alertmanager could refresh the notification log when it approaches the (last update + data retention) mark...

grobinson-grafana commented 1 year ago

I'm thinking perhaps we don't need to reset ExpiresAt on each flush after all, and instead we can change how expiresAt is calculated to match silence.go#L557 (see https://github.com/prometheus/alertmanager/issues/1605#issuecomment-1217707514).

diff --git a/nflog/nflog.go b/nflog/nflog.go
index c318cede..04a68b70 100644
--- a/nflog/nflog.go
+++ b/nflog/nflog.go
@@ -389,10 +389,7 @@ func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []ui
                }
        }

-       expiresAt := now.Add(l.retention)
-       if expiry > 0 && l.retention > expiry {
-               expiresAt = now.Add(expiry)
-       }
+       expiresAt := now.Add(expiry).Add(l.retention)
grobinson-grafana commented 5 months ago

Discussed this with @rwwiv and @santihernandezc, and we realized there is an attack vector here: a bad actor could create routes with large repeat intervals and exhaust the Alertmanager's resources (mostly memory). This might be mitigated somewhat by the limit on aggregation groups, but we need to do more work to understand it better.