Open grobinson-grafana opened 1 year ago
One downside would be the additional gossip traffic on every group interval? But I imagine that Alertmanager could refresh the notification log when it approaches the (last update + data retention) mark...
I'm thinking perhaps we don't need to reset ExpiresAt
on each flush after all, and instead we can change how expiresAt
is calculated to match silence.go#L557
https://github.com/prometheus/alertmanager/issues/1605#issuecomment-1217707514.
diff --git a/nflog/nflog.go b/nflog/nflog.go
index c318cede..04a68b70 100644
--- a/nflog/nflog.go
+++ b/nflog/nflog.go
@@ -389,10 +389,7 @@ func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []ui
}
}
- expiresAt := now.Add(l.retention)
- if expiry > 0 && l.retention > expiry {
- expiresAt = now.Add(expiry)
- }
+ expiresAt := now.Add(expiry).Add(l.retention)
I'm thinking perhaps we don't need to reset
ExpiresAt
on each flush after all, and instead we can change howexpiresAt
is calculated to matchsilence.go#L557
#1605 (comment).diff --git a/nflog/nflog.go b/nflog/nflog.go index c318cede..04a68b70 100644 --- a/nflog/nflog.go +++ b/nflog/nflog.go @@ -389,10 +389,7 @@ func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []ui } } - expiresAt := now.Add(l.retention) - if expiry > 0 && l.retention > expiry { - expiresAt = now.Add(expiry) - } + expiresAt := now.Add(expiry).Add(l.retention)
Discussed this with @rwwiv and @santihernandezc and we realized there is an attack vector here where a bad actor could create routes with large repeat intervals and exhaust the resources of the Alertmanager (mostly memory). This might be mitigated somewhat by the limit on aggregation groups, but we need to do more work to understand it better.
I noticed that in the docs for
repeat_interval
that this interval is equivalent tomin(repeat_interval, data.retention)
because notification log entries that have not been updated indata.retention
are garbage collected.However, I don't understand why the
ExpiresAt
field for a notification log entry is not "refreshed" each time the aggregation group is flushed. This would prevent active alerts from having their notification logs deleted, and ensure the retention period only garbage collects unused data rather than data that is still in use.I'm happy to open a PR for this work, but before I do is there a good reason why it works this way and why it should not be changed?