TimSimmons opened this issue 4 years ago
> Adding a route that matches every alert for auditing means that if you somehow have an alert that doesn't have a matching route below it, it won't fall back to the root route.

If that's the case then that's a bug with `continue`, can you reproduce this?
Perhaps it's a bug in my understanding; this is what I was doing: https://gist.github.com/TimSimmons/420e37b003447d63da01dc20aecb2837. It makes a fair few assumptions, but I thought the root route only got a notification if there wasn't a match in a child route.
That might just be a bug in the viewer. Have you seen the same with Alertmanager itself?
Not with something as simple as ^^, but in my current setup only alerts that don't match any child route go to the receiver on the root route.
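(For illustration, a minimal sketch of the shape of routing tree I mean; receiver names are invented, this is not the actual gist:)

route:
  receiver: root-receiver
  routes:
    # Catch-all route for auditing; continue: true should let evaluation
    # proceed to the sibling routes below it.
    - receiver: audit-webhook
      match_re:
        alertname: .+
      continue: true
    # Alerts with team=a match here; the question is whether alerts that
    # match only the catch-all still fall back to root-receiver.
    - receiver: team-a
      match:
        team: a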
Hmm, I'm not seeing a unit test in route_test.go that covers this, so a bug is possible.
Assuming that is a bug, does that change things wrt this request? I still think it would be awesome to have sent notifications logged with all the data in one place!
> Adding a route that matches every alert for auditing means that if you somehow have an alert that doesn't have a matching route below it, it won't fall back to the root route.
>
> If that's the case then that's a bug with `continue`, can you reproduce this?
I see no bug here. If we match a child, even if the child has continue: true, we should not activate the parent.
I think the original request is valid and should be done like we do Prometheus query logging.
> If we match a child, even if the child has continue: true, we should not activate the parent.
Hmm, a careful reading of the docs appears to agree.
This isn't the first time that the ability to log alerts or notifications has been requested (see https://github.com/prometheus/alertmanager/issues/1042 for alerts). I see that using a webhook + continue: true as we've recommended so far doesn't work in all cases, as @TimSimmons described.
This is logging notifications, not alerts. A webhook cannot tell you whether other webhook calls succeeded.
Sorry I missed that issue, not sure how I managed that. I want to make sure I'm clear: while I do want to log notifications, because that is the definitive action that Alertmanager takes, I also want to log some metadata about the alerts therein at the same time. The common labels between the alerts in the notification at the very least. Some of the additional info would be nice, but I'm not sure how you would do that usefully without logging everything, which is probably overkill.
Would anyone object if I took a swing at implementing this, and then the community could decide if it was :+1: or :-1:?
Any thoughts on this? This is high value IMO, and something that would have been very useful to us over years of running AM.
@TimSimmons feel free to submit something
@TimSimmons Any chance that you'd pick this up?
Unlikely that I'd take it on at this moment, sorry; I don't want to say I would and then absolutely not do so.
We're also increasingly interested in this. I envisaged it as a designated webhook receiver that is sent a copy of all successful notifications in their equivalent webhook format. I like @roidelapluie's point about equivalence with query logging, so the same JSON could be appended to a local file instead.
Given a copy of Alertmanager's config file, the reader could determine the type of the original receiver. However, there would be a brief window for inconsistency when Alertmanager reloaded. To mitigate this, the payload could contain the information in an extra field, e.g.
type AuditMessage struct {
    // Notification is a copy of the original notification in webhook format,
    // using the original receiver name.
    Notification *webhook.Message `json:"notification"`
    // Integration provides the type of the receiver the message was sent to,
    // e.g. webhook, pagerduty, email.
    Integration string `json:"integration"`
}
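(For concreteness, a hypothetical payload along those lines; the values are invented, and it's rendered here as YAML for brevity although the actual output would presumably be JSON, per the field tags above:)

notification:
  version: "4"
  groupKey: "{}:{alertname=\"HighErrorRate\"}"
  status: firing
  receiver: team-a-pagerduty
  groupLabels:
    alertname: HighErrorRate
  commonLabels:
    alertname: HighErrorRate
    team: a
  commonAnnotations:
    summary: Error rate above 5% for the last 10 minutes
  externalURL: https://alertmanager.example.com
  alerts:
    - status: firing
      labels:
        alertname: HighErrorRate
        instance: web-1
        team: a
      startsAt: "2021-01-01T00:00:00Z"
integration: pagerduty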
The extreme version of this is including the entire receiver config stanza, but that would leak secrets. Any thoughts?
I'm managing a large number of Alertmanager instances across various clusters. All their logs (along with logs of other containers) are aggregated in a centralised service. The question at hand: are alerts actually being generated?
Would the feature requested in this issue make it easier to ascertain whether instances are firing the expected alerts? Or is there a distinction between what this issue proposes and what I'm trying to achieve?
Perhaps there's already some feature that achieves this that I'm unaware of?
@matthewchivers This is about logging notifications (the output of Alertmanager) rather than alerts (the input). Many alerts from many Prometheis could result in one notification, so you'd probably be better off using the ALERTS metric in Prometheus to see what fired, e.g. something like count by (alertname) (ALERTS{alertstate="firing"}).
tl;dr: I think it would be awesome if Alertmanager logged a rich/contextual message whenever it sent a notification, one that could be used for auditing/downstream analytics/etc. The current best practice of sending every alert to a webhook receiver would work but has some drawbacks that I think are enough to warrant adding some additional logging.
My Wish :genie:
I run a centralized Alertmanager cluster that sends alerts to hundreds of different destinations/teams/people. I would like a way of tracking how many alerts are sent to each destination and integration, and what the metadata is for each alert, so that we can track it over time.
The Current Best Practice
Reading the docs and #560, it seems like the generally agreed/used practice here is to set up a route that matches all alerts right below the root route and send a webhook to something that does what you're looking for.
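A minimal sketch of that setup (receiver names invented; a hypothetical example of the documented pattern, not a definitive config):

route:
  receiver: default
  routes:
    # Catch-all audit route; continue: true lets matching proceed
    # to the team routes below.
    - receiver: audit-webhook
      match_re:
        alertname: .+
      continue: true
    - receiver: team-a
      match:
        team: a
    - receiver: team-b
      match:
        team: b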
This would work fairly well for tracking alerts that have good metadata in labels about where they're going, but there are a few things about the approach that I'm not super excited about.
I'll definitely acknowledge here that we are not using Alertmanager the way it should be used. If our routing tree were a lot cleaner, and the labels on all of our alerts were more consistent, the webhook approach would be less problematic. But the path of least resistance for me (and hopefully to the benefit of the community) is to see if I can make the change in Alertmanager; organizations/people are difficult :)
Proposal
I took a quick look at the code. My initial thought was:

- extend the Integration or Notifier interface to expose some rich metadata about the notification destination that could be logged at notify time
- log the CommonLabels or GroupLabels from one of the sent alerts (a rough sketch of what such a log entry could look like is below)

I'd be willing to take on the work, but I'll obviously defer to the expertise of the folks here on whether this is even a reasonable thing to do, and if so, on the approach.
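To make that concrete, a hypothetical shape for such a log entry; field names are invented, and it's rendered as YAML here, though an implementation would presumably emit JSON lines in the style of Prometheus's query log:

ts: "2021-01-01T00:00:00Z"
receiver: team-a-pagerduty
integration: pagerduty
status: firing
numAlerts: 3
groupLabels:
  alertname: HighErrorRate
commonLabels:
  alertname: HighErrorRate
  team: a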
Thank you so much for taking the time to read my request, and thank you for all you do maintaining Prometheus and Alertmanager!