
Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

[feature request] Log each successfully sent notification with metadata #2304

Open TimSimmons opened 4 years ago

TimSimmons commented 4 years ago

tl;dr: I think it would be awesome if Alertmanager logged a rich/contextual message whenever it sent a notification that could be used for auditing/downstream analytics/etc. The current best practice of sending every alert to a webhook receiver would work but has some drawbacks that I think are enough to warrant adding some additional logging.

My Wish :genie:

I run a centralized Alertmanager cluster that sends alerts to hundreds of different destinations/teams/people. I would like a way of tracking how many alerts are sent to each destination and integration, and what the metadata is for each alert, so that we can track it over time.

The Current Best Practice

Reading docs and #560, it seems like the generally agreed/used practice here is to set up a route that matches all alerts right below the root route and send a webhook to something that does what you're looking for.

This would work fairly well for tracking alerts that have good metadata in labels about where they're going, but there are a few things about the approach that I'm not super excited about:

  1. You wouldn't have any way of knowing where the actual alert will be delivered (which integration/channel/etc). So counting things by actual destination (in the unfortunate world where many routes lead to similar destinations) isn't possible. It is possible to count notifications delivered by integration in metrics, but correlating that with inconsistent alert labels would be difficult.
  2. Adding a route that matches every alert for auditing means that if you somehow have an alert that doesn't have a matching route below it, it won't fall back to the root route. That's actually what I use the root route for now, catching bad user-defined alerts.
  3. (this is probably not a real issue) It adds a non-zero amount of overhead for Alertmanager and the cluster to track and execute the extra webhooks. They're computers, so it probably isn't a big deal, but it's definitely extra work being done.
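For reference, the catch-all webhook setup being discussed looks roughly like this. This is a sketch, not a recommended config: the receiver names and collector URL are hypothetical, and the exact matching syntax depends on your Alertmanager version.

```yaml
route:
  receiver: root-fallback        # only reached when no child route matches
  routes:
    # Matches every alert and forwards a copy for auditing; continue: true
    # lets evaluation proceed to the sibling routes below.
    - receiver: audit-webhook
      continue: true
    - receiver: team-a
      match:
        team: a

receivers:
  - name: audit-webhook
    webhook_configs:
      - url: http://audit-collector.example.internal/alerts
  - name: root-fallback
  - name: team-a
```

Drawback 2 above is exactly about this shape: once `audit-webhook` matches, `root-fallback` can no longer serve as a catch-all for alerts that match no team route.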

I'll definitely acknowledge here that we are not using Alertmanager in the way that it should be used. If our routing tree were a lot cleaner, and the labels on all of our alerts were more consistent, the webhook approach would be less problematic. But the path of least resistance for me (and hopefully to the benefit of the community) is to see if I can make the change in Alertmanager; organizations/people are difficult :)

Proposal

I took a quick look at the code. My initial thought was:

I'd be willing to take on the work. But I'll obviously defer to the expertise of the folks here on if this is even a reasonable thing to do, and if so on the approach.

Thank you so much for taking the time to read my request, thank you for all you do maintaining Prometheus and Alertmanager!

brian-brazil commented 4 years ago

Adding a route that matches every alert for auditing means that if you somehow have an alert that doesn't have a matching route below it, it won't fall back to the root route.

If that's the case then that's a bug with continue, can you reproduce this?

TimSimmons commented 4 years ago

Perhaps a bug in my understanding. This is what I was doing: https://gist.github.com/TimSimmons/420e37b003447d63da01dc20aecb2837. It makes a fair few assumptions, but I thought the root route only got a notification if there wasn't a match in a child route.

brian-brazil commented 4 years ago

That might just be a bug in the viewer; have you seen the same with Alertmanager itself?

TimSimmons commented 4 years ago

Not with something as simple as ^^, but in my current setup only alerts that don't match a child route go to the receiver on the root route.

brian-brazil commented 4 years ago

Hmm, I'm not seeing a unit test in route_test.go to cover this, so a bug is possible.

TimSimmons commented 4 years ago

Assuming that is a bug, does that change things wrt this request? I still think it would be awesome to have sent notifications logged with all the data in one place!

roidelapluie commented 4 years ago

Adding a route that matches every alert for auditing means that if you somehow have an alert that doesn't have a matching route below it, it won't fall back to the root route.

If that's the case then that's a bug with continue, can you reproduce this?

I see no bug here. If we match a child, even if the child has continue: true, we should not activate the parent.
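To make the semantics being debated concrete, here is a simplified, self-contained sketch of the routing-tree traversal (not Alertmanager's actual code; the types and receiver names are made up for illustration). The key behavior: `continue: true` on a child only allows evaluation to proceed to that child's siblings; once any child matches, the parent's receiver is never used.

```go
package main

import "fmt"

// Route is a simplified model of an Alertmanager routing-tree node.
type Route struct {
	Receiver string
	Matches  func(labels map[string]string) bool // nil means "match everything"
	Continue bool
	Routes   []*Route
}

// Match returns the receivers an alert resolves to. The parent's receiver is
// used only when no child matches at all.
func (r *Route) Match(labels map[string]string) []string {
	if r.Matches != nil && !r.Matches(labels) {
		return nil
	}
	var out []string
	for _, child := range r.Routes {
		sub := child.Match(labels)
		out = append(out, sub...)
		// A matching child without continue stops evaluation of its siblings.
		if len(sub) > 0 && !child.Continue {
			break
		}
	}
	if len(out) == 0 {
		// No child matched: fall back to this node's own receiver.
		out = append(out, r.Receiver)
	}
	return out
}

func main() {
	root := &Route{
		Receiver: "root",
		Routes: []*Route{
			{Receiver: "audit", Matches: func(map[string]string) bool { return true }, Continue: true},
			{Receiver: "team-a", Matches: func(l map[string]string) bool { return l["team"] == "a" }},
		},
	}
	fmt.Println(root.Match(map[string]string{"team": "a"})) // [audit team-a]
	fmt.Println(root.Match(map[string]string{"team": "b"})) // [audit]
}
```

In the second case the audit child matched, so "root" is never notified even though no team route fired, which matches the behavior TimSimmons observed.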

roidelapluie commented 4 years ago

I think the original request is valid and should be implemented the way we do Prometheus query logging.

brian-brazil commented 4 years ago

If we match a child, even if the child has continue:true, we should not activate parent.

Hmm, a careful reading of the docs appears to agree.

simonpasquier commented 4 years ago

This isn't the first time that the ability to log alerts or notifications has been requested (see https://github.com/prometheus/alertmanager/issues/1042 for alerts). I see that using webhook + continue: true as we've recommended so far doesn't work in all cases as @TimSimmons described.

roidelapluie commented 4 years ago

This is about logging notifications, not alerts. A webhook cannot tell you whether other webhook calls succeeded.

TimSimmons commented 4 years ago

Sorry I missed that issue; not sure how I managed that 😞. I want to make sure I'm clear: while I do want to log notifications, because that is the definitive action Alertmanager takes, I also want to log some metadata about the alerts therein at the same time. The common labels between the alerts in the notification, at the very least. Some of the additional info would be nice, but I'm not sure how you would do that usefully without logging everything, which is probably overboard.

TimSimmons commented 4 years ago

Would anyone object if I took a swing at implementing this, and then the community could decide if it was πŸ‘ or πŸ‘Ž ?

jnadler commented 4 years ago

Any thoughts on this? This is high value IMO, and something that would have been very useful to us over years of running AM.

simonpasquier commented 4 years ago

@TimSimmons feel free to submit something

varac commented 2 years ago

@TimSimmons Any chance you'd pick this up?

TimSimmons commented 2 years ago

πŸ‘‹ It's unlikely that I'd take it on at the moment; I'd hate to say I would and then absolutely not do so 😓

gebn commented 1 year ago

We're also increasingly interested in this. I envisaged it as a designated webhook receiver that is sent a copy of all successful notifications in their equivalent webhook format. I like @roidelapluie's point about equivalence with query logging, so the same JSON could be appended to a local file instead.

Given a copy of Alertmanager's config file, the reader could determine the type of the original receiver. However, there would be a brief window of inconsistency when Alertmanager reloaded. To mitigate this, the payload could contain the information in an extra field, e.g.

type AuditMessage struct {
    // Notification is a copy of the original notification in webhook format,
    // using the original receiver name.
    Notification *webhook.Message `json:"notification"`

    // Integration provides the type of the receiver the message was sent to,
    // e.g. webhook, pagerduty, email.
    Integration string `json:"integration"`
}

The extreme version of this is including the entire receiver config stanza, but that would leak secrets. Any thoughts?
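As a sketch of what this scheme would emit, one audit log line might look like the following. All field values are hypothetical, and the notification is abbreviated to a few of the fields the webhook payload normally carries.

```json
{
  "notification": {
    "version": "4",
    "status": "firing",
    "receiver": "team-a",
    "groupLabels": {"alertname": "HighErrorRate"},
    "commonLabels": {"alertname": "HighErrorRate", "team": "a"}
  },
  "integration": "pagerduty"
}
```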

matthewchivers commented 1 year ago

I’m managing a large number of Alertmanager instances across various clusters. All their logs (along with logs of other containers) are aggregated in a centralised service. The question at hand: are alerts actually being generated?

Would the feature requested in this issue make it easier to ascertain whether instances are firing the expected alerts? Or is there a distinction between what this issue proposes and what I'm trying to achieve?

Perhaps there's already some feature that achieves this that I'm unaware of?

gebn commented 1 year ago

@matthewchivers This is about logging notifications (the output of Alertmanager) rather than alerts (the input). Many alerts from many Prometheis could result in one notification, so you'd probably be better off using the ALERTS metric in Prometheus to see what fired.
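For that approach, a PromQL query along these lines (run against each Prometheus, with label names adjusted to your setup) counts currently firing alerts per alert name:

```promql
count by (alertname) (ALERTS{alertstate="firing"})
```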