grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

building alerting system for grafana #2209

Closed Dieterbe closed 8 years ago

Dieterbe commented 10 years ago

Hi everyone, I recently joined raintank and I will be working with @torkelo, @mattttt , and you, on alerting support for Grafana.

From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana. I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us: we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.

First of all, terminology sync:

I want to spec out requirements, possible implementation ideas and their pros and cons. With your feedback, we can adjust, refine and choose a specific direction.

General thoughts:

That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")

The raintank/grafana version currently has an alerting package with a simple scheduler, a worker bus (in-process as well as rabbitmq-based), an alert executor and email notifications. It uses the bosun expression libraries, which give us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc). This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform, but notably still missing are:

  1. an interface to create and manage alerting rules
  2. state management (acknowledgements etc)

these are harder problems, which I hope to tackle with your input.
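To make the execution platform described above (scheduler, in-process worker bus, executor, notifications) a bit more concrete, here is a minimal Python sketch. It is not the raintank code: the `Rule` class, the tick-based scheduling and the function names are assumptions made purely for illustration.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    frequency: int             # seconds between evaluations (hypothetical field)
    check: Callable[[], bool]  # returns True when the alert condition holds


def schedule(rules, now, jobs):
    """Scheduler: enqueue every rule whose frequency divides the current tick."""
    for rule in rules:
        if now % rule.frequency == 0:
            jobs.put(rule)


def execute_all(jobs, notify):
    """Executor: drain the in-process queue and notify for each firing rule."""
    while not jobs.empty():
        rule = jobs.get()
        if rule.check():
            notify(f"{rule.name} is alerting")


jobs = queue.Queue()  # stands in for the in-process worker bus
sent = []             # stands in for email notifications
rules = [
    Rule("cpu_idle", frequency=10, check=lambda: True),
    Rule("disk_free", frequency=60, check=lambda: False),
    Rule("error_ratio", frequency=30, check=lambda: True),
]
schedule(rules, now=60, jobs=jobs)  # at tick 60 all three rules are due
execute_all(jobs, sent.append)
print(sent)
```

Swapping the `queue.Queue` for a message broker would give the distributed variant; the scheduler and executor would not need to know the difference.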

Requirements, Future implementations

First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization). You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time and backtest on historical data, so you can get them just right. And it has a good state machine. In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of Golang api, but then we have less fine-grained control and for now I feel more comfortable trying it out piece by piece (piece meaning golang package) and making the integration decision on a case by case basis. Though the integration may look different down the road based on experience and as we figure out what we want our alerting to look like.

Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:

  1. some visualized metrics (metrics plotted on graphs) are not alerted on
  2. some visualized metrics are alerted on:
    • A: with simple threshold checks: easy to visualize alerting logic
    • B: with more advanced logic (e.g. look at standard deviation of the series being plotted, compare current median against historical median, etc): can't easily be visualized next to the input series
  3. some metrics used in alerting logic are not to be visualized

Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts on (A), and V and A have some overlap. I need to think about this a bit more and wonder what y'all think. There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.

There's a few more complications which I'll explain through an example sketch of what alerting could look like: [image: sketch]

let's say we have a timeseries for requests (A) and one for erroneous requests (B) and this is what we want to plot. we then use fields C,D,E to put stuff that we don't want to alert on. C contains the formula for the ratio of error requests against the total.

we may for example want to alert (see E) if the median of this ratio in the last 5min is more than 1.5 times what the ratio was in the same 5-minute period last week, and also if the errors seen in the last 5min are worse than the errors seen from 2 months ago until 5min ago.
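As a rough illustration of the first condition (current median ratio versus the same window last week), here is a small Python sketch. The function names and sample numbers are made up, and the second condition is left out because "worse" is not defined precisely enough here to encode.

```python
from statistics import median


def error_ratio(requests, errors):
    """Field C: per-point ratio of error requests to total requests."""
    return [e / r if r else 0.0 for r, e in zip(requests, errors)]


def should_alert(ratio_now, ratio_last_week, factor=1.5):
    """True when the current median ratio exceeds factor x last week's median."""
    return median(ratio_now) > factor * median(ratio_last_week)


# Made-up datapoints for the last 5 minutes and the same window a week ago.
now = error_ratio([100, 120, 110], [6, 9, 5])
last_week = error_ratio([100, 100, 100], [2, 3, 4])
print(should_alert(now, last_week))
```

With these numbers the current median ratio is 0.06 against 0.03 last week, so the 1.5x rule fires.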

notes:

other ponderings:

I think for an initial implementation every graph could have two fields, like so:

warn: - expression
      - notification settings (email, http hook, ..)
crit: - expression
      - notification settings

where the expression is something like what I put in E in the sketch. for logic/data that we don't want to visualize, we just toggle off the visibility icon. grafana would replace the variables in the formulas and execute the expression (with the current bosun based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
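A hedged sketch of the two-field idea in Python: each panel carries a `warn` and a `crit` level, every level has an expression plus notification settings, and only state changes are recorded (those are what could be stored as annotations). Expressions are plain Python callables here rather than bosun expressions, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Callable, Dict, List

Series = Dict[str, List[float]]  # query letter (A, B, C, ...) -> datapoints


@dataclass
class Level:
    expression: Callable[[Series], bool]
    notify: List[str] = field(default_factory=list)  # e.g. ["email", "http"]


@dataclass
class PanelAlert:
    warn: Level
    crit: Level

    def state(self, series: Series) -> str:
        """crit takes precedence over warn; otherwise the panel is ok."""
        if self.crit.expression(series):
            return "crit"
        if self.warn.expression(series):
            return "warn"
        return "ok"


def state_changes(alert, samples):
    """Return (index, old_state, new_state) for every change of state."""
    previous = "ok"
    events = []
    for i, series in enumerate(samples):
        current = alert.state(series)
        if current != previous:
            events.append((i, previous, current))
            previous = current
    return events


alert = PanelAlert(
    warn=Level(lambda s: median(s["C"]) > 0.05, ["email"]),
    crit=Level(lambda s: median(s["C"]) > 0.10, ["email", "http"]),
)
samples = [
    {"C": [0.01, 0.02, 0.03]},
    {"C": [0.06, 0.07, 0.08]},
    {"C": [0.06, 0.07, 0.09]},
    {"C": [0.2, 0.3, 0.4]},
    {"C": [0.0, 0.0, 0.0]},
]
events = state_changes(alert, samples)
print(events)
```

The third sample stays in `warn`, so no event is produced for it; only transitions would end up in the annotation store.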

Thoughts? Do you have concerns or needs that I didn't address?

kokarn commented 9 years ago

For me number 2 seems better in that you can really minimize the number of things you have to configure for everything to work smoothly. In our current setup we would go with that.

Just so it has been said if everybody else disagrees ;)

Quick edit: Awesome slides. If the ending product comes out looking half as good as that then it's amazing.

aferrari-technisys commented 9 years ago

+1 I agree that an internal notification handler with these integrations is perfect for the most common use cases!

I'll be happy to beta test :) and the slides are amazing!

mattttt commented 9 years ago

I think @Dieterbe's last post clears things up quite a bit, but I wanted to post this quick diagram to further clarify.

Alerting in Grafana is really two things: the self-service alert config (thanks @jaimegago, couldn't have said it better myself) and the handler itself.

We'll be shipping a Grafana alert handler, but also providing the framework to integrate with your alerting software of choice:

[image: alerting-structure-layout]

BarthV commented 9 years ago

+1 for building a kind of bridge to other alerting systems (maybe we could think about implementing some generic alerting plugin system :-) )

grobie commented 9 years ago

You could add Prometheus also on the "External Alert Handlers" part. The first Prometheus alertmanager version is in production in several companies and a complete rewrite is currently in the works. SoundCloud might use Grafana to configure alerts, but very certainly only if the Prometheus alertmanager is used as alert handler.

mattttt commented 9 years ago

@grobie, good catch, fixed in original comment.

jaimegago commented 9 years ago

@mattttt @Dieterbe that's great! Looks like you're on the path of "batteries included but removable", which is IMHO the best of both worlds. Have you already thought about how you are going to pass authorization data to the alert handler? I'm thinking of a story like this: As a Grafana user I'd like to be alerted via email and/or pagerduty and/or foo when (some condition built via the Grafana alerting UI) happens. That user should only be able to send alerts to the notification systems he is authorized for; this is a requirement for self service and will need to be addressed somehow. Since Grafana 2 we have a SQL backend + user authentication/authorization with LDAP integration, so it doesn't seem too far-fetched to have that capability from day one of alerting? With Sensu (which is the tool I'd be plugging in) passing the alert target (e.g. email address) via the handler should be quite straightforward, can't say about the others.

Crapworks commented 9 years ago

Hi all, I am happy to see that this effort is getting pushed forward, since I love the self-service alert configuration approach.

Personally, I don't need a specific alert handler. I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit). So I would be happy with an HTTP handler that sends all information about an alert as JSON data.

Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?

Keep up the good work guys!

Cheers

Dieterbe commented 9 years ago

I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit)

so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15" ) and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?

Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?

this is also something we need to think about more. This can get messy and if people use something like pagerduty or flapjack then they can use that to aggregate events/suppress duplicates, so we're looking at whether we can avoid implementing that in the grafana handler, though we may have to. Note also that because you'll be able to set alerts on arbitrary metrics query expressions, you'll have a lot of power to take past data into account in the actual expression, and so you can create a more robust signal in the expression that doesn't change state as often.
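To illustrate that last point, here is a small Python sketch comparing a naive per-point threshold with an expression that looks at the median of the last few points. The window size, threshold and sample values are assumptions chosen only to show the effect on state changes.

```python
from statistics import median


def naive_states(values, threshold):
    """Alert state per point: True when the point itself is below threshold."""
    return [v < threshold for v in values]


def windowed_states(values, threshold, window=3):
    """Alert only when the median of the last `window` points is below threshold."""
    states = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        states.append(median(recent) < threshold)
    return states


def transitions(states):
    """Count how often the state flips between consecutive evaluations."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Made-up cpu idle values: a few short dips, then a sustained drop.
cpu_idle = [20, 10, 20, 20, 10, 20, 5, 4, 3, 2]
print(transitions(naive_states(cpu_idle, 15)))
print(transitions(windowed_states(cpu_idle, 15)))
```

On this data the naive check flips state five times, while the windowed expression changes state once, when the sustained drop begins.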

Crapworks commented 9 years ago

so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15" ) and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?

Kind of. I'm actually looking forward to grafana alerting to get rid of nagios. Using the HTTP handler, you need to configure passive checks for nagios to be able to submit the results there. But I would like to have grafana as the one source where you can configure alerting. In our case, the people who are allowed to add alerts are the actual sysadmin who would also configure the checks in nagios.

With the http handler grafana would have everything we need for that: a dashboard for real time monitoring, an API, easy alerting configuration and an http handler where we can forward alerts to our internal notification system.

Cheers

csadai commented 9 years ago

Although I can see the logic in this integration strategy, I cannot help but think it's a little bit overkill. From what I understand (and what I could read in the thread), the only reason most Grafana users keep using a standalone alerting technology is that Grafana doesn't offer one. So, wouldn't it be more "lean" to focus first on the Grafana Alerting part, and develop it as a component that communicates with the rest of the stack through the API, so that other contributors would be able to mimic the behaviour and create specific adapters later?

tl;dr: by focusing on building its own "batteries" first, Grafana would have a full-featured alerting system, that can later evolve into a service for integration with 3rd party alerting tools.

j1n6 commented 9 years ago

Minor concern, if this hasn't been addressed. The traditional alerting system does not scale well for cloud infrastructure, because resources are very dynamic (provisioned & destroyed). Alerting on metrics should support a templating or grouping feature (with exception overrides, since workloads sometimes differ). A templated or grouped alert should be able to discover new members of the group.
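One way to picture a templated rule with overrides is the Python sketch below: a wildcard pattern selects the group, a default threshold applies to every match, and more specific patterns override it. The series names, thresholds and the `thresholds_for` helper are all hypothetical; a real implementation would more likely lean on the datasource's own wildcard support.

```python
from fnmatch import fnmatch


def thresholds_for(series_names, pattern, default, overrides=None):
    """Map each series matching `pattern` to its threshold.

    `overrides` maps a more specific glob to a threshold; the first
    override that matches a series wins over the default.
    """
    overrides = overrides or {}
    result = {}
    for name in series_names:
        if not fnmatch(name, pattern):
            continue
        result[name] = default
        for override_pattern, value in overrides.items():
            if fnmatch(name, override_pattern):
                result[name] = value
                break
    return result


known = ["web.web1.cpu.idle", "web.web2.cpu.idle", "db.db1.cpu.idle"]
# A newly provisioned host shows up; the same rule picks it up unchanged.
discovered = known + ["web.batch7.cpu.idle"]
rule = thresholds_for(
    discovered,
    pattern="web.*.cpu.idle",
    default=15,
    overrides={"web.batch*.cpu.idle": 5},
)
print(rule)
```

The batch host gets the lower threshold because its workload differs, while the database host is ignored since it does not match the group pattern.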

dahendel commented 9 years ago

Thanks for the update! In my use case built-in alerting in Grafana is all I would need at this time. I have been patiently impatiently waiting for Grafana alerting.

simmel commented 9 years ago

As I promised in IRC, here is our use case for this:

We have a legacy Rails app which searches our logs for patterns and has an HTTP API to answer whether a certain pattern has crossed its thresholds and thus has a status of {OK,WARNING,CRITICAL}.

Threshold can be either:

If I understand this feature correctly, Grafana will support this usage pattern (via Logstash and Elasticsearch of course) when this feature and the Elasticsearch data source is fully implemented?

mgravlin commented 9 years ago

@Dieterbe @mattttt your slides and mock-ups look absolutely amazing! This is truly a game changer. To me the internal Grafana alert handler would suit our needs the best. Reasons:

A few concerns:

sknolin commented 9 years ago

I think it's ideal to be able to integrate alerting into existing alert systems. There are some tough and ugly problems like flap detection as mentioned that have been dealt with and it seems wasteful to reinvent everything from the start. I'd hate to see this buried under the weight of feature creep.

But I don't think this really needs to be a tight integration into all of these alert handlers. A good, well documented api should allow people familiar with these systems to integrate with little effort. So the slide with 'grafana api -> handler' is what looks attractive to me.

Scott

jds86930 commented 9 years ago

Hi all -- I'm coming in late to this discussion, but I have some expertise on this topic, and I'm the lead developer of one of the tools that has attempted to solve the alerting problem. Our tool, StatsAgg, is comparable to programs like bosun. StatsAgg aims to cover flexible alerting, alert suspensions, and notifications & is pretty mature/stable at this point (although the API is not ready).

Anyway, some thoughts on the subject of alerting:

Those are the main things that I wanted to mention. There are lots of other important aspects to alerting too (notifications, suspensions, etc), but those solutions are easy to bolt on once one has the architecture of the core problem solved. I realize that the scale of our needs is not the same as the average user's, but hopefully you all can appreciate this perspective.

blysik commented 9 years ago

I'd like to suggest launching with a notification handler that can send data to Alerta: https://github.com/guardian/alerta

Alerta has a very sane REST API for receiving notifications.

kirbmeister commented 9 years ago

I prefer a lean grafana-only implementation! I think it's worth re-evaluating after everyone has had experience with this feature in the typical fantastic Grafana UX.

007reader commented 9 years ago

There will be many complex cases and/or custom backend systems people will want to integrate with. You can see many on this thread, mostly open source, but there are so many commercial products out there as well! Don't bother with individual handlers - it will be a rat hole and you will always be in catch-up mode.

I'd strongly advise implementing only two types of handlers. One is definitely HTTP POST; it will be the most versatile and flexible tool. The other one is custom scripting, so the users can implement integration with their specific tool of choice. A plugin model is not bad, but it forces you to use a specific plugin language, which is limiting. External scripts are better - as long as you pass all the details to a script, the script can be written in any language - shell script, Python, etc.
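A minimal Python sketch of the custom-script handler idea: the alert is serialized to JSON and passed to an external command on stdin, so the script itself can be written in any language. The alert field names and the inline demo script are assumptions; nothing here reflects an actual Grafana interface.

```python
import json
import subprocess
import sys


def run_script_handler(command, alert):
    """Run `command`, passing the alert as JSON on stdin; return its stdout."""
    completed = subprocess.run(
        command,
        input=json.dumps(alert),
        capture_output=True,
        text=True,
        check=True,   # raise if the script exits with a non-zero status
        timeout=10,   # do not let a hung script block the handler
    )
    return completed.stdout


# Stand-in for a user-supplied script; it could equally be a shell script.
demo_script = (
    "import json, sys; a = json.load(sys.stdin); print(a['name'], a['state'])"
)
alert = {"name": "cpu_idle", "state": "critical", "value": 1, "threshold": 15}
output = run_script_handler([sys.executable, "-c", demo_script], alert)
print(output.strip())
```

Passing the full alert as one JSON document keeps the contract between handler and script small: the script decides which fields it cares about.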

shanielh commented 9 years ago

I'm with @007reader

j1n6 commented 9 years ago

I agree. As long as common integration methods are provided, a custom implementation can be a separate project or deployment.

For example, the recent CloudWatch release is decent, however I would love to make it a separate project that only synchronizes selected metrics to alternative storage. It will give us full retention instead of only 2 weeks of data.

Dieterbe commented 9 years ago

hey everyone, my grafanacon presentation video is online! it's at https://www.youtube.com/watch?v=C_H2ew8e5OM i think it will clear up a lot, though as you can see, the specifics of integrations are still to be figured out and were also a topic a lot of people wanted to discuss. (though there was limited time and i asked people to continue the conversation here so everybody can participate)

@simmel yes exactly. you'd use an ES query and set a rule on that.

@activars re grouping and discovery, i think a lot of that will depend on your datasource, but most common requirements should be addressed if you use something like graphite or ES who i know are very good at "auto discovering" previously unseen metrics/series/documents that match the given expression (with wildcards) for graphite or query (for ES). not sure about the other sources. your comment about needing to apply exceptions to rules is a trickier one, we'll probably need to address that at some point but I think that can wait until things are clearer and more settled down. maybe we can avoid needing that somehow.

@mgravlin frequency will be a setting in the rule, dealing with too slow datasources, not sure yet. but don't do it ;-) . also handler dependent. scale-out deployment should be possible, also with included handler so pretty sure we'll look at that. but maybe not a priority for first release. yes, notification plugins will be key and we'll make sure you can use what you want/need. re api: yes :)

@sknolin if i understand correctly, in your view, grafana would do the alert scheduling, execution, trigger notification hooks etc, even when using another system like nagios/bosun. then what exactly would be the role of the external system (nagios/bosun/...) . this seems also similar to what @Crapworks was talking about.

@jds86930 StatsAgg looks quite interesting. I think here also an integration with grafana would make sense. I think stream processing is a valid approach that has a place as an alternative to repeated querying. but the latter is just simpler to get started with and just simpler in general I think. But both should be supported. So yes in grafana you will be able to set up patterns/queries that match a spectrum of data, and potentially cover new series/data as it becomes live. in our view though, you would just leverage whatever functionality your datasource has (for example graphite is pretty good at this with its wildcards, glob expressions etc, and elasticsearch rich data and query model), or if somebody would use grafana+StatsAgg they could just use StatsAgg to solve that. Are you saying grafana itself should do anything specific here? I think if your datasource is not fast enough, solve the datasource problem. get something faster, that has caching for metric metadata, maybe a memory server in front or stream processing. but either way as far as Grafana is concerned, not much would change that I can think of?

@blysik yes looks interesting. so many alerting tools out there we should integrate with :) to be clear, do you like the idea of managing alerting rules and visualizing them in grafana the way it's been presented so far, but you want to use alerta to take care of the notifications? would alerta be the primary place where you go look at the state of your alerts (that seems like a reasonable thing to do), but want to make sure i fully understand how you see the integration.

@007reader , @shanielh , @activars just to be clear, this integration via a generic HTTP post or script, what would be the goal. to tell the external system "there's a new rule, here's the query, the thresholds, frequency, etc. now go please execute it" ? or would grafana be the thing executing the rules and then updating the external systems with new state?

blysik commented 9 years ago

@blysik yes looks interesting. so many alerting tools out there we should integrate with :) to be clear, do you like the idea of managing alerting rules and visualizing them in grafana the way it's been presented so far, but you want to use alerta to take care of the notifications? would alerta be the primary place where you go look at the state of your alerts (that seems like a reasonable thing to do), but want to make sure i fully understand how you see the integration.

Correct. Alerta is shaping up to be our notifications hub. All sorts of things sending alerts into it. For example: Custom scripts, Cabot, Zenoss, vCenter, and maybe Grafana. That gives ops a single place to see all the alerts. And then that is the single place which drives notifications to the oncall engineer.

sknolin commented 9 years ago

@sknolin if i understand correctly, in your view, grafana would do the alert scheduling, execution, trigger notification hooks etc, even when using another system like nagios/bosun. then what exactly would be the role of the external system (nagios/bosun/...) . this seems also similar to what @Crapworks was talking about.

I guess I didn't explain well. That's not what I'd want, grafana not doing all that stuff. @Crapworks (that's fun to type) is talking passive service checks, I'd just use active polling.

So all I would want is an api where I can read grafana alerts status. External systems do everything else.

That doesn't mean if it didn't develop somehow into a great general alerting tool I wouldn't use it. Just what I'd do now.

Scott

Dieterbe commented 9 years ago

@sknolin

So all I would want is an api where I can read grafana alerts status.

how would that status be updated in grafana? what process would be executing alerts and updating the status in grafana?

sknolin commented 9 years ago

Each time it's polled grafana updates alert status, with some kind of caching interval to deal with multiple systems polling it.
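A rough Python sketch of that poll-with-caching idea: the status is evaluated on demand, and the result is reused until a caching interval expires, so several systems polling at once trigger a single evaluation. The `CachedStatus` class and the injectable clock are assumptions for illustration only.

```python
import time


class CachedStatus:
    """Evaluate an alert on demand, reusing the result for `ttl_seconds`."""

    def __init__(self, evaluate, ttl_seconds, clock=time.monotonic):
        self._evaluate = evaluate
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires = None  # None means nothing has been evaluated yet

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._evaluate()
            self._expires = now + self._ttl
        return self._value


calls = []


def evaluate():
    """Pretend query: 'ok' on the first evaluation, 'critical' afterwards."""
    calls.append(1)
    return "ok" if len(calls) < 2 else "critical"


fake_now = [0.0]  # a fake clock keeps the example deterministic
status = CachedStatus(evaluate, ttl_seconds=30, clock=lambda: fake_now[0])
first = status.get()   # evaluates
second = status.get()  # served from cache
fake_now[0] = 31.0
third = status.get()   # cache expired, evaluates again
print(first, second, third, len(calls))
```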

sknolin commented 9 years ago

I see the point that this still requires grafana to do logic for the alerts and present it. So thinking about it, no I don't need alerts of any kind.

I think I could do whatever alerting needed if I were able to retrieve the current value for a metric on a graph panel. For example where we derive a rate from the sum of several counter metrics and graph it, would be nice to poll for the current value with the monitoring system. Maybe that's totally doable now and I'm just being obtuse.

Scott

shanielh commented 9 years ago

@Dieterbe The latter :

grafana be the thing executing the rules and then updating the external systems with new state

jds86930 commented 9 years ago

@Dieterbe I agree that polling the datasource (Graphite, OpenTSDB, etc) using the datasource's native query syntax is simplest/easiest & is probably the quickest way to get some form of alerting natively into Grafana. For a lot of people, this sort of solution will meet their needs, and I think this is the best solution for the initial Grafana implementation (in my opinion). My main point was that there is a ceiling on alert configurability & performance that will be hard to work past with the 'polling the datasource' model.

In terms of directions Grafana could go for long-term alerting solutions, I could see a few options:

Crapworks commented 9 years ago

Hi all,

@Dieterbe I just watched your presentation and the stuff looks awesome. I really appreciate that you are trying to build an alerting system in the "right" way, using a lot of community input! Thanks!

I also want to make my point a bit clearer, since I don't think it was obvious what I was trying to say. I DO NOT require grafana to be aware of any other monitoring system such as Nagios, Icinga, Bosun, etc. I actually only need this:

The event-flow I am thinking of:

Like @mgravlin and @007reader mentioned, most notification and alerting services use HTTP POST, requiring different kinds of data. So the most generic thing I could think of is to let the user define their POST data and headers, so you can feed several systems with one handler, using different templates. Pseudo code example:

"notificator": {
    "httppost": {
        "data": {
            "host": "$hostname",
            "alert": "$alertname",
            "state": "$state"
        },
        "header": {
            "content-type": "application/json"
        }
    }
}

Giving the user enough variables to use here will be powerful enough to feed a ton of backends.
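As a sketch of how such a template could be rendered, the Python below fills the `$variables` with `string.Template` and builds (but does not send) the POST request. The config layout mirrors the pseudo code above; the URL, the variable values and the `build_request` helper are hypothetical.

```python
import json
import urllib.request
from string import Template

config = {
    "notificator": {
        "httppost": {
            "data": {
                "host": "$hostname",
                "alert": "$alertname",
                "state": "$state",
            },
            "header": {"content-type": "application/json"},
        }
    }
}


def build_request(url, config, variables):
    """Fill the data template with alert variables and build a POST request."""
    post = config["notificator"]["httppost"]
    data = {
        key: Template(value).substitute(variables)
        for key, value in post["data"].items()
    }
    return urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers=post["header"],
        method="POST",
    )


req = build_request(
    "http://alerts.example.invalid/hook",  # hypothetical receiver URL
    config,
    {"hostname": "web123", "alertname": "cpu_idle", "state": "critical"},
)
# urllib.request.urlopen(req) would actually send it; omitted on purpose.
print(req.get_method(), req.data.decode("utf-8"))
```

`Template.substitute` raises `KeyError` for a variable that is not supplied, which is probably what you want for a misconfigured template.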

And again, with that kind of handler, any sysadmin with some coding knowledge can build their own http post receiver and transform the data however they like, for example to feed backends that don't understand http post.

Since this is stateless, it also scales out. Just place a load balancer in front of the backend/api/whatever and you are good to go.

At least, this would solve most/nearly all of my problems ;)

Cheers

javiermarcon commented 9 years ago

Thanks for building this feature. Is there an approximate release date for it?

kokarn commented 9 years ago

torkelo said ROUGHLY 3 months on IRC. If I understood him correctly that's a really rough estimate and should be treated as such.

doanerock commented 9 years ago

I am excited for the ability to do alerting with grafana. I think that this is the one feature that is keeping grafana from being the ultimate monitoring tool.

j1n6 commented 9 years ago

If you have an early alpha/beta release, I would love to test & give early feedback with production data.

anryko commented 9 years ago

++

lebernardo commented 9 years ago

Me 2 lol

+1

On Mon, Nov 16, 2015 at 21:03, Jing Dong notifications@github.com wrote:

If you have a early alpha/beta release, I would love to test & give early feedback with production data.


chaosong commented 9 years ago

+1 If you have an early alpha/beta release, I would love to test & give early feedback with production data.

lebernardo commented 9 years ago

+1 me 2

2015-11-21 14:44 GMT-02:00 chaosong notifications@github.com:

+1 If you have an early alpha/beta release, I would love to test & give early feedback with production data.


yuankui commented 9 years ago

+1

Dieterbe commented 9 years ago

great to see all the +1's but FWIW they're not really needed. we already know it's the most eagerly awaited new feature, and once we have tangible progress, the code will appear in a separate branch that anyone can play with. BTW we're also bringing in more people to work full-time on grafana so stay tuned everyone :)

vchekan commented 9 years ago

Yes please, there are 484 people "watching" this issue. Each time you "+1" it, 484 people get email notification. Just subscribe to notifications and that will be an indication of your interest in the issue.

simmel commented 9 years ago

+1 >; P

On Mon, 2015-11-30 at 10:33:52 -0800, Vadim Chekan wrote:

Yes please, there are 484 people "watching" this issue. Each time you "+1" it, 484 people get email notification.

gsaray101 commented 9 years ago

Sorry, I know you guys are working very hard on this. Is there any timeline for the first release?

anlutro commented 9 years ago

I would be more than happy with being able to configure alerting metrics and thresholds (either through the web interface or API) and a Grafana cronjob/daemon that checks these metrics and does a HTTP POST with JSON or invokes a script with JSON in stdout. It would be extremely simple for individuals to write a simple python script that passes this information on to Pagerduty, Slack, IRC, SMS, email or whatever.
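For a sense of how small such a script could be, here is a Python sketch of the receiving side: it parses the alert JSON and formats a one-line message that could then be handed to Slack, IRC, email or anything else. The field names (`name`, `state`, `value`, `threshold`) are assumptions, since no payload format has been defined.

```python
import json


def format_message(raw_json):
    """Turn an alert JSON document into a one-line human-readable message."""
    alert = json.loads(raw_json)
    return "[{state}] {name}: value {value} (threshold {threshold})".format(
        state=alert["state"].upper(),
        name=alert["name"],
        value=alert["value"],
        threshold=alert["threshold"],
    )


# In a real script the JSON would come from the request body or from stdin.
message = format_message(
    '{"name": "cpu_idle", "state": "critical", "value": 1, "threshold": 15}'
)
print(message)
```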

While I would highly appreciate the convenience, I don't think it's Grafana's job to tightly integrate with third-party tools, and would rather see a minimalist implementation sooner than a well fleshed out one later.

bemeyert commented 9 years ago

I completely agree with @anlutro . Please provide something simple to get started. The most interesting thing for me is to enable the people to set simple alerts themselves (self-service). Grafana shouldn't try to replace existing alerting/escalating solutions.

elvarb commented 9 years ago

I agree with @anlutro as well. But instead of just providing a simple API, I'd rather have the alerting part be able to handle custom plugins that interact with the API. That way the base package could include email, pagerduty and a few others, and the community could add to it as needed. Similar to how Logstash plugins are handled now.

MachineCity commented 9 years ago

+1

mrh666 commented 9 years ago

Any news on alerting system? Any estimates?

amitgilad3 commented 9 years ago

+1