grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

building alerting system for grafana #2209

Closed Dieterbe closed 8 years ago

Dieterbe commented 9 years ago

Hi everyone, I recently joined raintank and I will be working with @torkelo, @mattttt , and you, on alerting support for Grafana.

From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana. I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us: we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.

First of all, terminology sync:

I want to spec out requirements, possible implementation ideas and their pro's/cons. With your feedback, we can adjust, refine and choose a specific direction.

General thoughts:

That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")

The raintank/grafana version currently has an alerting package with a simple scheduler, a worker bus (in-process as well as rabbitmq-based), an alert executor and email notifications. It uses the bosun expression libraries, which give us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc). This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform, but notably still missing are

  1. an interface to create and manage alerting rules
  2. state management (acknowledgements etc)

these are harder problems, which I hope to tackle with your input.

Requirements, Future implementations

First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization). You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time and backtest on historical data, so you can get them just right. And it has a good state machine. In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of Golang api, but then we have less fine-grained control, and for now I feel more comfortable trying it out piece by piece (piece meaning golang package) and making the integration decision on a case by case basis. The integration may look different down the road, based on experience and as we figure out what we want our alerting to look like.

Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:

  1. some visualized metrics (metrics plotted on graphs) are not alerted on
  2. some visualized metrics are alerted on:
    • A: with simple threshold checks: easy to visualize alerting logic
    • B: with more advanced logic: (e.g. look at standard deviation of the series being plotted, compare current median against historical median, etc): can't easily be visualized next to the input series
  3. some metrics used in alerting logic are not to be visualized

Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts (A), and V and A have some overlap. I need to think about this a bit more and wonder what y'all think. There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.

There's a few more complications which I'll explain through an example sketch of how alerting could look like: sketch

let's say we have a timeseries for requests (A) and one for erroneous requests (B) and this is what we want to plot. we then use fields C,D,E to put stuff that we don't want to alert on. C contains the formula for ratio of error requests against the total.

we may for example want to alert (see E) if the median of this ratio in the last 5min is more than 1.5 times what the ratio was in the same 5-minute period last week, and also if the errors seen in the last 5min are worse than the errors seen from 2 months ago until 5min ago.
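To make the first condition concrete, here is a minimal sketch in Python. It is not the bosun expression syntax; the function names and the plain-list inputs are assumptions for illustration only, and it assumes both windows contain at least one point with non-zero requests.

```python
from statistics import median


def error_ratio(requests, errors):
    """Per-point ratio of erroneous requests to total requests (series C)."""
    # Points with zero requests are skipped to avoid dividing by zero.
    return [e / r for r, e in zip(requests, errors) if r > 0]


def ratio_regressed(now_requests, now_errors, past_requests, past_errors, factor=1.5):
    """True if the median error ratio now exceeds `factor` times the median
    ratio of the comparison window (e.g. the same 5 minutes last week)."""
    # statistics.median raises StatisticsError on an empty list, so callers
    # are assumed to pass windows that actually contain data.
    now = median(error_ratio(now_requests, now_errors))
    past = median(error_ratio(past_requests, past_errors))
    return now > factor * past
```

The second condition (last 5min versus the last 2 months) would follow the same shape with a different comparison window.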

notes:

other ponderings:

I think for an initial implementation every graph could have two fields, like so:

warn: - expression
      - notification settings (email, http hook, ..)
crit: - expression
      - notification settings

where the expression is something like what I put in E in the sketch. for logic/data that we don't want to visualize, we just toggle off the visibility icon. grafana would replace the variables in the formulas and execute the expression (with the current bosun based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
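A rough sketch of what such a per-graph definition and the variable replacement step could look like, in Python. Everything here is hypothetical: the `$C` placeholder syntax, the dictionary layout and the `expand` helper are assumptions for illustration, not an existing grafana or bosun format.

```python
from string import Template

# Hypothetical per-graph alert definition with the two fields described above.
panel_alert = {
    "warn": {
        "expression": "median($C) > 0.05",
        "notifications": ["email"],
    },
    "crit": {
        "expression": "median($C) > 0.10",
        "notifications": ["email", "http hook"],
    },
}


def expand(expression, queries):
    """Replace $A, $B, ... with the query text behind each field letter."""
    # Template.substitute raises KeyError if a referenced field is missing.
    return Template(expression).substitute(queries)


def expanded_rules(alert, queries):
    """Return the expressions per level, ready to hand to an executor."""
    return {level: expand(cfg["expression"], queries) for level, cfg in alert.items()}
```

For example, `expanded_rules(panel_alert, {"C": "divideSeries(errors,requests)"})` would yield one fully expanded expression for warn and one for crit.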

Thoughts? Do you have concerns or needs that I didn't address?

linkslice commented 9 years ago

I'd love to help out with this! My suggestion would be to stick with the nagios-style guidelines. That way the tools could easily be used with other monitoring tools. e.g. Nagios, Zenoss, Icinga, etc..

torkelo commented 9 years ago

The biggest thing about this feature is getting the basic architecture right.

Some questions I would like to explore: 1) What components are required and how are they run (in proc in grafana, out of proc)? 2) How should things be coordinated? 3) Should we ignore "in stream" alerting (only focus on pull based)?

Going more in depth into 1) I am worried about making grafana-server into a monolith. Would like to find a way to separate grafana-server into services that are more isolated from each other (and can be run either inproc or as separate processes). This was kind of the plan with the bus abstraction. Another option would be to have the alerting component only speak to grafana via the HTTP api, might limit integration, not sure.

dahendel commented 9 years ago

I agree with torkelo. In my experience with other projects with everything "built-in" it can get quite cumbersome to troubleshoot. I like the idea of the service running externally, but a nice config page in grafana that talks to the service through the HTTP api to handle managing all the alerts. Also, for large scale deployments this would probably end up being a requirement as performance would eventually degrade (I would at least have this as a configuration option).

do we integrate with current grafana graph threshold settings (which are currently for viz only, not for processing) ? if the expression is a threshold check, we could automatically display a threshold line

I think that could be a good place to start. Alert if it's set, don't if it's not.

Back to number 1. I think that if the bosun service could run separately but still have the ability to completely configure everything through grafana that would be, in my opinion, ideal.

Keep up the awesome work.

Jhors2 commented 9 years ago

The only shortcoming I have seen with bosun is the data sources it can use. If you could leverage the language for expressing bosun alerting but also integrate with existing data sources that are configured via the regular grafana UI it would certainly be ideal.

Being able to represent alerting thresholds, when you are close to them, as well as automatically push annotations for when they have triggered in my mind make an ideal single pane UI.

Looking forward to the work that will be done here!

damm commented 9 years ago
  1. It should use the thresholds defined in the Dashboard to alert on. Let's keep it simple; if the Dashboard shows the color for warning it should be alerting.
  2. This would likely be something outside of the grafana-server process itself.
    ... Something that would use the rest api to scrape the dashboards and their settings, render them, and alert using an external command.
  3. Alerting level; just a box to drop in the editor that this Dashboard should be monitored; and it should be checked every minute. If there's no data for a period, should it still alert? (checkbox)

Lastly; as we depend on Grafana more I admit i'm willing to say 2. could be something i'd be willing to pay for.

dennisjac commented 9 years ago

I'm curious why people think this should be included into Grafana at all? Grafana neither receives nor stores that actual data but "only" visualizes it. Any alerting system should instead be based on the data in the metric store. If this is really integrated into Grafana I hope this can be disabled because over here we already use Icinga for alerting so any kind of alerting in Grafana would only clutter the GUI more even though it wouldn't be used at all.

damm commented 9 years ago

Absolutely correct @dennisjac; Grafana only renders things.

But as we've moved things server side it's no longer just client rendering; having a worker process that could check your metrics and alert is less difficult.

Data is in a database; provided it's sprinkled with the data that tells it to check the metric ...

Some people may agree or disagree that we should not cross the streams and make Grafana do more than visualize it (roughly) but I'm not them.

dennisjac commented 9 years ago

I'm not really opposed to the feature for people who want it to be integrated but I hope it will be made optional for people who already have monitoring/alerting systems available.

The new Telegraf project (metric collector from the influxdb guys) is also looking at monitoring/alerting features, which I dislike for the same reason. I elaborated on this here: https://influxdb.com/blog/2015/06/19/Announcing-Telegraf-a-metrics-collector-for-InfluxDB.html#comment-2114821565

damm commented 9 years ago

I think torkelo has done a really good job at giving us features in Grafana2 that we don't have to enable.

As far as influxdb they're going to have to make some money somehow; either off of support of influxdb and professional services or products for it.

The latter sounds much more viable

elvarb commented 9 years ago

Another angle on this. There seems to be upcoming support for elasticsearch as a metric storage for grafana. Bosun can right now query elasticsearch for log data.

Would it make sense when designing the alerting system to allow for alerts from log data as well? Maybe not a feature for the first version, but something that can be implemented later.

Also I agree with the idea of splitting the processes. Have Grafana the interface to view and create alerts, have something else handle the alerting. Having the alerting part api based would also allow other tools to interface with it.

deebs031 commented 9 years ago

+1 to Alerting. Outside DevOps usage, applications built for end users need to provide user defined alerts. Nice to have it in the visualization tool...

j1n6 commented 9 years ago

+1 this will close the loop - the purpose of getting metrics.

AdrianParente commented 9 years ago

+1 Alerting from Grafana + a Horizontally Scaling Backend from InfluxDB will make them the standard to beat for Metrics Alerting Configurations

lesaux commented 9 years ago

+1 I'd love horizontal scaling of the alerting on multiple grafana nodes.

rsetzer commented 9 years ago

It would be great if one could associate a "debounce" like behavior with an alert. For example, I want to fire an alert only if the defined threshold exceeds X for N minutes.
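A minimal sketch of that "debounce" idea in Python, assuming evenly spaced samples so that "N minutes" maps to a number of consecutive points. The function name and parameters are hypothetical.

```python
def exceeded_for(values, threshold, min_points):
    """True only if the last `min_points` samples are all above `threshold`.

    `values` is assumed to be ordered oldest to newest and evenly spaced,
    e.g. one sample per minute, so min_points=5 means "for 5 minutes".
    """
    if min_points < 1:
        raise ValueError("min_points must be at least 1")
    if len(values) < min_points:
        # Not enough history yet to say the condition has held long enough.
        return False
    return all(v > threshold for v in values[-min_points:])
```

A single spike in the window does not fire, and neither does a series that dipped back under the threshold within the last `min_points` samples.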

I have seen this with some of the alerting tools, unfortunately we are currently using Seyren which doesn't appear to provide such an option. We are using Grafana for our dashboard development and are looking forward to pulling the alerting into Grafana as well. Keep up the good work.

j1n6 commented 9 years ago

We have two use cases:

We would love to have a unified alerting system that handles alerts, flap detection, escalation and contacts. That helps us record and correlate events/operations in the same source of truth. A lot of systems have solved the alerting problem. I hope Grafana can do better at this in the long term; in the short term, not reinventing existing systems would be helpful in terms of deliverables.

One suggestion is Grafana can provide API for extracting monitoring definition (alerting state), third party can contribute configuration export plugins. This would be very ideal in our use case exporting nagios configuration.

More importantly, I would love to see some integrated anomaly detection solution too!


falkenbt commented 9 years ago

I agree with @activars. I don't really see why a dashboard solution should handle alerting which is a more or less solved problem by lots of other tools, mostly quite mature. Do one thing and do it well.

IMHO it would make more sense to focus on the integration part.

Example: Define dynamic warn/crit thresholds in grafana (e.g. like in @Dieterbe example above) and provide an API (REST?) that returns the state (normal, warn, crit) of exactly this graph. A nagios, icinga, bosun etc. could request all the "monitoring" enabled graphs (another API feature), iterate through the individual states and do the necessary alerting.
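As a rough illustration of that split, here is a Python sketch of the state classification and of what an external tool might do with it. No such API exists in the thread; the function names, the dictionary layout and the "higher is worse" assumption are all hypothetical.

```python
def graph_state(value, warn, crit):
    """Classify a value as normal/warn/crit, assuming higher is worse
    and warn <= crit."""
    if value >= crit:
        return "crit"
    if value >= warn:
        return "warn"
    return "normal"


def graphs_to_alert_on(graphs):
    """What a nagios/icinga-style poller might do: compute the state of
    every monitoring-enabled graph and keep only the non-normal ones."""
    states = {name: graph_state(**g) for name, g in graphs.items()}
    return {name: state for name, state in states.items() if state != "normal"}
```

The notification logic (who to email, flapping, escalation) would stay entirely in the external tool.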

In our case service catalogs and defined actions are the hard part - which service is how business critical, where to send emails to, flapping etc. Also you would not have to worry about user / group management in grafana which most companies already have in a central place (AD, LDAP, Crowd etc.) and integrated with the alerting system.

Also we have to consider that unlike a dashboard solution the quality requirements for an alerting tool can be considered much higher in terms of reliability, resilience, stability etc. which creates (testing) effort that shouldn't be underestimated.

Also what about non-timeseries related checks, like calling a webservice, pinging a machine, running custom scripts...would you want that in grafana as well? I guess the bosun adoption would provide all this but I'm not really familiar with it.

On the other hand I can imagine how a simple alerting system would make a lot of users happy that don't have a good alternative in place, but this could maybe be resolved with some example integration patterns for other alerting tools.

bigkraig commented 9 years ago

As much as I want Grafana to solve all of my problems, I think falkenbt hit the nail on the head with this one.

An API to expose the mentioned data, some plumbing in bosun, and some integration patterns with common alerting platforms makes a lot of sense.

jemilsson commented 9 years ago

Congratulations on your new job at raintank @Dieterbe! I have been reading your blog for a while and you have some really sound ideas on monitoring, particularly regarding metrics and its place in alerting. I am confident that you will find a good way implementing alerting in grafana.

As you would probably agree, the people behind Bosun are pretty much doing alerting the right way. The thing lacking in Bosun is really the visualizations. I would like to see Bosun behind the Grafana UI. Combining Grafana's dashboards and Bosun's alerting behind the same interface would make for an awesome and complete monitoring solution.

Also i think it would be a shame to fragment the open source monitoring community further, your ideas on monitoring seem to be really compatible with the ideas of the people behind Bosun. If you would unite i am sure the result would be great.

Where i work we are using Elastic for logs/events and have just begun using InfluxDB for metrics. We have been exploring different solutions for monitoring and are currently leaning towards Bosun. We are already using Grafana for dashboards, but would like to access all our monitoring information through the same interface, it would be great if Grafana could become that interface.

Keep up the great job, and good luck!

sudharsh commented 9 years ago

On a related tangent, we got the alerting part working by integrating grafana with riemann. Was a nice exercise getting to know the internals of grafana :).

This was easier with riemann as the config is just clojure code. I imagine this integration is going to be harder in Bosun.

Here are a couple of screenshots of it in action:

[three screenshots, taken 2015-07-21]

The changes to the grafana part included adding an "/alerts" and a "/subscriptions" endpoint and have it talk to another little api that sits on top for riemann to do the crud.

The nice thing is the fact that the changes in the alert definitions are reflected immediately without having to send a SIGHUP to riemann. So enabling/disabling, time period tweaks for state changes is just a matter of changing it in the UI and have that change propagate itself to riemann.

Still haven't benchmarked this integration but I don't think it's going to be that bad. Will blog about it after I cleanup the code and once it goes live.

The whole reason we did this was because people can just go ahead and set these alerts and notifications from a very familiar UI and not bother us to write riemann configs :).

dinoshauer commented 9 years ago

@sudharsh your implementation sounds really interesting. Are you planning on releasing this to the wild?

Dieterbe commented 9 years ago

lots of good ideas, thanks everyone. Inspired by some of the comments and @pabloa's https://github.com/pabloa/grafana-alerts project we decided to focus first and foremost on the UI and UX for configuring and managing alerting rules as part of the same workflow of editing dashboards and panels. Grafana would save those rules somewhere and provide easy access to them so that other scripts or tools can fetch the alerting rules. Perhaps via a file, an API call, a section in the dashboard config, or an entry in the database. (I like the idea of having it as part of the dashboard definition itself, so that open source projects can come with grafana dashboard json files for them which would have alerting rules included though not necessarily active by default. on the other hand having them in a database seems more robust) Either way, we want to provide easy access so you can generate configuration for whatever other system you want to use that actually executes the alerting rules and processes the events. (from here on I'll refer to this as a "handler"). Such a handler could be nagios, or sensu, or bosun, a tool that you write, or the litmus alert scheduler-executor, which is a handler that you could compile into grafana and which provides a nice and simple integration backed by bosun, but we really want to make sure you can use whatever system you want.

As long as your handler supports querying the datastore you use. we would start off with simple static threshold but later also want to make it easy to choose reduction functions, boolean expressions between multiple conditions, etc.
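To illustrate the "rules live in the dashboard definition, a script fetches them for a handler" idea, here is a small Python sketch. The `alerting` key on a panel is purely hypothetical (no such field is defined in this thread), and the rows/panels layout is an assumption about the dashboard json.

```python
import json


def extract_rules(dashboard_json):
    """Collect alerting rules from a dashboard definition so that a script
    can turn them into configuration for whatever handler is in use."""
    dashboard = json.loads(dashboard_json)
    rules = []
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            alerting = panel.get("alerting")  # hypothetical key, see above
            if alerting:
                rules.append({
                    "dashboard": dashboard.get("title"),
                    "panel": panel.get("title"),
                    **alerting,
                })
    return rules
```

A handler-specific exporter (nagios config, bosun rules, ...) would then only need to translate this flat list.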

@sudharsh that is a very nice approach. I like how your solution can talk directly to a remote API, bypassing the intermediate step described above (of course this does imply it only works for 1 given backend which we try to avoid), and that it can automatically reload the configuration. (you're right, bosun currently does not support it, it might in the future. FWIW the litmus handler does handle this fine and it uses bosun's expression evaluation mechanism). I never really got into riemann much. Mostly I've been concerned about adding such a different language to the stack that not many people understand or can debug when things go wrong. But I'm very curious to learn more about your system and about Riemann's CLJ code. (I'd love it if my suspicions are incorrect)

@dennisjac yes it would be optional.

@elvarb there is a ticket for ES as a datasource. in fact the goal is that if grafana supports rendering data from a given datasource it should also support composing alerting rules for it. As for query execution/querying this of course depends on what handler you decide to use. (for the litmus handler we'll start out with the most popular ones like graphite and influxdb)

@rsetzer : agreed, it's a good thing to be able to specify how long a threshold should be exceeded before we trigger

@falkenbt : I believe many things can be phrased as a timeseries querying problem (for example the pings example). But you're right, some things aren't really timeseries related and those are out of scope for what we're trying to build here. And I think that's OK: we want to provide the best way to configure and manage alerting on timeseries and aim for integration with other systems that are perhaps more optimized for the "misc scripts" case (such as nagios, icinga, sensu, ...). As for concerns such as reliability of delivery, escalations etc, you could hook in a service such as pagerduty.

@activars & @falkenbt does this seem to match your expectation or what do you think could be improved specifically?

@jemilsson thank you! and that's exactly how i see it: bosun is great at alerting but not good at visualization. Grafana is great at visualization and UX but has no alerting. I'm trying to drive a collaboration which will grow over time I think

Does anyone have any thoughts on what kind of context to ship in notifications like emails? At the very least, the notification should contain a viz of the data you're alerting on, but it should imho be possible to include other related graphs. Here we could use grafana's png rendering backend when generating the notification content. I'm also thinking about leveraging grafana's snapshot feature. like when an alert triggers, take a snapshot of a certain dashboard for context. and maybe that snapshot (html page) could be included in the email, or that might be a bit too much data/complexity. also the javascript features would be unavailable in mail clients anyway (so you wouldn't be able to zoom on graphs in an email). Perhaps we could link from the email to a hosted dashboard snapshot.

felixbarny commented 9 years ago

I like the general approach of docker - batteries included, but removeable. So a basic alerting implementation that can be swapped out would be a good approach imho.

JulienChampseix commented 9 years ago

influxdb will be supported for alerting ? or only graphite ?

nickman commented 9 years ago

One thing I would like to see is the idea of hierarchical alert trees. There's simply too many facets being monitored and stand alone alert states have an unmanageable cardinality. With a hierarchy tree, I can define all these low level alerts which roll up to medium level alerts which roll up to high level ......

As such, each rolled up alert automatically assumes the highest severity of all the children below it. In that way, I can get an impression of [and manage] system health accurately with a much lower surface area of analysis.

This is an example I have borrowed from an old document I wrote a while ago. Yes, please chuckle away at the use of the word "Struts". It's OLD ok ? This presents a very simple hierarchy for one server.

image

At some point, the server experiences sustained 75% CPU utilization, so this trips these alerts into a warning state: CPU-# --> CPU --> Host/OS --> System

image

If one really applied themselves, one could keep an eye on an entire data center with one indicator. (yeah, not really, but this serves as a thought exercise)

image
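The roll-up described above can be sketched in a few lines of Python. The nested-dict tree layout, the state names and the `rolled_up` helper are assumptions for illustration; the node names follow the CPU-# --> CPU --> Host/OS --> System example.

```python
SEVERITY = {"ok": 0, "warn": 1, "crit": 2}


def rolled_up(node):
    """State of a node: the most severe of its own state and the
    rolled-up states of all its children."""
    states = [node.get("state", "ok")]
    states += [rolled_up(child) for child in node.get("children", [])]
    return max(states, key=SEVERITY.__getitem__)


system = {
    "name": "System",
    "children": [{
        "name": "Host/OS",
        "children": [
            {"name": "CPU", "children": [
                {"name": "CPU-0", "state": "warn"},
                {"name": "CPU-1", "state": "ok"},
            ]},
            {"name": "Memory", "state": "ok"},
        ],
    }],
}
```

Here `rolled_up(system)` evaluates to "warn" because of CPU-0, while the Memory subtree on its own stays "ok".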

ahrajabi commented 9 years ago

Why not use graphite-beacon? I think you can merge graphite-beacon, which is very light, with grafana.

Dieterbe commented 9 years ago

@felixbarny I like that terminology. we'll likely adopt that wording.

@JulienChampseix yes the standard handler would/will support influxdb

@nickman that's interesting. it actually falls in line with the end-goal we have in mind, of being able to create very highlevel alerts that can include / depend on more fine grained alerting rules and information. bosun already does this, and long term we want to make this functionality available through a more user friendly interface, but we have to start more simple than this.

@amirhosseinrajabi looks like a cool project and I think making graphite-beacon into a handler for the alerts configured through the grafana UI would make a lot of sense.

JulienChampseix commented 9 years ago

@Dieterbe is it possible to have an update on the current status of the alerting system, in order to know which systems are compatible (graphite/influxdb)? which subscriptions are available? which alert types are available? thanks for your update.

Dieterbe commented 9 years ago

we're currently prototyping the UX/UI. so we're quite a ways from this being usable.

amitgilad3 commented 9 years ago

Hi @Dieterbe

are there any updates on the progress of the alerting system?

MichaelScofield commented 9 years ago

It would be awesome to have alerting in Grafana! Looking forward to this feature. Any updates now?

Dieterbe commented 9 years ago

@mattttt can you provide an update re your UX work?

mattttt commented 9 years ago

Yes, absolutely. Will upload some screens/user flows tomorrow.

andyl commented 9 years ago

We need alerting: UI for rule definition, API for rule definition, and API for alert notifications. Will watch this thread with interest. We have a multi-tenant system and love the Grafana UI and back-end.

stef609 commented 9 years ago

Yes, I'm also very interested and impatient for seeing this new feature! Thanks a lot Matt ! ;)


mattttt commented 9 years ago

There's a lot of items falling into place internally, but I didn't want to leave this thread neglected.

Here's one of the mockups of a panel I've been working on. This illustrates the historical health over time, incorporating the status into the tooltip and using existing thresholds defined in the panel config to configure alerting.

In this example, this is alerting on a single query w/ multiple series. Tooltips are extended to show the status at the hover-time.

image

_A couple small outstanding questions_: How much information about the alert notification should go into the tooltip, if any - or should this information be accessed elsewhere in a more detailed view? I believe the latter at this time, but it's worth asking aloud.

Configuration, alerting screens, user flows are forthcoming slowly. Lots to come.

torkelo commented 9 years ago

@mattttt nice!

elvarb commented 9 years ago

Love the green and red line below the chart!

That ties to uptime calculations, would love to be able to see that as a number somewhere. Totals for all queries and for each metric.

About the tooltip are you talking about the stats that come up when you hover over the lines?

lynchc commented 9 years ago

Damn @mattttt that looks awesome. I wouldn't even worry about putting anything in the tooltip. The threshold line and alert status health bar at the bottom are more than enough.

Can't wait to see this when it's done!

IzakMarais commented 9 years ago

I am excited to see this is progressing well!

We currently use Grafana+Bosun+OpenTSDB as our monitoring&alerting stack. I definitely agree it would be awesome to have the power of Bosun with the great UX of Grafana.

Here is an example of where Grafana's configuration UX is better than Bosun's:

Context

Our monitoring stack is shared between multiple teams and their services. A different set of services is deployed to different clusters/sites based on project specifications. Each team/service should take responsibility for its own dashboards/alerts.

Comparison

With Grafana's HTTP API teams can PUT their own dashboards there when deploying their service. Bosun currently only has a single file for storing configuration; this makes it difficult to share between different teams and across different projects.

JulienChampseix commented 9 years ago

@mattttt @torkelo @Dieterbe any idea on a release date for the alerting piece (or a beta release)?

dahendel commented 9 years ago

echo ^. Is there a beta or alpha release for this? I am researching alerting solutions, but I would love to have something built in grafana. I could provide a lot of testing feedback.

torkelo commented 9 years ago

Alerting feature is still some months in the future, we are still prototyping the UI and thinking about different ways to implement it, but progress should move more quickly in the next 2 months so we will know more then

baaym commented 9 years ago

@mattttt do you intend to make the colors of the historical health bar configurable? Green and red don't really go well with the color blind ;)

Regarding the alerting: I'm quite interested in how this plays out. We've been collecting and visualizing data for a while now, and alerting is something we're currently trying to figure out. Grafana could have a nice place there, especially since the visualizations are already in place. What I do wonder though: how much should Grafana be more aware of 'entities' rather than metric series for alerting? I can imagine myself wanting to automatically toggle a visual state change (ping or http check fails) or manually (doing maintenance) for what in my case would be a server, in addition to metric based alerting.

phrend commented 9 years ago

I'm interested to see where alerting in Grafana goes, but for those of you that need something now, there are nagios plugins like https://exchange.nagios.org/directory/Plugins/System-Metrics/Others/check_graphite_metric/details that can trigger alerts when thresholds are crossed.

Dieterbe commented 9 years ago

@baaym

What I do wonder though: how much should Grafana be more aware of 'entities' rather than metric series for alerting? I can imagine myself wanting to automatically toggle a visual state change (ping or http check fails) or manually (doing maintenance) for what in my case would be a server, in addition to metric based alerting.

this is a good question and also something we've been talking about a bit. the solution we want to go with for the near (and possibly also long) term is to make grafana not aware of such higher level concepts. i.e. as a user you have the power to set alerts on metric series, and from those alerting rules, alerting outcomes will be generated (likely including attributes or tags from the series names) from which you can then construct whatever entities you please. this is something we have to think about a bit more, but for example

say you set an alert along the lines of movingAverage(cluster1.web-*.cpu.idle,10) < 20 -> warn. this would verify the threshold on all your web servers in the given cluster, and generate alerts for any violations such as movingAverage(cluster1.web-123.cpu.idle,10) is currently 3!. possibly we could enable you to say "the first field is the cluster name, the 2nd is the hostname" etc, so that alerts can contain nicer information. but the point is, the alerting outcome contains the information you need to identify which entity is having issues, but it falls outside of grafana's scope. Grafana would be more the source of the configuration of alert rules, and grafana dashboards could be configured to load annotations and what-have-you to visualize state of the alerts, but it wouldn't have a notion of highlevel concepts such as hosts or clusters. I think this is something that could be handled in the alerting handlers
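A small Python sketch of that idea: check a wildcard rule against every matching series and attach tags derived from the series name to each violation. The names here are hypothetical, the values are assumed to already be the moving averages, and note that `fnmatchcase` lets `*` match dots as well, which is looser than a graphite-style glob.

```python
from fnmatch import fnmatchcase


def violations(pattern, latest, threshold, fields=("cluster", "host")):
    """Return one alert outcome per series that matches `pattern` and whose
    value is below `threshold`, tagged with the leading name fields."""
    outcomes = []
    for name, value in sorted(latest.items()):
        if fnmatchcase(name, pattern) and value < threshold:
            # "the first field is the cluster name, the 2nd is the hostname"
            tags = dict(zip(fields, name.split(".")))
            outcomes.append({"series": name, "value": value, **tags})
    return outcomes
```

A handler receiving these outcomes can group them by `host` or `cluster` and build whatever entities it likes, without grafana knowing about either concept.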

j1n6 commented 9 years ago

@Dieterbe

There are two types of user/organization concerns when building an alerting feature:

Grafana should work with existing, well-established operational practices; keeping it outside that cycle ignores the goal of alerting: having a clear view of the health of business-critical entities. Alerting is better centralized, so that you can build a clear picture of the state of the environment. It would be critical to let power users export alerting rules to other systems through the Grafana API (or any other solution).

By operational I mean that each alert should optionally contain a documentation/link field that explains the purpose of the alert and its historical behaviour.

Dieterbe commented 9 years ago

@activars i think i agree with all of that. in my view we are taking an approach that fosters plugging grafana into the rest of the environment (mainly thanks to separation of concerns, with pluggable handlers). do you think the proposed design can be improved in any way?

jaimegago commented 9 years ago

I think @deebs031 makes a good point that hasn't been addressed much: "applications built for end users need to provide user defined alerts". IMHO the holy grail is self service metrics based monitoring. In my case, with Grafana being the main front end for folks that want to look at metrics, it would make sense to enable them to create alerts for themselves within the same -let's face it AWESOME- UI. I've personally done Sensu alerting based on metrics, but providing it as a self service is really not a piece of cake compared to how seamless it'd be if integrated with Grafana. I've also looked at Cabot because it had visualization capabilities, but it hasn't been built with self service in mind so I couldn't use it as is. I'm on the "do one thing well" side, but I think in the particular case of self service alerts based on metrics, pairing that capability with the metrics visualization layer makes a lot of sense:

Dieterbe commented 9 years ago

slides of my grafanacon presentation about alerting: http://www.slideshare.net/Dieterbe/alerting-in-grafana-grafanacon-2015 they're kinda hard to understand without context, the video should be online in about a week, i'll post it when it's ready.

we've now started prototyping ways to implement the alerting models/UI/definitions/etc. we have a pretty good idea of the main workflow, though 1 big point we're still trying to figure out is what integration with 3rd party alerting handlers should look like. our current thinking is that you will be able to use grafana to set thresholds/alerting rules/define notifications and to visualize the historical and current state of the alerting rules.

the assumption is that you want to use your alerting software of choice (bosun/cabot/sensu/nagios/...) so there would be a separate tool that queries grafana over its http API to retrieve all the alerting rules. that tool could then update your bosun/cabot/sensu/nagios/... configuration, so that you can use your alerting software of choice to run and execute the alerts and send out the notifications. but we want to be able to properly visualize current and historical state, so either your alerting program would need to be able to call a script or a webhook or something to inform grafana of the new state, or grafana would have to query for it. (which seems yucky, given most tools don't seem to have great api's) this is all a bit complicated but it would have to be this way to support people for whom it is important that they stay able to use their alerting software of choice, while using grafana for defining the alerting rules and visualizing the state.
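A rough sketch of the two halves of such a bridge, under heavy assumptions: the rule fields (`id`, `name`, `host`, `query`, `warn`, `crit`), the state names and the payload shape are invented for illustration, since no grafana API for alerting rules is defined yet. The first function turns one rule into a Nagios service definition fragment; the `check_graphite_metric` command name is borrowed from the plugin mentioned earlier in this thread and would still need a matching command definition (and whatever other directives your Nagios setup requires). The second builds the JSON body a webhook could send back to report new state.

```python
import json

# Hypothetical set of states a handler may report back.
VALID_STATES = {"ok", "warn", "crit", "unknown"}


def render_nagios_service(rule):
    """Render one (hypothetical) rule dict as a Nagios service definition fragment."""
    # Nagios separates check_command arguments with '!'.
    check = "!".join([
        "check_graphite_metric",      # assumed command name, defined by the user
        rule["query"],
        str(rule["warn"]),
        str(rule["crit"]),
    ])
    lines = [
        "define service {",
        "    host_name            " + rule["host"],
        "    service_description  " + rule["name"],
        "    check_command        " + check,
        "}",
    ]
    return "\n".join(lines) + "\n"


def build_state_update(rule_id, state, value):
    """Build the JSON body a handler could send to report a rule's new state."""
    if state not in VALID_STATES:
        raise ValueError("unknown alert state: %r" % state)
    return json.dumps({"rule_id": rule_id, "state": state, "value": value},
                      sort_keys=True)
```

The sync tool would fetch the rules, write one fragment per rule into the alerting software's config and reload it; the alerting software would then call a script that posts `build_state_update(...)` output to wherever grafana ends up accepting it.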

Is it important to you that you're able to use your current alerting tool of choice? what would be your alerting tool of choice?

the other thing we'd like to do, is also write a simple alert executor ourselves, that queries the grafana api for alerts, schedules and executes them, does the notifications (it would support email, slack, pagerduty, a custom script, and probably a few others) and updates the state in grafana again. it would be fairly easy to write for us, easy for you to use and we could have great interoperability.
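As a minimal sketch of what one pass of such an executor could look like, assuming rules have already been fetched: `Rule`, `run_due_rules`, `fetch_value` and `notify` are hypothetical names, the `states` dict stands in for the state that would be written back to grafana, and notifying only on state changes is a design choice of this sketch rather than anything decided above.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    id: int
    query: str
    warn_below: float      # rule fires when the value drops below this
    interval: int          # seconds between evaluations
    next_run: float = 0.0  # timestamp at which the rule is next due


def run_due_rules(rules, now, fetch_value, notify, states):
    """Evaluate every rule that is due at `now`; notify only when its state changes.

    fetch_value(query) -> float and notify(rule, state, value) are injected,
    so email/slack/pagerduty/custom-script senders can all share this loop.
    Returns the ids of rules whose state changed.
    """
    changed = []
    for rule in rules:
        if now < rule.next_run:
            continue                      # not scheduled yet
        rule.next_run = now + rule.interval
        value = fetch_value(rule.query)
        new_state = "warn" if value < rule.warn_below else "ok"
        if states.get(rule.id, "ok") != new_state:
            notify(rule, new_state, value)
            changed.append(rule.id)
        states[rule.id] = new_state       # stand-in for updating state in grafana
    return changed
```

A real executor would call this in a loop with the current time and would also need to handle query failures and missing data, which are left out here.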

Is the built-in alerting executor (see above) something that you think would be sufficient to handle all the alerting rules you set up in grafana? or do you feel strongly that you want to keep using nagios/bosun/... and that we have to build a bridge between them?

also do you want to be able to use multiple alerting handlers? say built-in + bosun + nagios, or something? which ?

@jaimegago amen ;)