For me number 2 seems better in that you can really minimize the number of things you have to configure for everything to work smoothly. In our current setup we would go with that.
Just so it has been said, in case everybody else disagrees ;)
Quick edit: awesome slides. If the end product comes out looking half as good as that, it'll be amazing.
+1. I agree that an internal notification handler with these integrations is perfect for the most common use cases.
I'll be happy to do beta testing :) and the slides are amazing!
I think @Dieterbe's last post clears things up quite a bit, but I wanted to post this quick diagram to further clarify.
Alerting in Grafana is really two things: the self-service alert config (thanks @jaimegago, couldn't have said it better myself) and the handler itself.
We'll be shipping a Grafana alert handler, but also providing the framework to integrate with your alerting software of choice:
+1 for building a kind of bridge to other alerting systems (maybe we could think about implementing some generic alerting plugin system :-) )
You could add Prometheus
Also, on the "External Alert Handlers" part: the first Prometheus alertmanager version is in production in several companies, and a complete rewrite is currently in the works. SoundCloud might use Grafana to configure alerts, but very certainly only if the Prometheus alertmanager is used as the alert handler.
@grobie, good catch, fixed in original comment.
@mattttt @Dieterbe that's great! Looks like you're on the path of "batteries included but removable", which is IMHO the best of both worlds. Have you already thought about how you are going to pass authorization data to the alert handler? I'm thinking of a story like this: as a Grafana user I'd like to be alerted via email and/or pagerduty and/or foo when (some condition built via the Grafana alerting UI) happens. That user should only be able to send alerts to the notification systems he is authorized for; this is a requirement for self-service and will need to be addressed some way, somehow. Since Grafana 2 we have a SQL backend plus user authentication/authorization with LDAP integration, so it doesn't seem too far-fetched to have that capability from day one of alerting? With Sensu (which is the tool I'd be plugging in), passing the alert target (e.g. email address) via the handler should be quite straightforward; can't say about the others.
Hi all, I am happy to see that this effort is getting pushed forward, since I love the self-service alert configuration approach.
Personally, I don't need a specific alert handler. I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit). So I would be happy with an HTTP handler that sends all information about an alert as JSON data.
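To make that concrete, the payload such a generic handler sends could look something like the sketch below. The field names, endpoint URL, and `AlertEvent` type here are illustrative assumptions, not a proposed Grafana schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// AlertEvent is a hypothetical shape for the JSON a generic HTTP handler could send.
type AlertEvent struct {
	Name      string    `json:"name"`      // e.g. "cpu idle on web123"
	State     string    `json:"state"`     // OK | WARNING | CRITICAL
	Value     float64   `json:"value"`     // the evaluated value
	Threshold float64   `json:"threshold"` // the threshold that was crossed
	Query     string    `json:"query"`     // the expression that was evaluated
	Timestamp time.Time `json:"timestamp"`
}

// postAlert serializes the event and POSTs it to a user-configured endpoint.
func postAlert(url string, ev AlertEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	ev := AlertEvent{
		Name:      "cpu idle on web123",
		State:     "CRITICAL",
		Value:     1,
		Threshold: 15,
		Query:     "stats.web123.cpu.idle",
		Timestamp: time.Now(),
	}
	// The receiver URL is a placeholder for whatever endpoint you run yourself.
	if err := postAlert("http://alerts.example.com/grafana-hook", ev); err != nil {
		log.Println(err)
	}
}
```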
Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?
Keep up the good work guys!
Cheers
> I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit)
so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15" ) and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?
> Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?
this is also something we need to think about more. This can get messy, and if people use something like pagerduty or flapjack then they can use that to aggregate events/suppress duplicates, so we're looking at whether we can avoid implementing that in the grafana handler, though we may have to. Note also that because you'll be able to set alerts on arbitrary metrics query expressions, you'll have a lot of power to take past data into account in the actual expression, and so you can create a more robust signal in the expression that doesn't change state as often.
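To illustrate the kind of damping a handler could do if we end up needing it (purely a sketch, not a committed design): only report a state change after it has been observed for N consecutive evaluations.

```go
package main

import "fmt"

// Damper suppresses flapping by only reporting a state change
// after it has been observed `need` evaluations in a row.
type Damper struct {
	current string // last reported state
	pending string // candidate new state
	count   int    // consecutive observations of pending
	need    int    // observations required before switching
}

// Observe feeds in one evaluation result and returns the reported state
// plus whether the reported state just changed.
func (d *Damper) Observe(state string) (reported string, changed bool) {
	if state == d.current {
		d.pending, d.count = "", 0
		return d.current, false
	}
	if state == d.pending {
		d.count++
	} else {
		d.pending, d.count = state, 1
	}
	if d.count >= d.need {
		d.current, d.pending, d.count = state, "", 0
		return d.current, true
	}
	return d.current, false
}

func main() {
	d := &Damper{current: "OK", need: 3}
	for _, s := range []string{"CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"} {
		state, changed := d.Observe(s)
		fmt.Println(s, "->", state, changed)
	}
}
```

A single isolated CRITICAL evaluation never reaches the notification layer with this approach; whether that logic belongs in grafana or in the external handler is exactly the open question above.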
> so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15") and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?
Kind of. I'm actually looking forward to grafana alerting so I can get rid of nagios. Using the HTTP handler, you would need to configure passive checks in nagios so the results can be submitted there. But I would like to have grafana as the one place where you configure alerting. In our case, the people who are allowed to add alerts are the actual sysadmins, who would also configure the checks in nagios.
With the HTTP handler, grafana would have everything we need for that: a dashboard for real-time monitoring, an API, easy alerting configuration, and an HTTP handler where we can forward alerts to our internal notification system.
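For the nagios side of such a setup, the bridge can be tiny: the receiver just appends passive check results to the nagios external command file. A rough sketch; the command-file path and the host/service names are assumptions that vary per installation.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// submitPassiveCheck appends a PROCESS_SERVICE_CHECK_RESULT line to the
// nagios external command file, which nagios picks up as a passive result.
// Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return code>;<output>
func submitPassiveCheck(cmdFile, host, service string, code int, output string) error {
	f, err := os.OpenFile(cmdFile, os.O_WRONLY|os.O_APPEND, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n",
		time.Now().Unix(), host, service, code, output)
	return err
}

func main() {
	// Forward a CRITICAL (return code 2) alert received from grafana; the
	// command-file path and host/service names depend on your installation.
	err := submitPassiveCheck("/var/lib/nagios3/rw/nagios.cmd",
		"web123", "cpu_idle", 2, "cpu idle 1 lower than threshold 15")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```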
Cheers
Although I can see the logic in this integration strategy, I cannot help but think it's a little bit overkill. From what I understand (and what I could read in the thread), the only reason most Grafana users keep using a standalone alerting technology is that Grafana doesn't offer one. So, wouldn't it be more "lean" to focus first on the Grafana Alerting part, and develop it as a component that communicates with the rest of the stack through the API, so that other contributors would be able to mimic the behaviour and create specific adapters later?
tl;dr: by focusing on building its own "batteries" first, Grafana would have a full-featured alerting system, that can later evolve into a service for integration with 3rd party alerting tools.
Minor concern, if this hasn't been addressed: the traditional alerting model does not scale well for cloud infrastructure, because resources are very dynamic (provisioned and destroyed). Alerting on metrics should support a templating or grouping feature (with exception overrides; sometimes workloads are different). A templated or grouped alert should be able to discover new members of the group.
Thanks for the update! In my use case built-in alerting in Grafana is all I would need at this time. I have been patiently impatiently waiting for Grafana alerting.
As I promised in IRC, here is our use case for this:
We have a legacy Rails app which searches our logs for patterns and has an HTTP API to answer whether a certain pattern has crossed its thresholds and thus has a status of {OK, WARNING, CRITICAL}.
A threshold can be either:
- status is CRITICAL if the pattern exists at all;
- status is WARNING if the pattern is found more than X times and CRITICAL if it is found more than Y times;
- if the pattern is less than 1 hour old the status is OK, less than 3 hours old the status is WARNING, and otherwise the status is CRITICAL.
If I understand this feature correctly, Grafana will support this usage pattern (via Logstash and Elasticsearch of course) once this feature and the Elasticsearch data source are fully implemented?
@Dieterbe @mattttt your slides and mock-ups look absolutely amazing! This is truly a game changer. To me the internal Grafana alert handler would suit our needs the best. Reasons:
A few concerns:
I think it's ideal to be able to integrate alerting into existing alert systems. There are some tough and ugly problems, like the flap detection already mentioned, that have been dealt with elsewhere, and it seems wasteful to reinvent everything from the start. I'd hate to see this buried under the weight of feature creep.
But I don't think this really needs to be a tight integration into all of these alert handlers. A good, well-documented API should allow people familiar with these systems to integrate with little effort. So the slide with 'grafana api -> handler' is what looks attractive to me.
Scott
Hi all -- I'm coming in late to this discussion, but I have some expertise on this topic, and I'm the lead developer of one of the tools that has attempted to solve the alerting problem. Our tool, StatsAgg, is comparable to programs like bosun. StatsAgg aims to cover flexible alerting, alert suspensions, and notifications, and is pretty mature/stable at this point (although the API is not ready).
Anyway, some thoughts on the subject of alerting:
Those are the main things that I wanted to mention. There are lots of other important aspects to alerting too (notifications, suspensions, etc), but those solutions are easy to bolt on once the architecture of the core problem is solved. I realize that the scale of our needs is not the same as the average user's, but hopefully you all can appreciate this perspective.
I'd like to suggest launching with a notification handler that can send data to Alerta: https://github.com/guardian/alerta
Alerta has a very sane REST API for receiving notifications.
I prefer a lean grafana-only implementation! I think it's worth re-evaluating after everyone has had experience with this feature in the typical fantastic Grafana UX.
There will be many complex cases and/or custom backend systems people will want to integrate with. You can see many on this thread, mostly open source, but there are so many commercial products out there as well! Don't bother with individual handlers; it will be a rat hole and you will always be playing catch-up.
I'd strongly advise implementing only two types of handlers. One is definitely HTTP POST; it will be the most versatile and flexible tool. The other is custom scripting, so users can implement integration with their specific tool of choice. A plugin model is not bad, but it forces the use of a specific plugin language, which is limiting. External scripts are better: as long as you pass all the details to the script, it can be written in any language (shell script, Python, etc.).
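A script-based handler along those lines might simply hand the alert details to the user's script as JSON on stdin, roughly like the sketch below; the script path and payload fields are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"os/exec"
)

// runNotifyScript executes a user-supplied script and feeds it the alert
// details as JSON on stdin, so the script can be written in any language.
func runNotifyScript(script string, alert map[string]interface{}) error {
	payload, err := json.Marshal(alert)
	if err != nil {
		return err
	}
	cmd := exec.Command(script)
	cmd.Stdin = bytes.NewReader(payload)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return err
	}
	log.Printf("script output: %s", out)
	return nil
}

func main() {
	alert := map[string]interface{}{
		"host":  "web123",
		"alert": "cpu idle",
		"state": "CRITICAL",
		"value": 1,
	}
	// The script path is hypothetical; point it at whatever wrapper you maintain.
	if err := runNotifyScript("/etc/grafana/notify.d/pagerduty.sh", alert); err != nil {
		log.Fatal(err)
	}
}
```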
I'm with @007reader
I agree. As long as common integration methods are provided, custom implementations can be separate projects or deployments.
For example, the recent CloudWatch release is decent; however, I would love to see it as a separate project that only synchronizes selected metrics to alternative storage. That would give us full retention instead of only 2 weeks of data.
hey everyone, my grafanacon presentation video is online! it's at https://www.youtube.com/watch?v=C_H2ew8e5OM. i think it will clear up a lot, though as you can see the specifics of integrations are still to be figured out and were also a topic a lot of people wanted to discuss (though there was limited time and i asked people to continue the conversation here so everybody can participate).
@simmel yes exactly. you'd use an ES query and set a rule on that.

@activars re grouping and discovery, i think a lot of that will depend on your datasource, but most common requirements should be addressed if you use something like graphite or ES, which i know are very good at "auto discovering" previously unseen metrics/series/documents that match the given expression (with wildcards) for graphite or query (for ES). not sure about the other sources. your comment about needing to apply exceptions to rules is a trickier one; we'll probably need to address that at some point but i think that can wait until things are clearer and more settled down. maybe we can avoid needing that somehow.

@mgravlin frequency will be a setting in the rule. dealing with too-slow datasources: not sure yet, but don't do it ;-) . also handler dependent. scale-out deployment should be possible, also with the included handler, so pretty sure we'll look at that, but maybe not a priority for the first release. yes, notification plugins will be key and we'll make sure you can use what you want/need. re api: yes :)

@sknolin if i understand correctly, in your view grafana would do the alert scheduling, execution, trigger notification hooks etc, even when using another system like nagios/bosun. then what exactly would be the role of the external system (nagios/bosun/...)? this seems also similar to what @Crapworks was talking about.

@jds86930 StatsAgg looks quite interesting. I think here also an integration with grafana would make sense. I think stream processing is a valid approach that has a place as an alternative to repeated querying, but the latter is just simpler to get started with and just simpler in general i think. But both should be supported. So yes, in grafana you will be able to set up patterns/queries that match a spectrum of data, and potentially cover new series/data as it becomes live. in our view though, you would just leverage whatever functionality your datasource has (for example graphite is pretty good at this with its wildcards, glob expressions etc, and elasticsearch has a rich data and query model), or if somebody would use grafana+StatsAgg they could just use StatsAgg to solve that. Are you saying grafana itself should do anything specific here? I think if your datasource is not fast enough, solve the datasource problem: get something faster, with caching for metric metadata, maybe an in-memory server in front or stream processing. but either way, as far as Grafana is concerned, not much would change that i can think of?

@blysik yes, looks interesting. so many alerting tools out there we should integrate with :) to be clear, do you like the idea of managing alerting rules and visualizing them in grafana the way it's been presented so far, but you want to use alerta to take care of the notifications? would alerta be the primary place where you go look at the state of your alerts (that seems like a reasonable thing to do)? just want to make sure i fully understand how you see the integration.
@007reader, @shanielh, @activars just to be clear: for this integration via a generic HTTP POST or script, what would be the goal? to tell the external system "there's a new rule, here's the query, the thresholds, frequency, etc.; now please go execute it"? or would grafana be the thing executing the rules and then updating the external systems with new state?
> @blysik yes, looks interesting. so many alerting tools out there we should integrate with :) to be clear, do you like the idea of managing alerting rules and visualizing them in grafana the way it's been presented so far, but you want to use alerta to take care of the notifications? would alerta be the primary place where you go look at the state of your alerts (that seems like a reasonable thing to do)? just want to make sure i fully understand how you see the integration.
Correct. Alerta is shaping up to be our notifications hub. All sorts of things sending alerts into it. For example: Custom scripts, Cabot, Zenoss, vCenter, and maybe Grafana. That gives ops a single place to see all the alerts. And then that is the single place which drives notifications to the oncall engineer.
> @sknolin if i understand correctly, in your view grafana would do the alert scheduling, execution, trigger notification hooks etc, even when using another system like nagios/bosun. then what exactly would be the role of the external system (nagios/bosun/...)? this seems also similar to what @Crapworks was talking about.
I guess I didn't explain well. That's not what I'd want; I don't need grafana doing all that stuff. @Crapworks (that's fun to type) is talking about passive service checks; I'd just use active polling.
So all I would want is an api where I can read grafana alerts status. External systems do everything else.
That doesn't mean I wouldn't use it if it somehow developed into a great general alerting tool. It's just what I'd do now.
Scott
@sknolin
> So all I would want is an api where I can read grafana alerts status.
how would that status be updated in grafana? what process would be executing alerts and updating the status in grafana?
Each time it's polled, grafana updates the alert status, with some kind of caching interval to deal with multiple systems polling it.
I see the point that this still requires grafana to do logic for the alerts and present it. So thinking about it, no I don't need alerts of any kind.
I think I could do whatever alerting is needed if I were able to retrieve the current value for a metric on a graph panel. For example, where we derive a rate from the sum of several counter metrics and graph it, it would be nice to poll for the current value with the monitoring system. Maybe that's totally doable now and I'm just being obtuse.
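For what it's worth, a monitoring system can already poll the datasource with the same expression a panel uses, for example against Graphite's render API. A sketch; the Graphite host and the target expression are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// renderResult matches Graphite's /render?format=json output: each datapoint
// is a [value, timestamp] pair, and value may be null.
type renderResult struct {
	Target     string        `json:"target"`
	Datapoints [][2]*float64 `json:"datapoints"`
}

// currentValue returns the most recent non-null value of a Graphite target.
func currentValue(graphite, target string) (float64, error) {
	u := fmt.Sprintf("%s/render?target=%s&from=-5min&format=json",
		graphite, url.QueryEscape(target))
	resp, err := http.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var results []renderResult
	if err := json.NewDecoder(resp.Body).Decode(&results); err != nil {
		return 0, err
	}
	if len(results) > 0 {
		dps := results[0].Datapoints
		for i := len(dps) - 1; i >= 0; i-- { // walk back to the newest non-null point
			if dps[i][0] != nil {
				return *dps[i][0], nil
			}
		}
	}
	return 0, fmt.Errorf("no data for %s", target)
}

func main() {
	// The host and the derived-rate expression are placeholders.
	v, err := currentValue("http://graphite.example.com",
		"divideSeries(sumSeries(stats.web*.errors),sumSeries(stats.web*.requests))")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("current value:", v)
}
```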
Scott
@Dieterbe The latter:
> grafana be the thing executing the rules and then updating the external systems with new state
@Dieterbe I agree that polling the datasource (Graphite, OpenTSDB, etc) using the datasource's native query syntax is simplest/easiest & is probably the quickest way to get some form of alerting natively into Grafana. For a lot of people, this sort of solution will meet their needs, and I think this is the best solution for the initial Grafana implementation (in my opinion). My main point was that there is a ceiling on alert configurability & performance that will be hard to work past with the 'polling the datasource' model.
In terms of directions Grafana could go for long-term alerting solutions, I could see a few options:
Hi all,
@Dieterbe I just watched your presentation and the stuff looks awesome. I really appreciate that you are trying to build an alerting system in the "right" way, using a lot of community input! Thanks!
I also want to make my point a bit clearer, since I don't think it was obvious what I was trying to say. I DO NOT require grafana to be aware of any other monitoring system such as Nagios, Icinga, Bosun, etc. I actually only need this:
The event-flow I am thinking of:
Like @mgravlin and @007reader mentioned, most notification and alerting services use HTTP POST, requiring different kinds of data. So the most generic thing I could think of is to let the user define their POST data and headers, so you can feed several systems with one handler, using different templates. Pseudo-code example:
"notificator": {
"httppost": {
"data": {
"host": "$hostname",
"alert": "$alertname",
"state": "$state"
},
"header": {
"content-type": "application/json"
}
}
}
Giving the user enough variables to use here will be powerful enough to feed a ton of backends.
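The substitution side of such a handler is also small. Here is a sketch of rendering those variables and POSTing the result; the target URL is a placeholder, and real code would take the header from the config above too.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// renderTemplate substitutes the user-facing variables ($hostname, $alertname,
// $state, ...) into the user-defined body template from the pseudo-config.
func renderTemplate(tmpl string, vars map[string]string) string {
	pairs := make([]string, 0, len(vars)*2)
	for k, v := range vars {
		pairs = append(pairs, "$"+k, v)
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	tmpl := `{"host":"$hostname","alert":"$alertname","state":"$state"}`
	body := renderTemplate(tmpl, map[string]string{
		"hostname":  "web123",
		"alertname": "cpu idle",
		"state":     "CRITICAL",
	})
	fmt.Println(body)

	// POST it with the user-configured content type (the URL is a placeholder).
	resp, err := http.Post("http://notify.example.com/alerts", "application/json",
		strings.NewReader(body))
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}
```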
And again, with that kind of handler, any sysadmin with some coding knowledge can build their own HTTP POST receiver and transform the data however they like, to feed, for example, backends that don't understand HTTP POST.
Since this is stateless, it also scales out. Just place a load balancer in front of the backend/api/whatever and you are good to go.
At least, this would solve most/nearly all of my problems ;)
Cheers
Thanks for building this feature. Is there an approximate release date for it?
torkelo said ROUGHLY 3 months on IRC. If I understood him correctly that's a really rough estimate and should be treated as such.
I am excited for the ability to do alerting with grafana. I think that this is the one feature that is keeping grafana from being the ultimate monitoring tool.
If you have an early alpha/beta release, I would love to test & give early feedback with production data.
++
Me 2 lol
+1
+1 If you have an early alpha/beta release, I would love to test & give early feedback with production data.
+1 me 2
+1
great to see all the +1's but FWIW they're not really needed. we already know it's the most eagerly awaited new feature, and once we have tangible progress, the code will appear in a separate branch that anyone can play with. BTW we're also bringing in more people to work full-time on grafana so stay tuned everyone :)
Yes please. There are 484 people "watching" this issue; each time you "+1" it, 484 people get an email notification. Just subscribe to notifications and that will be indication enough of your interest in the issue.
+1 >; P
Sorry, I know you guys are working very hard on this. Is there any timeline for the first release?
I would be more than happy with being able to configure alerting metrics and thresholds (either through the web interface or API) and a Grafana cronjob/daemon that checks these metrics and does an HTTP POST with JSON or invokes a script with JSON on stdin. It would be extremely simple for individuals to write a simple python script that passes this information on to Pagerduty, Slack, IRC, SMS, email or whatever.
While I would highly appreciate the convenience, I don't think it's Grafana's job to tightly integrate with third-party tools, and I would rather see a minimalist implementation sooner than a well fleshed-out one later.
I completely agree with @anlutro . Please provide something simple to get started. The most interesting thing for me is to enable the people to set simple alerts themselves (self-service). Grafana shouldn't try to replace existing alerting/escalating solutions.
I agree with @anlutro as well. But instead of just providing a simple API, I'd rather have the alerting part able to handle custom plugins that interact with the API. That way the base package could include email, pagerduty and a few others, and the community could add to it as needed, similar to how Logstash plugins are handled now.
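A plugin contract for that could be as small as a single interface; the names below are hypothetical and just show the shape of the idea, not a proposed Grafana API.

```go
// Package notifier sketches what a notification-plugin contract might look like.
package notifier

// Alert is a minimal, hypothetical description of a fired alert.
type Alert struct {
	Name    string
	State   string // OK | WARNING | CRITICAL
	Value   float64
	Message string
}

// Notifier is the contract a notification plugin would implement;
// email, pagerduty, a generic webhook, etc. would each be one implementation.
type Notifier interface {
	Notify(a Alert) error
}
```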
+1
Any news on alerting system? Any estimates?
+1
Hi everyone, I recently joined raintank and I will be working with @torkelo, @mattttt , and you, on alerting support for Grafana.
From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana. I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us: we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.
First of all, terminology sync:
I want to spec out requirements, possible implementation ideas and their pro's/cons. With your feedback, we can adjust, refine and choose a specific direction.
General thoughts:
The integrations are a long-term effort. I think the low-hanging fruit ("meet 80% of the needs with 20% of the effort") can be met with a system that is more closely tied to Grafana, i.e. compiled into the grafana binary. That said, a lot of people confuse separation of concerns with "must be different services". If the code is sane, it'll be decoupled packages, but there's nothing necessarily wrong with compiling them together: you could run the alerting engine in-process with Grafana, or build the same packages into a separate process later.
That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")
Current state
The raintank/grafana version currently has an alerting package with a simple scheduler, an in-process worker bus as well as a rabbitmq-based one, an alert executor, and email notifications. It uses the bosun expression libraries, which give us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc). This package is currently raintank-specific but we will merge a generic version of it into upstream grafana. This will provide an alert execution platform, but some pieces are notably still missing;
these are harder problems, which I hope to tackle with your input.
Requirements, Future implementations
First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization). You can make your alerting rules as advanced as you want, and it enables you to fine-tune them over time and backtest on historical data, so you can get them just right. And it has a good state machine. In theory we could just compile bosun straight into grafana and leverage bosun via its REST api instead of its Go api, but then we have less fine-grained control, and for now I feel more comfortable trying things out piece by piece (piece meaning Go package) and making the integration decision on a case-by-case basis. The integration may look different down the road based on experience and as we figure out what we want our alerting to look like.
Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:
Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts on (A), and V and A have some overlap. I need to think about this a bit more and wonder what y'all think. There will definitely need to be one central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.
There are a few more complications, which I'll explain through an example sketch of what alerting could look like:
let's say we have a timeseries for requests (A) and one for erroneous requests (B), and this is what we want to plot. we then use fields C, D, E for stuff that we don't want to plot. C contains the formula for the ratio of error requests against the total.
we may, for example, want to alert (see E) if the median of this ratio over the last 5 minutes is more than 1.5 times what the ratio was in the same 5-minute period last week, and also if the errors seen in the last 5 minutes are worse than the errors seen from 2 months ago until 5 minutes ago.
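Spelled out, that rule boils down to something like the following sketch. `median` here is a hypothetical helper standing in for the actual datasource query, and the series names refer to the fields above; none of this is a committed rule format.

```go
package main

import (
	"fmt"
	"time"
)

// median is a hypothetical helper standing in for a datasource query: it would
// return the median of the given series/expression over [now+from, now+until).
func median(expr string, from, until time.Duration) float64 {
	// the real implementation would query graphite/ES/the bosun executor here
	return 0
}

func shouldAlert() bool {
	const ratio = "C"  // error ratio (errors / requests)
	const errors = "B" // raw error series
	week := 7 * 24 * time.Hour

	// median ratio over the last 5 minutes vs the same window one week ago
	nowRatio := median(ratio, -5*time.Minute, 0)
	lastWeek := median(ratio, -week-5*time.Minute, -week)

	// errors in the last 5 minutes vs the long-term baseline (~2 months back)
	recent := median(errors, -5*time.Minute, 0)
	baseline := median(errors, -60*24*time.Hour, -5*time.Minute)

	return nowRatio > 1.5*lastWeek && recent > baseline
}

func main() {
	fmt.Println("fire alert:", shouldAlert())
}
```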
notes:
other ponderings: what about templated queries such as `stats.$site.requests` and `stats.$site.errors`, where we want to have separate alert instances for every site (but only set up the rule once)? what if we only want it for a select few of the sites? what if we want different parameters based on which site? bosun actually supports all these features, and we could expose them, though we should probably build a UI around them.
I think for an initial implementation every graph could have two extra fields, where the expression is something like what I put in E in the sketch. for logic/data that we don't want to visualize, we just toggle off the visibility icon. grafana would replace the variables in the formulas and execute the expression (with the current bosun-based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
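For the templated case, instantiating the rule once per value of `$site` could look roughly like this; `AlertRule` and the expression syntax are stand-ins for illustration, not a proposed format.

```go
package main

import (
	"fmt"
	"strings"
)

// AlertRule is a minimal stand-in for a templated rule definition.
type AlertRule struct {
	Name string
	Expr string
}

// expand instantiates the rule once per value of the $site template variable.
func expand(rule AlertRule, sites []string) []AlertRule {
	out := make([]AlertRule, 0, len(sites))
	for _, s := range sites {
		out = append(out, AlertRule{
			Name: rule.Name + " (" + s + ")",
			Expr: strings.Replace(rule.Expr, "$site", s, -1),
		})
	}
	return out
}

func main() {
	rule := AlertRule{
		Name: "error ratio too high",
		Expr: "stats.$site.errors / stats.$site.requests > 0.05",
	}
	for _, r := range expand(rule, []string{"us-east", "eu-west"}) {
		fmt.Println(r.Name, "->", r.Expr)
	}
}
```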
Thoughts? Do you have concerns or needs that I didn't address?