shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

Adding a new daemon: the analyst #1238

Closed aviau closed 10 years ago

aviau commented 10 years ago

Hello :)

I am proposing the addition of a new optional daemon to Shinken, the analyst.

This new component would allow for the following functionalities:

The analyst would greatly enhance Shinken's alerting capabilities and would also simplify alerting on passive results.

I would like to propose myself as the main developer for this new daemon. I would be working on this at Savoir-Faire Linux, where we already have two other Shinken specialists and contributors.

naparuba commented 10 years ago

That's an interesting idea, and in fact I have been thinking about such a rule system for some months.

But after looking at some points, it does not fit the Shinken way of thinking.

In fact, the whole of Shinken is designed to minimise inter-daemon interaction: daemons communicate through a diff bus (broks), objects that are linked together are put in the same scheduler (the arbiter's role), and no history is managed (everything is in memory).

Shinken is like Nagios: it is made to manage statuses, not metrics. There are tools for that, and I don't think it's a good thing to have a daemon that does not follow the same logic as all the others.

But that's about philosophy, let's look at real use cases now :)

We need such perfdata lookups for some cases. That's what triggers were made for. The main problem is that they live in the scheduler, which has a limited view of the objects (only a part of the hosts), and that's why I never really loved the triggers.

One other key problem you will have to tackle, which I don't see in your post, is how you will distribute the load between your daemons. Rule assignments won't be enough for large environments.

What you really need are in fact triggers (your Python functions, which could later be Lua) in a metrology tool. It does not have to be in a monitoring tool; it only has to send alerts/states back to a monitoring tool.
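A minimal sketch of what such a trigger could look like outside the monitoring tool, assuming a simple mean-over-threshold rule. The function names, host, service and thresholds are hypothetical; the state codes are the usual Nagios ones, and the result is formatted as a standard `PROCESS_SERVICE_CHECK_RESULT` external command that could then be pushed to whatever accepts passive results.

```python
import time
from statistics import mean

# Nagios-style state codes, which Shinken also uses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_rule(samples, warn, crit):
    """Hypothetical rule: compare the mean of recent samples to thresholds."""
    if not samples:
        return UNKNOWN, "no data"
    avg = mean(samples)
    if avg >= crit:
        state = CRITICAL
    elif avg >= warn:
        state = WARNING
    else:
        state = OK
    return state, "mean=%.2f over %d samples" % (avg, len(samples))


def passive_result(host, service, state, output, now=None):
    """Format a state as a PROCESS_SERVICE_CHECK_RESULT external command."""
    ts = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, state, output)


# Example values only; in practice the samples would come from the metric store.
state, output = evaluate_rule([91.0, 95.0, 99.0], warn=80, crit=90)
print(passive_result("web01", "cpu_load", state, output, now=1400000000))
```

The rule itself never touches the monitoring tool's objects; it only produces a state and a line of output, which is the separation described above.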

Your daemon is basically a Graphite with rules that generates perfdata into Graphite as well (like the Graphite aggregator?) and gets alerts back through a receiver. You can propose that they add such alert levels, or maybe do it in collectd, which already has such capabilities.
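For illustration, here is a rough sketch of turning a perfdata string into lines of the Graphite plaintext protocol (`path value timestamp`). The function name and the `host.service.label` path layout are assumptions, and the parser is deliberately simplified: it ignores units and thresholds and does not handle quoted labels that contain spaces.

```python
import re

# label=value[UOM], optionally followed by ;warn;crit;min;max (ignored here).
PERF_RE = re.compile(r"'?([^'=]+)'?=([-+]?\d+(?:\.\d+)?)")


def perfdata_to_graphite(host, service, perfdata, timestamp):
    """Hypothetical helper: one Graphite plaintext line per perfdata metric."""
    lines = []
    for token in perfdata.split():
        m = PERF_RE.match(token)
        if m is None:
            continue  # skip anything that does not look like label=value
        label, value = m.groups()
        # Dots separate path components in Graphite, so escape them in names.
        path = ".".join(p.replace(".", "_") for p in (host, service, label))
        lines.append("%s %s %d" % (path, value, timestamp))
    return lines


for line in perfdata_to_graphite(
        "web01", "cpu", "load1=0.42;1;2;0; load5=0.30", 1400000000):
    print(line)
```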

Maybe you can switch from Graphite to Elasticsearch if you only look at fairly recent data and you want scalability, but it's up to you :)

For the rule engine, you can look at Riemann. It's neither Python nor Lua, but it's still cool :+1: and you already have a ready-to-run event machine. If I'm not wrong it is in Clojure (and so are the rules). It will also save you a lot of lines of code :godmode:

One good thing on the Shinken side could be a scheduler module that allows check_commands to grab data from the backend, so you won't have to manage the reinsertion and the name mapping (which can always be hard in the passive->active logic).

Good luck with your project, and let us know how it evolves. I think we will be pleased to add "function macros" linked from modules in the scheduler so we can test it in Shinken. Maybe in the 2.2 version if we get time for it :+1:

aviau commented 10 years ago

@naparuba Thank you for your feedback :+1:

> One other key problem you will have to tackle, which I don't see in your post, is how you will distribute the load between your daemons. Rule assignments won't be enough for large environments.

Why? Do you mean that some rules will generate too much load for a single daemon? This is definitely a problem we will have to think about.

> What you really need are in fact triggers (your Python functions, which could later be Lua) in a metrology tool. It does not have to be in a monitoring tool; it only has to send alerts/states back to a monitoring tool.

I have not thought about it like this, but you might be right. I will take some time to consider this approach.

> For the rule engine, you can look at Riemann

I am looking into Riemann right now. I will try to evaluate whether or not it could cover my use cases. I just want to make sure that it doesn't become a limitation because we cannot alter it to solve Shinken-specific problems.

> Good luck with your project, and let us know how it evolves

I will try to keep this issue up-to-date.

naparuba commented 10 years ago

Let's continue on the devel mailing list, as there will be more help from others there than here.