pythian / opsviz

Agile health checks #50

Open lesaux opened 9 years ago

lesaux commented 9 years ago

I take a different approach to creating Sensu checks. It's not really an issue, just a different way of doing things, so I thought I'd mention it here. I usually install a Graphite client on every server I monitor (when possible; I prefer Diamond), and then write the Sensu checks to verify data against Graphite metrics. You'll say that I am hammering Graphite with a bunch of queries, and I am aware of this, but so far it hasn't been an issue in environments with around 150 boxes. This is why I say it is more agile: I can leverage Graphite's math functions to flatten anomalies and pick the relevant signal out of the noise. I use the check-data.rb script from sensu-community-plugins this way:

/etc/sensu/plugins/check-data.rb -a 120 -s ${graphite_host} -t 'minSeries($graphite_prefix.hostname -s.diskspace.*.byte_percentfree)' -w :::params.graphite.diskspace.bytes.free.warning|20::: -c :::params.graphite.diskspace.bytes.free.critical|10:::

minSeries() might not be the best option here, but this is just an example.
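
For readers who have not used this pattern, here is a minimal Python sketch of the idea behind such a check. It is not the check-data.rb implementation. It assumes the payload shape of Graphite's `/render?format=json` output (a list of series, each with `[value, timestamp]` datapoints), and the function names are hypothetical. The 20/10 defaults mirror the command above; alerting when the value drops *below* a threshold is my assumption, since the metric is a percentage free.

```python
import json

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Sensu/Nagios-style exit codes


def last_value(datapoints):
    """Return the most recent non-null value, or None if there is none."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def check_percent_free(render_json, warning=20.0, critical=10.0):
    """Evaluate the lowest current value across all returned series.

    Assumption: lower is worse (percent free), so we alert below thresholds.
    """
    series = json.loads(render_json)
    values = [last_value(s["datapoints"]) for s in series]
    values = [v for v in values if v is not None]
    if not values:
        return UNKNOWN, "no data returned"
    lowest = min(values)
    if lowest < critical:
        return CRITICAL, f"{lowest:.1f}% free"
    if lowest < warning:
        return WARNING, f"{lowest:.1f}% free"
    return OK, f"{lowest:.1f}% free"
```

A real check would fetch `render_json` over HTTP and exit with the returned status code; that part is left out so the logic can be tested offline.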

lainevcampbell commented 9 years ago

Do you have concerns about latency in getting metrics into Graphite storage? I definitely like collecting everything in one place and doing alerting, anomaly detection/analytics and visualization from there, but I am concerned about latency.

I assume that the Graphite functions still assume normal (Gaussian) distributions, which most of the data probably does not follow, but that this solution is still a better choice than simply using the raw data. Is that correct?

(Replying by email to Pierig Le Saux's opening comment of Tue, Mar 24, 2015, 9:06 AM.)

lesaux commented 9 years ago

Latency in getting metrics into Graphite is not an issue as far as I can tell. The Diamond intervals for shipping metrics work well. Reading data from Graphite is not an issue either, and in fact check-data.rb can alert if the data in Graphite is older than a configurable value. On the distribution of the data, you are right. "Flattening" data with Graphite functions only gets us partly there, and I consider it a temporary solution until I have time to investigate other products such as Skyline (Etsy).
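
The staleness idea can be sketched in a few lines of Python. This is an illustration only, not the check-data.rb code: the function names are hypothetical, datapoints are assumed to be `[value, unix_timestamp]` pairs as in Graphite's JSON render output, and the 120-second default mirrors the `-a 120` in the command earlier in the thread on the assumption that the flag is an allowed age in seconds.

```python
import time


def newest_timestamp(datapoints):
    """Timestamp of the most recent non-null datapoint, or None."""
    for value, timestamp in reversed(datapoints):
        if value is not None:
            return timestamp
    return None


def is_stale(datapoints, max_age_seconds=120, now=None):
    """True when the newest real datapoint is older than max_age_seconds.

    A series with no non-null datapoints at all is treated as stale.
    """
    now = time.time() if now is None else now
    newest = newest_timestamp(datapoints)
    if newest is None:
        return True
    return (now - newest) > max_age_seconds
```

Passing `now` explicitly keeps the function deterministic in tests; a real check would leave it unset.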

This talk by Toufic Boubez is really interesting on the matter: http://www.slideshare.net/tboubez/5-things-i-learned-toufic-boubez-metafor-lisa2014

bfraser commented 9 years ago

Agreed. I think latency is less of a concern than, say, Sensu being unable to retrieve metrics from Graphite, with Graphite becoming a SPoF for the entire monitoring system. It would be simpler if we could alert off the Graphite data directly, using something like Seyren or Tattle (just examples).

Ultimately, I think it makes sense to move more towards a single source of truth / single pane of glass and not have to consult multiple systems / dashboards for different views of essentially the same (or slightly different) data. There are some newer projects such as Prometheus and Bosun that try to tackle these problems.

On the matter of normal (Gaussian) distribution of data, you raise a very valid point. As Pierig mentioned, I feel that making use of the various math functions available in Graphite is definitely a step in the right direction and moves us closer to the end goal. It's much better than having arbitrarily defined static thresholds in Sensu, with no historical data on which to base decisions.
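
To make "flattening" concrete, here is a small Python sketch of a trailing moving median, similar in spirit to what a Graphite smoothing function does server-side. It is an illustration under my own assumptions (hypothetical function name, early points use only the values seen so far), not Graphite's implementation.

```python
from statistics import median


def moving_median(values, window=5):
    """Median of each trailing window; early points use the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(median(values[start:i + 1]))
    return smoothed


# A single spike (95) in otherwise steady data disappears after smoothing,
# so a threshold check on the smoothed series would not fire on it.
raw = [10, 11, 10, 95, 11, 10, 12]
flattened = moving_median(raw, window=3)
```

The trade-off is delay: a sustained change only shows up once it fills more than half of the window.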

Ultimately, it would be nice to be able to apply nonparametric tests such as Kolmogorov-Smirnov (KS) against the data as is done with a tool such as Skyline referenced by Pierig. This gets us even closer to the goal.
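
As a rough illustration of the nonparametric idea, the two-sample KS statistic is just the largest gap between two empirical CDFs, for example a recent window of a metric versus a reference window. The Python sketch below computes only that statistic; it does not compute a p-value, the function name is hypothetical, and it is not a description of how Skyline implements its test. Libraries such as SciPy (`scipy.stats.ks_2samp`) provide the full test.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum of |F_a - F_b| is always reached at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap), and it makes no assumption that either window is normally distributed.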

alexlovelltroy commented 9 years ago

Why is there no way on GitHub to give you all a hug? I love this conversation and am reading more to try to have something of value to add.