skywalka / splunk-for-nagios

Analytics for Nagios
GNU General Public License v3.0

Enhancement - Calculate SLA for services and hosts #3

Closed xkilian closed 10 years ago

xkilian commented 11 years ago

Calculate SLA and take into account downtimes. Calculate MTTR and acknowledgement delay during outages.
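
To make the MTTR and acknowledgement-delay part of the request concrete, here is a minimal sketch in Python. It assumes the outage history has already been reduced to a chronological list of (timestamp, kind) pairs; the kind labels 'DOWN', 'ACK' and 'UP' and the function name outage_metrics are hypothetical, chosen only for this illustration, and are not part of the app.

```python
from statistics import mean

def outage_metrics(events):
    """Return (mttr, mean_ack_delay) in seconds for a list of events.

    events: chronological list of (timestamp, kind) tuples, where kind is
    'DOWN', 'ACK' or 'UP' (hypothetical labels used only in this sketch).
    """
    repair_times = []
    ack_delays = []
    down_at = None   # start time of the outage currently in progress
    acked = False    # whether the current outage was already acknowledged
    for ts, kind in events:
        if kind == "DOWN" and down_at is None:
            down_at, acked = ts, False
        elif kind == "ACK" and down_at is not None and not acked:
            # Only the first acknowledgement of an outage counts.
            ack_delays.append(ts - down_at)
            acked = True
        elif kind == "UP" and down_at is not None:
            repair_times.append(ts - down_at)
            down_at = None
    mttr = mean(repair_times) if repair_times else None
    mean_ack = mean(ack_delays) if ack_delays else None
    return mttr, mean_ack
```

An outage that is still open at the end of the list is ignored here; whether it should count towards MTTR is a policy choice this sketch does not try to settle.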

tfhartmann commented 11 years ago

What would that look like? I think it's a great idea, but I'm not sure how to calculate that...


skywalka commented 11 years ago

Apologies for closing this request, I clicked the wrong button! I have reopened it :)

These would be (very) nice to have. I want to have a look at "Nagios Operations Center" and then incorporate some of their ninja skillz: http://splunk-base.splunk.com/apps/52020/nagios-operations-center

L.

skywalka commented 11 years ago

We could experiment with MK Livestatus to retrieve the SLA data: http://mathias-kettner.de/checkmk_livestatus.html

It becomes very difficult to do SLA calcs from the standard Nagios log data, as Splunk doesn't keep state natively whereas Nagios does out of the box...

MK Livestatus could be the way around that :)

L.

tfhartmann commented 11 years ago

Yeah, I love the mk_livestatus integration actually, really awesome. I wonder if it's worth making mk_livestatus an input (maybe modular?). If not an input, I bet we could run a search to populate the lookups for pulldowns with mk_livestatus!

xkilian commented 11 years ago

Hmm... The data to do the SLA calculations is all in the logs. Thruk only uses the nagios log data to calculate the SLA. For downtimes, I have to check in the log, but it is my understanding that this data is also present.

See how Thruk handles the SLA calculations in their code. (It is Perl, but you will get the idea and the algorithm, yay for open source!) EDIT: They use Livestatus to get the data from the Livestatus logstore and do the reporting from that.

You should NOT poll data from Livestatus other than for showing current state. Polling Livestatus directly for SLA data would be nice, of course, but it is not meant for that. It would kill performance unless the SLAs were pre-calculated in the background for pre-determined time frames, e.g. 1 day, 1 week, 1 month. But this would be a gross solution and would not permit any of the fancy statistical munging which is the bread and butter of Splunk.
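
As a rough illustration of the log-based calculation described above, the following Python sketch computes availability over a time window from state changes and downtime intervals. It is not Thruk's algorithm. The function name availability, its parameters, and the policy of excluding scheduled downtime from the measured period are all assumptions made for this example (another common policy is to count downtime as OK time).

```python
def availability(state_changes, downtimes, start, end, initial_ok=True):
    """Fraction of non-downtime time spent in an OK state within [start, end).

    state_changes: chronological list of (timestamp, is_ok) tuples.
    downtimes: list of (dt_start, dt_end) intervals, assumed non-overlapping.
    Returns None if the whole window is covered by downtime.
    """
    def overlap(a0, a1, b0, b1):
        # Length of the intersection of [a0, a1) and [b0, b1).
        return max(0, min(a1, b1) - max(a0, b0))

    def in_downtime(a0, a1):
        return sum(overlap(a0, a1, d0, d1) for d0, d1 in downtimes)

    ok_time = 0
    cursor, ok = start, initial_ok
    for ts, is_ok in state_changes:
        if ts <= start:
            ok = is_ok          # state inherited from before the window
            continue
        if ts >= end:
            break
        if ok:
            ok_time += (ts - cursor) - in_downtime(cursor, ts)
        cursor, ok = ts, is_ok
    if ok:
        ok_time += (end - cursor) - in_downtime(cursor, end)

    measured = (end - start) - in_downtime(start, end)
    return ok_time / measured if measured > 0 else None
```

For example, with a 1000-second window, a failure from t=100 to t=300 and a downtime from t=200 to t=400, the measured period is 800 seconds, of which 700 are OK.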

xkilian commented 11 years ago

I have created a broker module for Shinken which exports SERVICE and HOST logs (same format as nagios.log) to a raw TCP socket, TCP port 9514 by default. The universal forwarder would listen on this socket to process the data.

The SERVICE and HOST data includes state changes and downtimes.
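
For reference, a small Python sketch of how such lines could be split into fields, assuming they follow the usual nagios.log layout for ALERT and DOWNTIME ALERT entries. The function name parse_alert and the dictionary keys are hypothetical; this is not code from the broker module, which may emit other line types as well.

```python
import re

# "[epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;output"
# "[epoch] HOST ALERT: host;STATE;STATE_TYPE;attempt;output"
# "[epoch] SERVICE DOWNTIME ALERT: host;service;STARTED; comment"
# "[epoch] HOST DOWNTIME ALERT: host;STARTED; comment"
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE|HOST)( DOWNTIME)? ALERT: (.*)$")

def parse_alert(line):
    """Parse one state-change or downtime line; return a dict, or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts, kind, downtime, rest = m.groups()
    fields = rest.split(";")
    n_id = 2 if kind == "SERVICE" else 1     # host [+ service description]
    needed = n_id + (1 if downtime else 2)   # + downtime flag, or state + type
    if len(fields) < needed:
        return None
    event = {"time": int(ts), "type": kind.lower(), "host": fields[0]}
    if kind == "SERVICE":
        event["service"] = fields[1]
    if downtime:
        event["downtime"] = fields[n_id]     # e.g. STARTED or STOPPED
    else:
        event["state"] = fields[n_id]        # e.g. OK, CRITICAL, DOWN
        event["state_type"] = fields[n_id + 1]  # SOFT or HARD
    return event
```

Lines that are not alerts, or that have too few fields, come back as None so that the caller can skip them.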

You can find it on GitHub in xkilian/shinken, branch syslog, under shinken/modules/rawsocket_broker.py

I have not tested it yet.

xkilian commented 11 years ago

Still not tested, I will try and get around to it this week, but I have a whole fleet of monkeys on my back… sigh

skywalka commented 11 years ago

I have added a script to request a host's service SLA by accessing MK Livestatus...

Example usage: index=nagios src_host="eping.big-data.com.au" name="time" | head 1 | eval daysago=5 | dedup src_host,name | liveservicesla | stats max(liveservicesla) AS liveservicesla | eval liveservicesla=liveservicesla*100

Commit: https://github.com/skywalka/splunk-for-nagios/commit/254cc9065f5cc25811f223b7356c301c7eb675a3

L.

skywalka commented 11 years ago

New Commit: https://github.com/skywalka/splunk-for-nagios/commit/d40a1ad98bc958a42d93ddf8bb20f7ce812834da

FYI: the 'daysago' variable in the "Example usage" (above) gets parsed by the script for the SLA time window calculation for MK Livestatus :)
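
To illustrate how a 'daysago' value could be turned into a time window for Livestatus, here is a hedged Python sketch that only builds the query text. It assumes the Livestatus statehist table and its duration_part_ok column are available in the installed version. The function name build_sla_query and its parameters are hypothetical, and the script in the commit above may work differently.

```python
import time

def build_sla_query(host, service, daysago, now=None):
    """Build a Livestatus query for the OK fraction over the last N days.

    Assumption: the 'statehist' table with a 'duration_part_ok' column is
    available; summing that column over the window gives a value in 0..1.
    """
    now = int(time.time()) if now is None else int(now)
    start = now - int(daysago) * 86400   # window start, N days back
    lines = [
        "GET statehist",
        f"Filter: time >= {start}",
        f"Filter: time < {now}",
        f"Filter: host_name = {host}",
        f"Filter: service_description = {service}",
        "Stats: sum duration_part_ok",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"
```

The resulting fraction would then be multiplied by 100 in the search, as in the example usage.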

tfhartmann commented 11 years ago

Was looking at that today! Maybe make a macro with a couple of inputs so that we can pass form inputs to it?


skywalka commented 11 years ago

Will do :)

skywalka commented 11 years ago

Hey Tim :)

Here is the initial Livestatus Service SLA Dashboard... I didn't have any luck getting it to work with a macro, so please work your mad crazy ninja skills :)

Commit: https://github.com/skywalka/splunk-for-nagios/commit/42de0093588181eccef2fbe28544825c505a67f4

skywalka commented 11 years ago

Here is the initial Livestatus Host SLA Dashboard and script:

Commit: https://github.com/skywalka/splunk-for-nagios/commit/812fe2c0a09dd4c1abcab68a8c9b22085acf5b69

skywalka commented 11 years ago

I have updated the Livestatus Service and Host SLA Dashboards, removing the redundant TimeRangePicker module:

Commit: https://github.com/skywalka/splunk-for-nagios/commit/90f4a9fe8c1908515b7a6254a037be8a1d58b3c2