Store availability data for hosts

mohierf commented 9 years ago

NOTE: some fixes are still to be made ... do not use on production servers!

The module manages _host_checkresult broks to compute and store availability data for all known hosts on a daily basis.

For every day, a document is stored in the availability collection with the following fields:

hostname/service
day (YYYY-MM-DD) and day_ts (timestamp representing day at 00:00)
first received check state and timestamp
last received check state and timestamp
period for 0 state (UP)
period for 1 state (DOWN)
period for 2 state (UNREACHABLE)
period for 3 state (UNKNOWN)
period for 4 state (UNCHECKED)
host has been in downtime : 0/1

The sum of the 5 stored periods is always 86400, the number of seconds in a day. Before the first received check and after the last received check, the host is considered to be in an UNCHECKED period.
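As an illustration of the accounting described above, here is a minimal sketch that computes the five periods for one day from a list of received checks. The function name `daily_periods` and the `(timestamp, state_id)` input shape are assumptions made for this example; they are not taken from the module's source.

```python
SECONDS_PER_DAY = 86400
STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE", 3: "UNKNOWN", 4: "UNCHECKED"}


def daily_periods(day_ts, checks):
    """Return the seconds spent in each state for the day starting at day_ts.

    checks is a list of (timestamp, state_id) tuples (hypothetical shape).
    """
    periods = {state_id: 0 for state_id in STATES}
    day_end = day_ts + SECONDS_PER_DAY
    # Keep only the checks received during this day, in time order.
    checks = sorted(c for c in checks if day_ts <= c[0] < day_end)
    if not checks:
        periods[4] = SECONDS_PER_DAY
        return periods
    # Before the first received check, the host counts as UNCHECKED.
    periods[4] += checks[0][0] - day_ts
    # Each received state lasts until the next check arrives.
    for (ts, state), (next_ts, _) in zip(checks, checks[1:]):
        periods[state] += next_ts - ts
    # After the last received check, the host counts as UNCHECKED again.
    periods[4] += day_end - checks[-1][0]
    return periods
```

With this accounting, the five values always add up to 86400, whatever checks were received.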

The Shinken WebUI uses this data collection to display availability information for each host (see https://github.com/shinken-monitoring/mod-webui/issues/260).

maethor commented 9 years ago

Just to be sure I understand correctly: this collection is updated every time mongo-logs gets a new log for the hostname/service. So "period for UNCHECKED" is initialized to 86400 and decremented when we increment the other values? Am I right?

So when we query the availability from the WebUI, we only compute percentages of 86400?

Is it computing availability for all services, or only for hosts?

mohierf commented 9 years ago

You are right ... it is almost a real time information :-)
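A minimal sketch of the update scheme confirmed here: the day starts with all 86400 seconds counted as UNCHECKED, each new check moves the elapsed time into the previous state, and the WebUI side only has to compute percentages of 86400. The document layout (`periods`, `last_ts`, `last_state`) and the function names are hypothetical and only serve the example.

```python
SECONDS_PER_DAY = 86400
UNCHECKED = 4


def new_day_document():
    """Start the day with the whole 86400 seconds counted as UNCHECKED."""
    periods = {0: 0, 1: 0, 2: 0, 3: 0, UNCHECKED: SECONDS_PER_DAY}
    return {"periods": periods, "last_ts": None, "last_state": None}


def apply_check(doc, ts, state):
    """Move the time elapsed since the previous check out of UNCHECKED."""
    if doc["last_ts"] is not None:
        elapsed = ts - doc["last_ts"]
        doc["periods"][doc["last_state"]] += elapsed
        doc["periods"][UNCHECKED] -= elapsed
    # Remember this check so that the next one can close its period.
    doc["last_ts"], doc["last_state"] = ts, state


def percentages(doc):
    """Express each stored period as a percentage of the day."""
    return {s: 100.0 * secs / SECONDS_PER_DAY
            for s, secs in doc["periods"].items()}
```

Because every increment is matched by an equal decrement of the UNCHECKED period, the stored values keep summing to 86400 at any time of the day.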

At the moment, I have only implemented host checks, but it will be really simple to do it for all services.

I noticed some problems with this simple strategy:

you do not always get 100% of 86400 seconds, because the first and last checks of the day are not received at 00:00 and 24:00 ... so you lose a few seconds every day!
you cannot have availability information for periods shorter than a day

I have some ideas to cope with the first problem ... but I am not yet sure what is the best strategy ... to be discussed! @maethor

maethor commented 9 years ago

I plan to review the entire source code of your plugin (to remove some if len(list) > 0:, for example :D), so in a few hours I will be happy to bring you some suggestions on the strategy :)

Availability for small periods is quite hard. In fact, the best strategy for this kind of thing is the one used by perfdata databases: keep precise information for the last few hours, then aggregate it more and more as time goes on. This is nice because we don't have to set any limit, and we are sure that the database size will not explode. On the other hand, it can complicate the implementation a lot.
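To make the aggregation idea concrete, here is a rough sketch of one possible roll-up step, assuming hourly buckets that are merged into daily buckets once they are older than a retention window. The bucket layout, the `roll_up` name and the two-day window are assumptions for illustration only; nothing like this exists in the module as discussed here.

```python
from collections import defaultdict

HOUR = 3600
DAY = 86400


def roll_up(buckets, now, keep_hourly_for=2 * DAY):
    """Merge hourly buckets older than keep_hourly_for into daily buckets.

    buckets maps (start_ts, width) -> {state_id: seconds} (hypothetical layout).
    Day boundaries are taken as multiples of 86400, i.e. UTC midnight.
    """
    result = defaultdict(lambda: defaultdict(int))
    for (start, width), periods in buckets.items():
        if width == HOUR and start + width <= now - keep_hourly_for:
            # Old hourly bucket: fold it into the day that contains it.
            key = (start - start % DAY, DAY)
        else:
            # Recent or already aggregated bucket: keep it as it is.
            key = (start, width)
        for state, seconds in periods.items():
            result[key][state] += seconds
    return {key: dict(periods) for key, periods in result.items()}
```

Recent data keeps its hourly precision, while older data shrinks to one entry per day, so the number of stored buckets stays bounded by the retention window plus one entry per past day.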

But I think I already have an idea to do this… :)

mohierf commented 9 years ago

Feel free to restart from scratch ... I simply made a mock-up to validate an idea, which was to compute on the fly instead of parsing a big logs table in a database :-)

maethor commented 9 years ago

There is no need to restart from scratch. Your proof of concept is great :)

bittrance commented 8 years ago

What is the status of this feature? Building from latest, I see that there is still no service-based availability in my mongo log. I am somewhat keen on implementing this. @mohierf, @maethor: any ideas/thoughts you want to share?

mohierf commented 8 years ago

@bittrance : as far as I remember (it's been quite a long time ...), you should have information for the hosts and the services.

The module logs some information on start in the brokerd.log to tell you what it will manage. And you have some configuration parameters to include/exclude some services from the recording ... perhaps something to configure in your environment?

I left this issue open because @maethor had an idea for rewriting some parts of the code.

bittrance commented 8 years ago

Indeed. Explicitly setting a services_filter resolves the issue. The text in the module config file says "default is to consider only the services which business impact is > 4". However, since services_filter is commented out in the default config, https://github.com/shinken-monitoring/mod-mongo-logs/blob/master/module/module.py#L154 will actually leave filter_service_criticality unset, which means https://github.com/shinken-monitoring/mod-mongo-logs/blob/master/module/module.py#L373 will be bypassed. Which is right? Should the default be services_filter = getattr(mod_conf, 'services_filter', 'bi:>=4') or should the docs in the config file change?
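The difference between the two defaults can be shown with a small standalone sketch. `ModConf` and `parse_criticality_filter` are hypothetical stand-ins written for this example, and the parser only understands a single `bi:<operator><level>` expression; the real parsing in module.py may differ.

```python
class ModConf:
    """Stand-in for a module configuration where services_filter is commented out."""
    module_name = "mongo-logs"


def parse_criticality_filter(services_filter):
    """Return (operator, level) for a 'bi:<op><level>' filter, else None."""
    if not services_filter.startswith("bi:"):
        return None
    expr = services_filter[len("bi:"):]
    # Try two-character operators first so ">=" is not read as ">".
    for op in (">=", "<=", ">", "<", "="):
        if expr.startswith(op):
            return op, int(expr[len(op):])
    return None


mod_conf = ModConf()

# An empty-string default yields no criticality filter at all.
no_filter = parse_criticality_filter(getattr(mod_conf, "services_filter", ""))

# The default proposed in this comment yields a business impact filter.
with_filter = parse_criticality_filter(getattr(mod_conf, "services_filter", "bi:>=4"))
```

In this sketch `no_filter` is `None`, so any check guarded by the criticality filter is skipped, while `with_filter` is `(">=", 4)`.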

mohierf commented 8 years ago

Because services_filter is commented out, it takes the default value defined in the source code and it is ... an empty string :(

You are right, we should change the doc in the configuration file !

shinken-monitoring / mod-mongo-logs

Store availability data for hosts #1