naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

shadownaemon : in_notification_period and in_check_period #113

Closed fledorze closed 8 years ago

fledorze commented 9 years ago

Just installed naemon 1.0.4-20150509. With shadownaemon, in_notification_period and in_check_period are always 0. If I deactivate shadownaemon, in_notification_period and in_check_period are equal to 1, except for hosts or services that really are outside their notification or check period on the backend.

sni commented 9 years ago

Hmm, that's true. Timeperiod information cannot be retrieved via livestatus, so it's not possible for shadownaemon to calculate the timeperiods on its own.

fledorze commented 9 years ago

Hi Sven, I don't understand why shadownaemon should calculate anything. Even when shadownaemon is deactivated, livestatus is used to get information from the backends, and in_notification_period/in_check_period are correct. When I inspect the livestatus columns, I find the in_notification_period and in_check_period parameters, which are just booleans requiring no calculation. Am I wrong? As we use Thruk filters to exclude from our NOC view hosts/services that are outside their check or notification period (for example when a customer did not buy our off-hours service), this would make shadownaemon unusable for us. Thx

sni commented 9 years ago

They are calculated every minute into a cache, that's why it's just a boolean. But it is in fact a calculation based on the timeperiod data, which cannot be done because we don't have the timeperiod configuration data. But maybe it's possible to set the timeperiod cache directly, based on queries from the remote system.
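To make the "calculated every minute into a cache" idea concrete, here is a minimal Python sketch. It is not the Livestatus implementation (which is C++); the `TIMEPERIODS` table, the `TimeperiodCache` class and the choice to return 0 for an unknown name are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical definitions: weekday (0 = Monday) -> list of
# (start_minute, end_minute) ranges counted from midnight.
TIMEPERIODS = {
    "24x7": {d: [(0, 24 * 60)] for d in range(7)},
    "workhours": {d: [(9 * 60, 17 * 60)] for d in range(5)},
}


def is_active(definition, now):
    """True if `now` falls inside one of the ranges defined for its weekday."""
    minute = now.hour * 60 + now.minute
    ranges = definition.get(now.weekday(), [])
    return any(start <= minute < end for start, end in ranges)


class TimeperiodCache:
    """Keeps one precomputed 0/1 flag per timeperiod name."""

    def __init__(self, definitions):
        self.definitions = definitions
        self.flags = {}

    def refresh(self, now):
        # Intended to be called once a minute by whatever scheduler the caller uses.
        self.flags = {
            name: int(is_active(definition, now))
            for name, definition in self.definitions.items()
        }

    def in_period(self, name):
        # A lookup is just a dictionary read, which is why the column looks
        # like a plain boolean. Returning 0 for an unknown name is a choice
        # made in this sketch only.
        return self.flags.get(name, 0)
```

The point of the sketch is that the flag can only be as good as the definitions passed to `refresh`: with no definitions, there is nothing to compute.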

fledorze commented 9 years ago

Not sure I get you again: do you mean that you don't trust the information known (that is, calculated, if I understand well) by the remote backend and reachable by a livestatus request, but have it calculated by shadownaemon instead? But to do that, you would need not only the timeperiod name but also its configuration (included and excluded timeranges), which does not seem to appear in the livestatus data. But why not just trust what the remote backend says?

sni commented 9 years ago

That's right, and that is the reason why shadownaemon cannot calculate the current state itself. Shadownaemon could fetch the remote data, but I've found no way to inject that data back into Livestatus, because Livestatus uses a C++ map type cache and I have no idea how to manipulate that data with Shadownaemon. Maybe a more experienced C programmer is able to do that, but that's way beyond my C skill level.

fledorze commented 9 years ago

Maybe I'm a bit silly, but I still don't understand why you do not just get the in_notification_period and in_check_period parameters of hosts or services from the remote backend through livestatus, just like Thruk does for other parameters.

sni commented 9 years ago

Well, almost all livestatus database fields map directly to object attributes, but some, like these ones, are calculated. Sure, we can fetch in_notification_period from the remote side, but what should be done with that information? There is no attribute in naemon which can be updated with that information.
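For reference, fetching those two columns from a remote backend is the easy half. A minimal Python sketch of such a Livestatus query follows; it assumes the backend exposes Livestatus on a TCP port, and the helper names (`build_query`, `parse_rows`, `query_backend`) as well as any host and port passed in are hypothetical.

```python
import json
import socket

COLUMNS = ["name", "in_notification_period", "in_check_period"]


def build_query(table, columns):
    """Build a Livestatus GET query that asks for JSON output."""
    return (
        f"GET {table}\n"
        f"Columns: {' '.join(columns)}\n"
        "OutputFormat: json\n"
        "\n"
    )


def parse_rows(raw, columns):
    """Turn the JSON list-of-lists reply into one dict per row."""
    return [dict(zip(columns, row)) for row in json.loads(raw)]


def query_backend(host, port, table, columns, timeout=10.0):
    """Send one query over TCP and read until the backend closes the socket."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_query(table, columns).encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return parse_rows(b"".join(chunks).decode("utf-8"), columns)
```

Calling `query_backend(host, port, "hosts", COLUMNS)` would return the remote flags as dictionaries. What the sketch does not solve is the problem described above: where to store those values on the shadownaemon side.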

fledorze commented 9 years ago

Understood!! My brain is so slow sometimes. It explains the difference between when shadownaemon is used and when it is not, of course. I won't be able to help you concerning C++ in livestatus. But without reliable in_notification_period and in_check_period parameters from remote backends, we won't be able to use shadownaemon, for the reason I explained above. For the moment, we have 13 backends and around 22000 hosts/services. But we initially planned to add many more. Will Thruk support the load with shadownaemon? Difficult to say.

sni commented 9 years ago

It makes no difference for Thruk, it's just slower. With shadownaemon we operate Thruk instances with more than 200 remote cores. Without shadownaemon, I know of at least some Thruk instances running with around 70 cores connected.

fledorze commented 9 years ago

That is my fear: performance degradation with too many backends. For the moment, performance is acceptable. Furthermore, Thruk regularly gets stuck when a backend is sick or needs to be restarted, or possibly for some other unknown reason (cf. my post "Thruk Naemon/Thruk server completely stucked"). It does not happen with shadownaemon, which seems to smooth problems over. If you have any advice to improve performance and reliability, that would be great. Our central naemon is a VMware virtual machine with 2 vCPUs and 2 GB RAM.

fledorze commented 9 years ago

One more question (probably a silly one again...): if timeperiod data were downloaded from the backends to the Thruk server some other way, would shadownaemon be able to calculate the in_notification_period and in_check_period parameters properly? I suppose it would require doing it by considering the backend time, not the local time, as backends may be in different time zones. On the other hand, is there any plan to add timeperiod data to livestatus?

sni commented 9 years ago

Yes, it could work if the timeperiod definitions were available somehow. And no, I don't know of any plans to add that directly to livestatus. Timeperiods can be nested and cascaded with includes and excludes. I don't think there is an easy way to cover that with livestatus.
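A rough Python sketch of what such an evaluation would involve, assuming the definitions were somehow available locally: it handles weekday ranges and nested `exclude` references only, and evaluates them in the backend's time zone rather than the central server's. The `PERIODS` table and the `in_period` function are hypothetical, and fixed UTC offsets stand in for real named time zones (a real setup would more likely use `zoneinfo.ZoneInfo` to get daylight saving right).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical definitions, shaped as if parsed from the backend's config.
# "ranges": weekday (0 = Monday) -> list of (start_minute, end_minute).
# "exclude": names of timeperiods that are subtracted from this one.
PERIODS = {
    "workhours": {
        "ranges": {d: [(9 * 60, 17 * 60)] for d in range(5)},
        "exclude": ["lunch"],
    },
    "lunch": {
        "ranges": {d: [(12 * 60, 13 * 60)] for d in range(5)},
        "exclude": [],
    },
}


def in_period(name, when_utc, backend_tz, periods, _seen=frozenset()):
    """Evaluate `name` at `when_utc`, using the backend's wall-clock time."""
    if name in _seen:
        raise ValueError(f"circular exclude involving {name!r}")
    period = periods[name]
    local = when_utc.astimezone(backend_tz)
    minute = local.hour * 60 + local.minute
    ranges = period["ranges"].get(local.weekday(), [])
    if not any(start <= minute < end for start, end in ranges):
        return False
    # Inside this period's own ranges: an active excluded period overrides it.
    return not any(
        in_period(excluded, when_utc, backend_tz, periods, _seen | {name})
        for excluded in period["exclude"]
    )
```

The same instant gives different answers per backend: 10:30 UTC on a Monday is inside `workhours` for a backend on UTC, but falls into the excluded `lunch` period for a backend two hours ahead.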

fledorze commented 9 years ago

It's not a problem for me to regularly download by SSH (but not very often, as they rarely change) the complete timeperiod definitions from the backends and put them in a /var/cache/naemon/backend_id/tmp/timeperiods.cfg file, for example. But how to merge them with /var/cache/naemon/backend_id/tmp/objects.cfg, which is regularly updated by shadownaemon and already contains timeperiod objects without their definitions? I suppose the shadownaemon code needs to be changed a little, right? Maybe a mechanism allowing local definitions of objects? And what about the timezone problem?
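As an illustration of the merge step, here is a small Python sketch that reads `define timeperiod { ... }` blocks from object config text and lets the full definitions override the stubs. It is not shadownaemon code; `parse_timeperiods` and `merge_definitions` are hypothetical helpers, and directives whose key contains spaces (calendar dates, for example) are not handled properly.

```python
import re

# Matches the body of each "define timeperiod { ... }" block.
DEFINE_RE = re.compile(r"define\s+timeperiod\s*\{(.*?)\}", re.DOTALL)


def parse_timeperiods(text):
    """Return {timeperiod_name: {directive: value}} from object config text."""
    result = {}
    for body in DEFINE_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            # The first word is taken as the directive, the rest as its value.
            parts = line.split(None, 1)
            if len(parts) == 2:
                attrs[parts[0]] = parts[1].strip()
        name = attrs.get("timeperiod_name")
        if name:
            result[name] = attrs
    return result


def merge_definitions(stubs, full):
    """Prefer the full definition; keep stub-only timeperiods untouched."""
    merged = dict(stubs)
    merged.update(full)
    return merged
```

In this sketch the stubs would come from objects.cfg and the full definitions from the downloaded timeperiods.cfg. Because shadownaemon rewrites objects.cfg regularly, the merge would have to be repeated after each update or done inside shadownaemon itself.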

sni commented 9 years ago

Exactly, that's something which would have to be written first. Something like a new option for a timeperiods file.

fledorze commented 9 years ago

I have rarely developed in C, but I will try to have a look. For me, if you plan to promote shadownaemon, this needs to be fixed, because currently it gives false in_notification_period and in_check_period values. But it performs much better and is the way to go in our context. The PNP4Nagios CPU graphs comparing the classic Thruk and shadownaemon show it clearly.

sni commented 9 years ago

Yes, but I am not going to promote it. I wrote it once because it was a requirement for a project, but I don't have the time to maintain it. I still made it open source, so others could continue or provide patches and don't have to start from zero.

fledorze commented 9 years ago

I understand. For sure, shadownaemon validates the idea of a cache system to smooth over network slowness between the central server and its backends, as in our case with our IPSEC tunnels. But it probably needs to be thought out differently. Do you have other plans to use such a cache system?

sni commented 8 years ago

Please have a look at https://github.com/sni/lmd which fixes all shortcomings of shadownaemon. I will deprecate shadownaemon in favor of lmd.