Problems with mklivestatus if using mod_gearman

jvigna commented 14 years ago

When using mod_gearman with mklivestatus there is a problem that initializing first mklivestatus and later mod_gearman, mklivestatus is put into DEBUG mode and it creates HUGE logs nagios.log files. A temporary fix is to load the broker modules in inverted order. On an installation with 5000 servicechecks this creates a daily nagios.log file > 2G and effectively blocks nagios.

sni commented 14 years ago

I cannot reproduce that. I use livestatus and mod_gearman together with no problems. No matter of the load order. Which versions are you using?

jvigna commented 14 years ago

I use nagios 3.2.3 mklivestatus 1.1.6 and the latest mod_gearman package.

sni commented 14 years ago

I just tried the 3.2.3 with the 1.1.6 and still cannot reproduce. Could you try to Update Livestatus to the 1.1.8? And if that does not help, could i have a look at your nagios.cfg?

jvigna commented 14 years ago

With this order I have the problem (still with 1.1.6):

broker_module=/usr/lib/nagios/extentions/mod_gearman.o server=localhost:4730 eventhandler=yes services=yes hosts=yes key=neteye broker_module=/usr/lib/nagios/extentions/livestatus.o /var/log/nagios/rw/live event_broker_options=-1

With this order all seems ok:

broker_module=/usr/lib/nagios/extentions/livestatus.o /var/log/nagios/rw/live broker_module=/usr/lib/nagios/extentions/mod_gearman.o server=localhost:4730 eventhandler=yes services=yes hosts=yes key=neteye cfg_dir=/etc/nagios/neteye/plugins event_broker_options=-1

I have this also on my test server I just set up. You surely do use livestatus? You only have log entries of livestatus inside nagios.log if you are doing some queries over it, f.ex. use NagVis configured with livestatus backend.

Thanks anyway for your fast feedback. I like this addition to Nagios a Lot and it is a really great help on "big" installations. Maybe just one question why did you not make it multithreaded, instead of spaning processes, with my 5000 servicechecks I've seen I need more or less 100-150 instances and this means that instead of 150 processes (checks) one has 300 processes running. This question just as info I still think the idea which is behind is just GREAT!

sni commented 14 years ago

1) No threads, because you would have to fork anyway and its not a good idea to mix threads and forks. Then, forks make it way easier to kill plugins running into timeout. Then i think there is not a big difference between threads and forks for your system besides make ps output difficult to read. And you would have 150 nagios processes running and 150 plugins, So that should be the same with just the difference that the gearman worker takes 1mb rss and my nagios process ~75mb which makes it faster to fork.

2) i still cannot reproduce. I did a "git co v1.1.6", rebuild livestatus and i did some querys as i use Thruk as alternative Nagios Gui which uses livestatus too. Could you try to update to the latest livestatus and mod_gearman version?

jvigna commented 14 years ago

1) OK that seems a good enough explanation to me ;)

2) I updated to 1.1.6p1 (latest stable release of mklivestatus) and are already using v0.6 of mod_gearman (that seems to me be the last version too, isn't it?). I still have the same problem just strange are the DEBUG messages I see there seems to be something really strange going on :(

[1287410319] livestatus: nebmodule_init(0, 138763968)
[1287410319] livestatus: running on libgearman 0.14
[1287410319] livestatus: create_gearman_client()
[1287410319] livestatus: finished initializing
[1287410319] Event broker module '/usr/lib/nagios/extentions/mod_gearman.o' initialized successfully.
[1287410319] livestatus: handle_process_events(7, data)
[1287410319] livestatus: handle_process_events(7, data)
[1287410319] livestatus: Starting 10 client threads
[1287410319] livestatus: handle_timed_events(8, data)
[1287410319] livestatus: Entering main loop, listening on UNIX socket. PID is 26024
[1287410323] livestatus: handle_process_events(7, data)
[1287410323] livestatus: registered neb callbacks
[1287410323] livestatus: worker 1 started
[1287410323] livestatus: started result_worker thread for queue: check_results
[1287410329] livestatus: handle_timed_events(8, data)
[1287410329] livestatus: handle_timed_events(8, data)

sni commented 14 years ago

The logged messages are mostly from mod_gearman. But seems like mod_gearman uses the logger from livestatus. May i see the line befor the nebmodule_init line? There should be a line telling the used mod_gearman options.

jvigna commented 14 years ago

Here it comes the whole log form init till some "unwanted" DEBUG output:

[1287416491] Nagios 3.2.3 starting... (PID=26024)
[1287416491] Local time is Mon Oct 18 17:41:31 CEST 2010
[1287416491] LOG VERSION: 2.0
[1287416491] livestatus: Version 1.1.6p1 initializing. Socket path: '/var/log/nagios/rw/live'
[1287416491] livestatus: Livestatus has been brought to you by Mathias Kettner
[1287416491] livestatus: Please visit us at http://mathias-kettner.de/
[1287416491] livestatus: Created UNIX control socket at /var/log/nagios/rw/live
[1287416491] livestatus: Opened UNIX socket /var/log/nagios/rw/live
[1287416491] livestatus: successfully finished initialization
[1287416491] Event broker module '/usr/lib/nagios/extentions/livestatus.o' initialized successfully.
[1287416491] livestatus: parse_args_line(server=localhost:4730, 0)
[1287416491] livestatus: parse_args_line(eventhandler=yes, 0)
[1287416491] livestatus: parse_args_line(services=yes, 0)
[1287416491] livestatus: parse_args_line(hosts=yes, 0)
[1287416491] livestatus: parse_args_line(key=neteye, 0)
[1287416491] livestatus: args: server=localhost:4730 eventhandler=yes services=yes hosts=yes key=neteye
[1287416491] livestatus: nebmodule_init(0, 138763968)
[1287416491] livestatus: running on libgearman 0.14
[1287416491] livestatus: create_gearman_client()
[1287416491] livestatus: finished initializing
[1287416491] Event broker module '/usr/lib/nagios/extentions/mod_gearman.o' initialized successfully.
[1287416491] livestatus: handle_process_events(7, data)
[1287416491] livestatus: handle_process_events(7, data)
[1287416491] livestatus: Starting 10 client threads
[1287416491] livestatus: Entering main loop, listening on UNIX socket. PID is 26024
[1287416491] livestatus: handle_timed_events(8, data)
[1287416495] livestatus: handle_process_events(7, data)
[1287416495] livestatus: registered neb callbacks
[1287416495] livestatus: worker 1 started
[1287416495] livestatus: started result_worker thread for queue: check_results
[1287416501] livestatus: handle_timed_events(8, data)
[1287416501] livestatus: handle_timed_events(8, data)
[1287416503] livestatus: handle_svc_check(13, data)
[1287416503] livestatus: received job for queue service: MsgConsole - MsgConsole: Dummy
[1287416503] livestatus: cmd_line: /usr/lib/nagios/plugins/check_msgconsole_severity -i dummy
[1287416503] livestatus: add_job_to_queue(service, MsgConsole-MsgConsole: Dummy, 1, 1, 1)
[1287416503] livestatus: 213 --->type=service
result_queue=check_results
host_name=MsgConsole
service_description=MsgConsole: Dummy
start_time=1287416503.149772
timeout=60
command_line=/usr/lib/nagios/plugins/check_msgconsole_severity -i dummy

<---

sni commented 14 years ago

This is really strange. Seems like mod_gearman uses the livestatus logger. The output is from mod_gearman, but the "livestatus:" prefix is from the livestatus logger. Could you try the latest git version of mod_gearman? I renamed the gearman logger.

jvigna commented 14 years ago

Do you have a nightly build tar source package?

sni commented 14 years ago

No, the source packages is the same like when you do a "git clone ..." and a "make dist".

jvigna commented 14 years ago

I just asked because I'm behind a firewall right now and could only download over http ;), I will try it later and let you have feedback.

sni commented 14 years ago

You can download the latest HEAD Archive from github too. Go too http://github.com/sni/mod_gearman and see the Download link on the top right.

jvigna commented 14 years ago

That did the trick #:O), now with this 0.7 version the problem disappears. One small question maybe you can help me, why do I see sort of bulk service-check processing in the gearman_top, that means that I see ~30 in Running and 0 or few in waiting state and then there are time I see ~1000 in waiting state, is the nagios scheduler broken or is it me not knowing what options to set?

Anyway thanks a lot for your fast help, when do you plan to release this fix? Is there anything other new/untested in 0.7git?

Here for you the log now:

[1287489137] Nagios 3.2.3 starting... (PID=7011)
[1287489137] Local time is Tue Oct 19 13:52:17 CEST 2010
[1287489137] LOG VERSION: 2.0
[1287489137] livestatus: Version 1.1.6p1 initializing. Socket path: '/var/log/nagios/rw/live'
[1287489137] livestatus: Livestatus has been brought to you by Mathias Kettner
[1287489137] livestatus: Please visit us at http://mathias-kettner.de/
[1287489137] livestatus: Created UNIX control socket at /var/log/nagios/rw/live
[1287489137] livestatus: Opened UNIX socket /var/log/nagios/rw/live
[1287489137] livestatus: successfully finished initialization
[1287489137] Event broker module '/usr/lib/nagios/extentions/livestatus.o' initialized successfully.
[1287489137] mod_gearman: Version 0.7
[1287489137] mod_gearman: encryption key should be at least 8 bytes!
[1287489137] Event broker module '/usr/lib/nagios/extentions/mod_gearman.o' initialized successfully.
[1287489137] Finished daemonizing... (New PID=7012)
[1287489137] livestatus: Starting 10 client threads
[1287489137] livestatus: Entering main loop, listening on UNIX socket. PID is 7012

sni commented 14 years ago

The nagios scheduler is not really broken. Maybe a little bit :-) Let me give you an example. Imagine a timeperiod 6x13 with starts on monday at 7am. Then nagios will schedule every single check in this timeperiod for monday morning 7am. And from then on, it will schedule them every x minutes at the same time. Even these waves are flattended a little bit over the day, it starts over at the next day at 7am. There is currently no release plan for the 0.7. The 0.6 is pretty stable and i am currently waiting for feedback :-) Maybe next week or so. I wil announce it on the labs.consol.de blog, because the 0.7 will probably the "production ready" version. I you have a look at the changelog, there haven't been major changes in the 0.7, just a few fixed issues, nothing to worry about. May i ask you what kind of os you are running? Anything special? Old gcc or something like that?

jvigna commented 14 years ago

Then you think I can use this version already in production environment?

I'm using CentOS 5.5 32bit I don't think too old ;)

Most of my checks are 24x7 and what I see is an up and down inside the 5 Minute scheduling, as if nagios tries to force a check on a lot of services in the same timeframe, but I will investigate better next time I'm on site.

sni commented 14 years ago

Its not more or less stable than the 0.6 :-) Its still not the 1.0, so at least you should add proper monitoring for the gearman infrastructure. Adding Checks for the gearmand server and the worker should be enough. And of course, add these checks to a a local servicegroup, so they will be checked by nagios directly. I am using the 0.7 already in production.

sni commented 14 years ago

seems to be fixed now

jvigna commented 14 years ago

Yes with latest version I do not have anymore this type of problems.

Thanks a lot!

sni / mod_gearman

Problems with mklivestatus if using mod_gearman #3