Closed dhoffend closed 6 years ago
Hi @dhoffend, sorry for my late response...
Unfortunately, this runs straight into a segmentation fault on my test system (Ubuntu Xenial, Naemon 1.0.6):
Naemon Core 1.0.6-source
Copyright (c) 2013-present Naemon Core Development Team and Community Contributors
Copyright (c) 2009-2013 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
License: GPL
Website: http://www.naemon.org
Naemon 1.0.6-source starting... (PID=9668)
Local time is Fri Jun 29 18:42:49 CEST 2018
qh: Socket '/opt/naemon/var/naemon.qh' successfully initialized
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 9672;pid=9672
wproc: Registry request: name=Core Worker 9671;pid=9671
wproc: Registry request: name=Core Worker 9670;pid=9670
wproc: Registry request: name=Core Worker 9669;pid=9669
statusengine: the missing event broker
statusengine: Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
statusengine: Please visit https://www.statusengine.org for more information
statusengine: Contribute to Statusenigne at: https://github.com/nook24/statusengine
statusengine: Thanks for using Statusengine :-)
statusengine: Gearman server address list changed: localhost:4731
statusengine: add gearmand server[0] localhost:4731
statusengine: Register callbacks
Segmentation fault (core dumped)
This happens if I load the broker without arguments, or if I add gearman_server_list
like:
broker_module=/opt/dhoffend/statusengine-1-0-5.o gearman_server_list=localhost:4731
Same with Nagios 4.4.0
Nagios Core 4.4.0
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2018-01-22
License: GPL
Website: https://www.nagios.org
Nagios 4.4.0 starting... (PID=10400)
Local time is Fri Jun 29 18:48:26 CEST 2018
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 10404;pid=10404
wproc: Registry request: name=Core Worker 10403;pid=10403
wproc: Registry request: name=Core Worker 10402;pid=10402
wproc: Registry request: name=Core Worker 10401;pid=10401
Segmentation fault (core dumped)
That's a bit strange, as it worked for me. I've tested it with the OMD Naemon version and with Naemon 1.0.7 source/master on Debian 9. I did not have any defaults, and it loaded without errors.
I'll try my best to test it again. But I'm not a C expert.
Okay, I found something. It worked quite well for me because I was only using enable_ochp/ocsp in my tests and had disabled everything else.
I just tested it with all options turned on and it segfaulted. Then I disabled them one by one. In the end I found the option "use_log_data=1" to be the cause. Once I set use_log_data=0, everything seems to work fine. I can't see the root cause at first glance.
Here's the backtrace. It has something to do with the log job that gets put into the log queue.
Simple Config
broker_module=/opt/statusengine/bin/naemon/statusengine-1-0-5.o use_log_data=1
Run with gdb
(gdb) run
Starting program: naemon-dbg naemon.cfg
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Naemon Core 1.0.7.source
Copyright (c) 2013-present Naemon Core Development Team and Community Contributors
Copyright (c) 2009-2013 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
License: GPL
Website: http://www.naemon.org
Naemon 1.0.7.source starting... (PID=4365)
Local time is Mon Jul 02 00:15:33 CEST 2018
qh: Socket '/opt/openitc/nagios/var/naemon.qh' successfully initialized
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Fully initialized and ready to rock!
statusengine: the missing event broker
statusengine: Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
statusengine: Please visit https://www.statusengine.org for more information
statusengine: Contribute to Statusenigne at: https://github.com/nook24/statusengine
statusengine: Thanks for using Statusengine :-)
statusengine: start with disabled log_data
statusengine: Register callbacks
statusengine: add gearmand server[0] 127.0.0.1:4730
Program received signal SIGSEGV, Segmentation fault.
gearman_task_internal_create (client=client@entry=0x555555774a00, task=0x7ffffffeddf0) at libgearman/task.cc:103
103 client->task_list->prev= task;
(gdb) bt
#0 gearman_task_internal_create (client=client@entry=0x555555774a00, task=0x7ffffffeddf0) at libgearman/task.cc:103
#1 0x00007ffff626b45f in add_task (client=..., task=task@entry=0x7ffffffeddf0, context=context@entry=0x555555774a00, command=command@entry=GEARMAN_COMMAND_SUBMIT_JOB_BG, function=..., unique=...,
workload=..., when=0, actions=...) at libgearman/add.cc:203
#2 0x00007ffff626ddce in _client_do_background (client=0x555555774a00, command=GEARMAN_COMMAND_SUBMIT_JOB_BG, function=..., unique=..., workload=..., job_handle=0x0) at libgearman/client.cc:257
#3 0x00007ffff626dfaa in gearman_client_do_background (client=0x555555774a00, function_name=<optimized out>, unique=0x0, workload_str=0x55555590b180, workload_size=178, job_handle=0x0)
at libgearman/client.cc:734
#4 0x00007ffff68d87a6 in statusengine_send_job () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#5 0x00007ffff68dc6ef in statusengine_handle_data () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#6 0x00007ffff7b655ca in neb_make_callbacks_full () from /usr/lib/naemon/libnaemon.so.0
#7 0x00007ffff7b65654 in neb_make_callbacks () from /usr/lib/naemon/libnaemon.so.0
#8 0x00007ffff7b45430 in broker_log_data () from /usr/lib/naemon/libnaemon.so.0
#9 0x00007ffff7b5f2bf in ?? () from /usr/lib/naemon/libnaemon.so.0
#10 0x00007ffff7b5f41b in nm_log () from /usr/lib/naemon/libnaemon.so.0
#11 0x00007ffff68d8491 in logswitch () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#12 0x00007ffff68d8ed0 in nebmodule_init () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#13 0x00007ffff7b6524b in neb_load_module () from /usr/lib/naemon/libnaemon.so.0
#14 0x00007ffff7b653d8 in neb_load_all_modules () from /usr/lib/naemon/libnaemon.so.0
#15 0x00005555555574e1 in main (argc=<optimized out>, argv=<optimized out>) at src/naemon/naemon.c:573
Okay, got one ... one more to go. It looks like the module has a problem when the gearman server is unavailable: creating a log message (to log the connection error) triggers another broker event, which in turn wants to put a message into the log queue ... this somehow creates a log loop.
I've played around with suppressing error logs while we're handling log_data events ... but this requires more thought. We want the logs (in the file), but without creating a log loop.
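For illustration, here is a minimal sketch of that kind of guard. It is not the module's actual code: `core_log`, `handle_log_event`, `send_job` and the counters are hypothetical stand-ins for the core logger, the log_data callback and the gearman submit call. The idea is that the error line still reaches the log output, but the event it triggers is dropped instead of being queued again.

```c
#include <stdio.h>

/* Hypothetical stand-ins; none of these are real Statusengine or Naemon symbols. */
static int in_log_handler = 0;   /* set while a log_data event is being handled */
static int lines_logged = 0;     /* lines that reached the log output */
static int jobs_sent = 0;        /* jobs accepted by the (simulated) job server */
static int server_available = 0; /* 0 simulates an unreachable gearmand */

static void handle_log_event(const char *message);

/* Stand-in for the core logger: every log line also fires a log_data event. */
static void core_log(const char *message)
{
    lines_logged++;
    fprintf(stderr, "%s\n", message);
    handle_log_event(message);
}

/* Stand-in for submitting a background job; fails if the server is down. */
static int send_job(const char *payload)
{
    (void)payload;
    if (!server_available)
        return -1;
    jobs_sent++;
    return 0;
}

/* Stand-in for the log_data callback. */
static void handle_log_event(const char *message)
{
    if (in_log_handler)
        return;                  /* re-entered via our own error log: drop it */

    in_log_handler = 1;
    if (send_job(message) != 0)
        core_log("statusengine: could not queue log entry");
    in_log_handler = 0;
}
```

With the server down, one call to `core_log` produces exactly one extra error line and then stops; without the flag the two functions would call each other until the stack overflows.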
Oh yes, I remember. The recursive log message issue is the reason why the broker runs into a segfault if you stop the gearman job server...
Okay, then I'm on the right track. The first issue is that the gearman client must be created before the event handlers are registered. The next step will be a variable that blocks log messages while log entries are being processed; otherwise you end up with a segfault.
I'll push a proposal later to round this request off.
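A rough sketch of the ordering described here, under the assumption that a log line written during module init can already fire the callback (as the backtrace suggests). `fake_client`, `module_init`, `on_event` and `fire_event` are made-up names for illustration, not libgearman or Naemon API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the gearman client handle. */
typedef struct { int ready; } fake_client;
typedef void (*event_cb)(const char *payload);

static fake_client *client = NULL;
static event_cb registered_cb = NULL;
static int queued = 0;

/* Event callback: refuses to run if the client does not exist yet. */
static void on_event(const char *payload)
{
    (void)payload;
    if (client == NULL)
        return;                  /* defensive check against a missing client */
    queued++;
}

/* Stand-in for the core dispatching a broker event. */
static void fire_event(const char *payload)
{
    if (registered_cb != NULL)
        registered_cb(payload);
}

static int module_init(void)
{
    client = calloc(1, sizeof *client);  /* 1. create the client first */
    if (client == NULL)
        return -1;
    client->ready = 1;

    registered_cb = on_event;            /* 2. only then register callbacks */
    fire_event("module loaded");         /* 3. an event during init is now safe */
    return 0;
}

static void module_deinit(void)
{
    registered_cb = NULL;                /* unregister before freeing the client */
    free(client);
    client = NULL;
}
```

The teardown mirrors the setup: the callback is removed before the client is freed, so no event can reach a dangling handle.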
Hi @dhoffend, unfortunately I found a minor bug. If you load the broker with one gearman server as argument, Naemon gets stuck in the broker initialization and does not execute any checks:
broker_module=/opt/dhoffend/statusengine-1-0-5.o gearman_server_list=localhost:4730
Also, the line "Successfully launched command file worker with pid 2304" is missing from the Naemon output. The only way to kill Naemon was kill -9, so I guess the broker is stuck somewhere in an endless loop?
If I load the broker without any arguments, it seems to work.
OK, the issue only occurs if I use localhost instead of 127.0.0.1. So I guess this issue would also have happened with the old argument gearman_server_addrn
I came across the same bug while running tests on Friday. I thought it was my test environment, but changing to the IP did the job ...
I haven't checked yet whether gearman_client does a name lookup for every job ... it is a bit strange.
Maybe there's a difference between gearman_client_add_server and gearman_client_add_servers.
I came across a memory leak. While running 60k checks I could watch Naemon consume more and more memory. After 3 days the process was using 8 GB and more. It was reproducible.
At first I searched my own patches, but eventually I found that the variable raw_command was allocated but never freed. This bug also exists in the master branch.
Edit/Update: The memory leak was rarely noticed because memory consumption stays acceptable when Naemon is restarted regularly to activate a new configuration, or when the number of checks is low ... but with a huge number of checks the increase was quite noticeable.
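To show the shape of this kind of leak and its fix, here is a small self-contained sketch. Only the name `raw_command` comes from the comment above; `handle_check`, `dup_string`, `release` and the allocation counter are hypothetical and do not reflect where the real allocation sits in the module.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter so the example can show that nothing stays allocated. */
static int live_allocations = 0;

/* Copy a string onto the heap and count the allocation. */
static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
        live_allocations++;
    }
    return copy;
}

/* Free a counted allocation. */
static void release(char *p)
{
    if (p != NULL) {
        free(p);
        live_allocations--;
    }
}

/* Hypothetical per-check handler: copies the command, uses it, frees it. */
static size_t handle_check(const char *command_line)
{
    char *raw_command = dup_string(command_line);
    size_t len;

    if (raw_command == NULL)
        return 0;

    len = strlen(raw_command);   /* stand-in for serialising the check result */

    release(raw_command);        /* the step that is easy to forget */
    return len;
}
```

One forgotten free per check is invisible on a small setup, but it is repeated for every check execution, which fits the observation that the growth only became obvious with a large number of checks and a long-running process.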
Hi @dhoffend, many thanks for your fix! I really have to apologize that your pull request is still not merged :( I found myself stuck in the "Sommerloch" and don't touch PCs in my spare time at the moment.
I will definitely take a look at your PR soon and merge it into the master!
I checked the localhost vs 127.0.0.1 issue. This is also broken when using gearman_client_add_server instead of gearman_client_add_servers, so ¯\_(ツ)_/¯
Moved into a new PR to apply some fixes: https://github.com/statusengine/module/pull/2
Hi @dhoffend,
your PR (#2) was merged :) Updated documentation: https://statusengine.org/broker/#broker-options
Changelog: https://statusengine.org/roadmap/#module-3.1.0
Thanks for your support!
Fixes:
Changes:
Example: Send duplicates to 2 gearmand:
Example: have a failover gearmand:
This also works combined (whatever):