statusengine / module

Repository of the Statusengine Event Broker Module
https://statusengine.org/broker/#overview
GNU General Public License v2.0

Some fixes and multiple gearmand server support #1

Closed dhoffend closed 6 years ago

dhoffend commented 6 years ago

Fixes:

Changes:

Example: Send duplicates to 2 gearmand:

gearman_server_list=localhost:4730;localhost:4731

Example: have a failover gearmand:

gearman_server_list=localhost:4731,localhost:4730

Both can also be combined:

gearman_server_list=localhost:4731,localhost:4730;localhost:4733,localhost:4732
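
The examples above suggest that `;` separates groups that each receive a copy of every job, while `,` separates failover candidates inside one group. The following is only a minimal sketch of that splitting, not the code from this PR; the names `server_groups`, `split_server_list` and `failover_count` and the fixed limits are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS    8    /* hypothetical limit, not taken from the module */
#define MAX_GROUP_LEN 256  /* hypothetical limit, not taken from the module */

/* Hypothetical container: one entry per ';'-separated group. */
typedef struct {
    char   servers[MAX_GROUPS][MAX_GROUP_LEN]; /* e.g. "host:port,host:port" */
    size_t count;
} server_groups;

/* Split "a,b;c,d" into the groups "a,b" and "c,d".
 * Returns 0 on success, -1 if there are too many groups or a group is too long. */
static int split_server_list(const char *list, server_groups *out)
{
    assert(list != NULL && out != NULL);
    out->count = 0;

    const char *start = list;
    for (;;) {
        const char *end = strchr(start, ';');
        size_t len = end ? (size_t)(end - start) : strlen(start);

        if (len > 0) { /* skip empty groups such as in "a;;b" */
            if (out->count == MAX_GROUPS || len >= MAX_GROUP_LEN)
                return -1;
            memcpy(out->servers[out->count], start, len);
            out->servers[out->count][len] = '\0';
            out->count++;
        }
        if (end == NULL)
            break;
        start = end + 1;
    }
    return 0;
}

/* Number of comma-separated failover candidates in one group. */
static size_t failover_count(const char *group)
{
    size_t n = (*group != '\0');
    for (; *group != '\0'; group++)
        if (*group == ',')
            n++;
    return n;
}
```

With the combined example, this yields two groups of two failover candidates each. Each group string could then be handed to its own gearman client, so every group gets its own copy of a job.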
nook24 commented 6 years ago

Hi @dhoffend, sorry for my late response...

Unfortunately this runs straight into a segmentation fault on my test system (Ubuntu Xenial, Naemon 1.0.6):

Naemon Core 1.0.6-source
Copyright (c) 2013-present Naemon Core Development Team and Community Contributors
Copyright (c) 2009-2013 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
License: GPL

Website: http://www.naemon.org
Naemon 1.0.6-source starting... (PID=9668)
Local time is Fri Jun 29 18:42:49 CEST 2018
qh: Socket '/opt/naemon/var/naemon.qh' successfully initialized
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 9672;pid=9672
wproc: Registry request: name=Core Worker 9671;pid=9671
wproc: Registry request: name=Core Worker 9670;pid=9670
wproc: Registry request: name=Core Worker 9669;pid=9669
statusengine: the missing event broker
statusengine: Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
statusengine: Please visit https://www.statusengine.org for more information
statusengine: Contribute to Statusenigne at: https://github.com/nook24/statusengine
statusengine: Thanks for using Statusengine :-)
statusengine: Gearman server address list changed: localhost:4731
statusengine: add gearmand server[0] localhost:4731
statusengine: Register callbacks
Segmentation fault (core dumped)

This happens if I load the broker without arguments, or if I add gearman_server_list like this:

broker_module=/opt/dhoffend/statusengine-1-0-5.o gearman_server_list=localhost:4731

Same with Nagios 4.4.0

Nagios Core 4.4.0
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2018-01-22
License: GPL

Website: https://www.nagios.org
Nagios 4.4.0 starting... (PID=10400)
Local time is Fri Jun 29 18:48:26 CEST 2018
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 10404;pid=10404
wproc: Registry request: name=Core Worker 10403;pid=10403
wproc: Registry request: name=Core Worker 10402;pid=10402
wproc: Registry request: name=Core Worker 10401;pid=10401
Segmentation fault (core dumped)
dhoffend commented 6 years ago

That's a bit strange, as it worked for me. I tested it with the OMD Naemon version and with Naemon 1.0.7-source (master) on Debian 9. I did not have any defaults and it loaded without errors.

I'll do my best to test it again, but I'm no C expert.

dhoffend commented 6 years ago

Okay, I found something. It worked quite well for me because I was only using enable_ochp/ocsp in my tests and had disabled everything else.

I just tested it with all options turned on and it segfaulted. Then I disabled them one by one. In the end I found the option "use_log_data=1" to be the cause. Once I disabled it (use_log_data=0), everything seemed to work fine. I can't spot the cause at first glance.

dhoffend commented 6 years ago

Here's the backtrace. It has something to do with the log job that gets put into the log queue.

Simple Config

broker_module=/opt/statusengine/bin/naemon/statusengine-1-0-5.o use_log_data=1

Run with gdb

(gdb) run
Starting program: naemon-dbg naemon.cfg
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Naemon Core 1.0.7.source
Copyright (c) 2013-present Naemon Core Development Team and Community Contributors
Copyright (c) 2009-2013 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
License: GPL

Website: http://www.naemon.org
Naemon 1.0.7.source starting... (PID=4365)
Local time is Mon Jul 02 00:15:33 CEST 2018
qh: Socket '/opt/openitc/nagios/var/naemon.qh' successfully initialized
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Fully initialized and ready to rock!
statusengine: the missing event broker
statusengine: Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
statusengine: Please visit https://www.statusengine.org for more information
statusengine: Contribute to Statusenigne at: https://github.com/nook24/statusengine
statusengine: Thanks for using Statusengine :-)
statusengine: start with disabled log_data
statusengine: Register callbacks
statusengine: add gearmand server[0] 127.0.0.1:4730

Program received signal SIGSEGV, Segmentation fault.
gearman_task_internal_create (client=client@entry=0x555555774a00, task=0x7ffffffeddf0) at libgearman/task.cc:103
103       client->task_list->prev= task;
(gdb) bt
#0  gearman_task_internal_create (client=client@entry=0x555555774a00, task=0x7ffffffeddf0) at libgearman/task.cc:103
#1  0x00007ffff626b45f in add_task (client=..., task=task@entry=0x7ffffffeddf0, context=context@entry=0x555555774a00, command=command@entry=GEARMAN_COMMAND_SUBMIT_JOB_BG, function=..., unique=..., 
    workload=..., when=0, actions=...) at libgearman/add.cc:203
#2  0x00007ffff626ddce in _client_do_background (client=0x555555774a00, command=GEARMAN_COMMAND_SUBMIT_JOB_BG, function=..., unique=..., workload=..., job_handle=0x0) at libgearman/client.cc:257
#3  0x00007ffff626dfaa in gearman_client_do_background (client=0x555555774a00, function_name=<optimized out>, unique=0x0, workload_str=0x55555590b180, workload_size=178, job_handle=0x0)
    at libgearman/client.cc:734
#4  0x00007ffff68d87a6 in statusengine_send_job () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#5  0x00007ffff68dc6ef in statusengine_handle_data () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#6  0x00007ffff7b655ca in neb_make_callbacks_full () from /usr/lib/naemon/libnaemon.so.0
#7  0x00007ffff7b65654 in neb_make_callbacks () from /usr/lib/naemon/libnaemon.so.0
#8  0x00007ffff7b45430 in broker_log_data () from /usr/lib/naemon/libnaemon.so.0
#9  0x00007ffff7b5f2bf in ?? () from /usr/lib/naemon/libnaemon.so.0
#10 0x00007ffff7b5f41b in nm_log () from /usr/lib/naemon/libnaemon.so.0
#11 0x00007ffff68d8491 in logswitch () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#12 0x00007ffff68d8ed0 in nebmodule_init () from /opt/statusengine/bin/naemon/statusengine-1-0-5.o
#13 0x00007ffff7b6524b in neb_load_module () from /usr/lib/naemon/libnaemon.so.0
#14 0x00007ffff7b653d8 in neb_load_all_modules () from /usr/lib/naemon/libnaemon.so.0
#15 0x00005555555574e1 in main (argc=<optimized out>, argv=<optimized out>) at src/naemon/naemon.c:573
dhoffend commented 6 years ago

Okay, got one ... one more to go. It looks like the module has a problem when the gearman server is unavailable: creating a log message (logging the connection error) triggers another broker event that wants to put a message into the log queue, which somehow creates a log loop.

I've played around with suppressing error logs while we're handling log_data events, but this requires more thought. We want to have the logs (in the file) without creating a log loop.

nook24 commented 6 years ago

Oh yes, I remember. The recursive log message issue is the reason why the broker runs into a segfault if you stop the gearman job server...

dhoffend commented 6 years ago

Okay, then I'm on the right track. The first issue is that the gearman client must be created before registering the event handler. The next step will be a variable that blocks log messages while processing log entries; otherwise you end up with a segfault.

I'll push a proposal later to round off this request.
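
A minimal sketch of such a blocking variable, using stand-in functions instead of the real Naemon and gearman calls. All names here (`processing_log_event`, `fake_core_log`, `fake_send_job`, `module_log`, `module_handle_log_event`) are hypothetical and not taken from the module. The sketch assumes the core fires a log event for every log line and that the send always fails because the job server is down.

```c
#include <assert.h>
#include <stddef.h>

static int processing_log_event = 0; /* hypothetical guard flag */
static int core_log_calls = 0;       /* counts calls into the stand-in core logger */

static void module_handle_log_event(const char *line);

/* Stand-in for the core logger: every logged line fires a log event
 * that is delivered back to the module. */
static void fake_core_log(const char *line)
{
    assert(line != NULL);
    core_log_calls++;
    module_handle_log_event(line);
}

/* Stand-in for sending a job: always fails, as if gearmand were down. */
static int fake_send_job(const char *payload)
{
    (void)payload;
    return -1;
}

/* The module's own logging helper: stays silent while a log event is
 * being processed, because logging now would re-enter the handler. */
static void module_log(const char *line)
{
    if (processing_log_event)
        return;
    fake_core_log(line);
}

/* Handler for log events: tries to queue the line as a job. */
static void module_handle_log_event(const char *line)
{
    processing_log_event = 1;
    if (fake_send_job(line) != 0)
        module_log("statusengine: could not send log job");
    processing_log_event = 0;
}
```

Without the check in `module_log`, a single log line would recurse until the stack overflows. Note that this sketch simply drops the error message; keeping it in the log file without re-triggering the broker event would need a separate path.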

nook24 commented 6 years ago

Hi @dhoffend, unfortunately I found a minor bug. If you load the broker with one gearman server as an argument, Naemon gets stuck in the broker initialization and does not execute any checks:

broker_module=/opt/dhoffend/statusengine-1-0-5.o gearman_server_list=localhost:4730

Also, the line "Successfully launched command file worker with pid 2304" is missing from the Naemon output. The only way to kill Naemon was kill -9, so I guess the broker is stuck somewhere in an endless loop?

If I load the broker without any arguments, it seems to work.

nook24 commented 6 years ago

OK, the issue only occurs if I use localhost instead of 127.0.0.1. So I guess this issue would also have happened with the old argument gearman_server_addrn.

dhoffend commented 6 years ago

I came across the same bug while running tests on Friday. I thought it was my test environment, but switching to the IP did the job ...

I haven't checked yet whether gearman_client does a name lookup for every job ... it is a bit strange.

dhoffend commented 6 years ago

Maybe there’s a difference between gearman_client_add_server and gearman_client_add_servers.

dhoffend commented 6 years ago

I came across a memory leak. While running 60k checks I could watch naemon consuming more and more memory. After 3 days the process was consuming 8 GB and more. It was reproducible.

At first I searched my own patches, but finally I found out that the variable raw_command was allocated but never freed. This bug also exists in the master branch.

Edit/Update: The leak was rarely noticed because the memory consumption stays acceptable when naemon is restarted regularly to activate a new configuration, or when it runs only a small number of checks. With a huge number of checks, the memory increase was quite noticeable.
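
The pattern behind this kind of leak can be sketched as follows. This is not the module's code: `build_raw_command`, `release_raw_command`, `handle_check_event` and the `live_allocations` counter are hypothetical stand-ins that only show a per-event allocation being released before the handler returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static long live_allocations = 0; /* bookkeeping for this sketch only */

/* Hypothetical stand-in for whatever allocates raw_command per check. */
static char *build_raw_command(const char *command_name)
{
    assert(command_name != NULL);
    size_t len = strlen(command_name) + 1;
    char *copy = malloc(len);
    if (copy != NULL) {
        memcpy(copy, command_name, len);
        live_allocations++;
    }
    return copy;
}

/* Release the buffer; accepting NULL keeps the call sites simple. */
static void release_raw_command(char *raw_command)
{
    if (raw_command != NULL) {
        free(raw_command);
        live_allocations--;
    }
}

/* One check event: build the command, use it, then free it.
 * Returns the payload length, or -1 if the allocation failed. */
static int handle_check_event(const char *command_name)
{
    char *raw_command = build_raw_command(command_name);
    if (raw_command == NULL)
        return -1;

    /* Stand-in for serializing the event and sending the job. */
    size_t payload_len = strlen(raw_command);

    release_raw_command(raw_command); /* the step that was missing */
    return (int)payload_len;
}
```

Leaving out the release call leaks a few bytes per check, which only becomes visible with many checks and long uptimes, matching the behaviour described above.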

nook24 commented 6 years ago

Hi @dhoffend, many thanks for your fix! I really have to apologize that your pull request is still not merged :( I found myself stuck in the "Sommerloch" (summer slump) and I'm not touching PCs in my spare time at the moment.

I will definitely take a look at your PR soon and merge it into the master!

nook24 commented 6 years ago

I checked the localhost vs 127.0.0.1 issue. This is also broken when using gearman_client_add_server instead of gearman_client_add_servers, so ¯\_(ツ)_/¯

nook24 commented 6 years ago

Moved into a new PR to apply some fixes: https://github.com/statusengine/module/pull/2

nook24 commented 6 years ago

Hi @dhoffend,

your PR (#2) was merged :) Updated documentation: https://statusengine.org/broker/#broker-options

Changelog: https://statusengine.org/roadmap/#module-3.1.0

Thanks for your support!