sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0
122 stars 42 forks source link

sending job to gearmand failed since 5.1.2 (still OK in 5.1.0) #171

Open pvdputte opened 1 year ago

pvdputte commented 1 year ago

Hi,

Could there be a regression between 5.1.0 and 5.1.2 in submitting jobs to gearmand?

I'm using a clean Debian 11 install with the packages from OBS (as you recommend). This is from /var/log/mod-gearman/mod_gearman_neb.log with debug=4

[...]
[2023-08-09 01:15:37][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] handle_perfdata(6)
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 01:15:37][26182][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - MEMUSED, check_options: 0   latency so far: 0.286s
[2023-08-09 01:15:37][26182][TRACE] cmd_line: /usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"
[2023-08-09 01:15:37][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 1, 1, 0, 60)
[2023-08-09 01:15:37][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"

<---
[2023-08-09 01:15:37][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++

[2023-08-09 01:15:42][26182][TRACE] create_client()
[2023-08-09 01:15:42][26182][INFO ] add_job_to_queue() retrying... 0
[2023-08-09 01:15:42][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 0, 1, 0, 60)
[2023-08-09 01:15:42][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"

<---
[2023-08-09 01:15:42][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++
[2023-08-09 01:15:47][26182][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(26182)
[2023-08-09 01:15:47][26182][TRACE] create_client()
[2023-08-09 01:15:47][26182][TRACE] add_job_to_queue() finished with errors: 0
[2023-08-09 01:15:47][26182][ERROR] failed to send service check to gearmand
[2023-08-09 01:15:47][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:47][26182][TRACE] handle_progam_status_data_events(11, data)
[2023-08-09 01:15:47][26182][TRACE] log file rotated: 0
[2023-08-09 01:15:47][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:47][26182][TRACE] handle_perfdata(6)
[...]

By chance I had the 5.1.0 .deb still in the /var/cache/apt of a different testing VM. (note: this deb is from the labs.consol.de repository as I just recently switched, not sure if that matters).

If I dpkg -i the 5.1.0 one without changing my conffiles, it works again:

[...]
[2023-08-09 15:30:03][36755][TRACE] move_results_to_core()
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] handle_perfdata(6)
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 15:30:03][36755][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - SSH, check_options: 0   latency so far: 0.513s
[2023-08-09 15:30:03][36755][TRACE] cmd_line: /usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue(servicegroup_somegroup, 0D169733BEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFE68, 1, 1, 1, 1, 60)
[2023-08-09 15:30:03][36755][TRACE] 227 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=SSH
core_time=1691587803.000045
timeout=120
command_line=/usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4

<---
[2023-08-09 15:30:03][36755][TRACE] 320 +++>
12Py+XD5Kr8<REDACTED>kPddZBa552c6cpucUT0
<+++
[2023-08-09 15:30:03][36755][INFO ] gearmand submission statistics: jobs:      3   errors:       0   submit_rate:    0.0/s   avg_submit_duration: 0.000098s   max_submit_duration: 0.000145s
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully -> 206
[2023-08-09 15:30:03][36755][TRACE] got result H:com-dmcore001:47
[2023-08-09 15:30:03][36755][TRACE] 492 +++>
O8eJdV+43h55e<REDACTED>zXhzaSc=
<+++
[2023-08-09 15:30:03][36755][TRACE] 368 --->
type=active
host_name=some.host.tld
service_description=SSH
core_start_time=1691587803.000045
start_time=1691587804.909941
finish_time=1691587804.964980
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ some-worker.tld
output=SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) | time=0.053162s;;;0.000000;10.000000

<---
[2023-08-09 15:30:03][36755][DEBUG] service job completed: some.host.tld SSH: exit 0, latency: 0.518, exec_time: 0.055
[2023-08-09 15:30:04][36755][TRACE] move_results_to_core()
[...]
pvdputte commented 1 year ago

So, for the record, this is a working combination.

# dpkg -l |grep -e naemon -e gearm
ii  gearman-job-server                   1.1.19.1+ds-2+b2               amd64        Job server for the Gearman distributed job queue
ii  gearman-tools                        1.1.19.1+ds-2+b2               amd64        Tools for the Gearman distributed job queue
ii  libgearman8:amd64                    1.1.19.1+ds-2+b2               amd64        Library providing Gearman client and worker functions
ii  libnaemon:amd64                      1.4.1-1                        amd64        Library for Naemon - common data files
ii  mod-gearman-module                   5.1.0                          amd64        Event broker module to distribute service checks.
ii  mod-gearman-tools                    5.1.2-1                        amd64        Tools for mod-gearman
ii  naemon-core                          1.4.1-1                        amd64        host/service/network monitoring and management system
ii  naemon-livestatus                    1.4.1-1                        amd64        contains the Naemon livestatus eventbroker module

Same on Debian 12 'bookworm' with gearman-job-server 1.1.20 by the way (which is where I noticed the issue first).

pvdputte commented 1 year ago

I did some manual compiles. 5.1 was still fine for me, 5.1.1 was broken. The problem starts right after this commit:

https://github.com/sni/mod_gearman/commit/87e22207c67e1e3305362b46e183a3130e8925ae

sni commented 1 year ago

Does it fail immediately?

pvdputte commented 1 year ago

Yes. 0 jobs arrive in gearmand.

pvdputte commented 8 months ago

FYI I did a reinstall on Debian 12 today using

ii  naemon-core    1.4.2-1      amd64        host/service/network monitoring and management system
ii  mod-gearman-module 5.1.2-1      amd64        Event broker module to distribute service checks.

and I got the same result.

I was keeping an old copy of the 5.1 binary to work around it, I'll have to recompile for the new v6 event broker API.

Error: Module '/usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o' is using an incompatible version (v5) of the event broker API (current version: v6). Module will be unloaded.
pvdputte commented 8 months ago

Or should I look into something else that could be wrongly configured on my system, as I find it hard to believe no-one else has been using mod-gearman-module on Debian the last few months. :thinking:

sni commented 8 months ago

or those packages from OBS? I assumed they automatically rebuild if there is an update in naemon-core.

pvdputte commented 8 months ago

I'm using

$ cat /etc/apt/sources.list.d/naemon-stable.list 
deb [signed-by=/etc/apt/trusted.gpg.d/naemon.asc] http://download.opensuse.org/repositories/home:/naemon/Debian_12/ ./

apt show confirms it.

sni commented 8 months ago

i will look into this. There was a new naemon release some days ago, i assume that's the issue here. NEB modules have to be rebuild against that version. I will trigger a rebuild for mod-gearman.

pvdputte commented 8 months ago

The build is fine for all I know, there's no NEB error starting naemon after installing. But using those packages still results in the problem described in the issue here: check jobs don't arrive in gearmand.

I only get that NEB v6 error when I try to replace /usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o with an older 5.1 version that still worked for me on Debian 12 with gearman-job-server 1.1.20+ds-1.

In other words: the issue is still the same with naemon 1.4.2 and the 5.1.2-1 version of mod-gearman-module on Debian 12, fetched from download.opensuse.org/repositories/home:/naemon/Debian_12

At least for me :slightly_smiling_face:

sni commented 8 months ago

I find it hard to believe no-one else has been using mod-gearman-module on Debian

I use it on several debian 12 machines, but i am using it with OMD-Labs which comes with its own gearmand. Had no issues so far.

pvdputte commented 8 months ago

I'm currently digging a bit deeper as I found that a plain vanilla install of all packages results in a working configuration.

Still a bit at a loss though as in the failing environment I now have identical naemon-core mod-gearman-module packages with identical

/etc/naemon/naemon.cfg
/etc/naemon/module-conf.d/mod-gearman.cfg
/etc/mod-gearman/module.conf

but no fix. I'll update this issue as soon as I find out more.

pvdputte commented 8 months ago

Turns out it's because I've been running gearmand for ages with --listen=0.0.0.0 to let remote mod-gearman-worker instances connect.

So I had no reason to suspect anything wrong with the gearmand setup (that I had not touched for a long time) and the mod-gearman-module suddenly stopped working properly with version 5.1.2 / this commit: https://github.com/sni/mod_gearman/commit/87e22207c67e1e3305362b46e183a3130e8925ae

I.e. these errors appeared out of nowhere:

sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available

But by default in the Debian gearman-job-server package, gearmand is started with --listen=localhost, and when I revert to that, version 5.1.2 starts working properly again.

What had me puzzled is that the check_results queue was created and had naemon in it. But the host/notification/service queues etc. were never created because that part of mod-gearman-module 5.1.2 apparently no longer works if gearmand listens on 0.0.0.0 instead of localhost .

5.1.0 still worked with gearmand on 0.0.0.0 though. :shrug:

Anyway, the fix for me is to just remove the --listen=... thing, by default it listens on everything then.

Sorry to bother you, maybe this helps someone else later :slightly_smiling_face: