sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

have Issues with mod_gearman #151

Closed MohanGan closed 2 years ago

MohanGan commented 4 years ago

Hi @sni

We have deployed Naemon Core, gearmand and mod_gearman with the versions below:

Server version: Red Hat Enterprise Linux Server release 7.6 (Maipo)
Naemon Core 1.2
Gearmand 0.33
mod_gearman 3.3.0

-bash-4.2$ naemon

Naemon Core 1.2.0
Copyright (c) 2013-present Naemon Core Development Team and Community Contributors
Copyright (c) 2009-2013 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad

-bash-4.2$ gearmand -V

gearmand 0.33 - https://bugs.launchpad.net/gearmand

-bash-4.2# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
-bash-4.2# yum list installed | grep mod_gearman
Failed to set locale, defaulting to C
mod_gearman.x86_64 3.3.0-1.el7 @nagios-server

We regularly see very strange issues in gearmand, in the NEB module, and even on the gearman worker side.

The major issue is that all checks are showing as orphans.

Below is the output from gearman_top. It shows that the NEB module has died: the check_results queue has 0 workers and its waiting jobs keep piling up. The gearman workers also disconnect and reconnect very often, and the number of available workers in the screenshot below sometimes shows 0 and sometimes 1.

We are seeing the error below in the gearmand log:

tail -f /var/log/gearmand/gearmand.log
ERROR 2020-04-26 04:41:38.000000 [ proc ] Job handle plus handle count beyond GEARMAND_JOB_HANDLE_SIZE: H:hostname:11888764 -> libgearman-server/job.c

[screenshot: gearman_top output]
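The symptom described above (jobs waiting in check_results with no worker attached) can also be checked without a screenshot. The sketch below assumes gearmand answers the plain-text `status` admin command on its job port with tab-separated lines (function name, total jobs, running jobs, available workers) ending in a line containing a single dot. The helper names `parse_status`, `stalled_queues` and `fetch_status` are made up for this example and are not part of mod_gearman.

```python
import socket


def parse_status(text):
    """Parse a gearmand 'status' reply into {queue: {total, running, workers}}."""
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if line == ".":  # a lone dot terminates the reply
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name, total, running, workers = parts
        queues[name] = {
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return queues


def stalled_queues(queues):
    """Return queues that hold jobs but have no worker registered."""
    return sorted(
        name for name, q in queues.items()
        if q["total"] > 0 and q["workers"] == 0
    )


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Ask a running gearmand for its status text (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")
```

Calling `stalled_queues(parse_status(fetch_status()))` on the gearmand host would list the queues in that state; a result containing `check_results` would match the situation where the NEB module's result worker is gone.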

One more odd thing we have observed on the gearman worker side: the worker threads do not release memory, so all available memory ends up in buffers, and the gearman worker process does not come back up once the worker log is rotated.

[2020-05-26 04:39:05][37018][DEBUG] got host job: hostname
[2020-05-26 04:39:05][37018][ERROR] worker error: gearman_worker_grab_job(GEARMAN_UNEXPECTED_PACKET) unexpected packet:ERROR -> libgearman/worker.cc:781
[2020-05-26 04:39:06][36755][ERROR] worker error: gearman_worker_grab_job(GEARMAN_UNEXPECTED_PACKET) unexpected packet:ERROR -> libgearman/worker.cc:781
[2020-05-26 04:39:06][36962][ERROR] worker error: gearman_worker_grab_job(GEARMAN_UNEXPECTED_PACKET) unexpected packet:ERROR -> libgearman/worker.cc:781
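To see how widespread these worker errors are, the log can be summarized per error code and per worker PID. This is a minimal sketch that assumes every worker log line follows the `[timestamp][pid][LEVEL] message` layout seen in the excerpt above; `summarize_errors` is a hypothetical helper written for this example.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: [timestamp][pid][LEVEL] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>[A-Z]+)\]\s*(?P<msg>.*)$"
)
# Error codes appear in parentheses, e.g. (GEARMAN_UNEXPECTED_PACKET)
CODE_RE = re.compile(r"\((GEARMAN_[A-Z_]+)\)")


def summarize_errors(lines):
    """Count ERROR lines per gearman error code and collect the PIDs involved."""
    per_code = Counter()
    pids = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "ERROR":
            continue  # ignore DEBUG/INFO lines and anything unparseable
        code = CODE_RE.search(match.group("msg"))
        key = code.group(1) if code else "OTHER"
        per_code[key] += 1
        pids.setdefault(key, set()).add(int(match.group("pid")))
    return per_code, pids
```

Passing an open log file (for example the worker logfile named in the worker config) to `summarize_errors` shows whether the unexpected-packet error comes from a few worker processes or from all of them.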

====================

Below is the NEB config:

###############################################################################
#
#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Mod-Gearman NEB Module Config
#
###############################################################################

debug=1
logfile=/var/log/mod_gearman/mod_gearman_neb.log
server=localhost:4730
eventhandler=no
notifications=no
services=yes
hosts=yes
do_hostchecks=yes
route_eventhandler_like_checks=no
encryption=yes
keyfile=/etc/mod_gearman/gmSecret.key
use_uniq_jobs=on

###############################################################################
#
#  NEB Module Config
#
#  the following settings are for the neb module only and
#  will be ignored by the worker.
#
###############################################################################
localservicegroups=gearman_bypass
result_workers=1
perfdata=no
perfdata_send_all=no
perfdata_mode=1
orphan_host_checks=yes
orphan_service_checks=yes
orphan_return=2
accept_clear_results=no

Below is the worker config:

#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Worker Module Config

identifier=hostname.dev
debug=1
debug-result=yes
logfile=/var/log/mod_gearman/mod_gearman_worker_hostname_4730.log

server=hostname:4730
eventhandler=no
notifications=no
services=yes
hosts=yes
encryption=yes
keyfile=/etc/mod_gearman/gmSecret.key
job_timeout=60
min-worker=5
max-worker=400
idle-timeout=30
max-jobs=1000
max-age=0
spawn-rate=1
fork_on_exec=no
load_limit1=30
load_limit5=0
load_limit15=0
show_error_output=yes
timeout_return=2
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl

# Workarounds
workaround_rc_25=3
orphan_return=3
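Since the NEB module and the worker have to agree on a few settings (at least the server, the encryption switch and the key file), a quick side-by-side comparison of the two configs can rule out a simple mismatch. The sketch below assumes plain `key=value` lines with `#` comments, as in the files above. `parse_config`, `compare` and the `SHARED_KEYS` list are hypothetical choices for this example, and a repeated key keeps only its last value here.

```python
# Keys that appear in both configs above; which ones must match is an assumption.
SHARED_KEYS = (
    "server", "encryption", "keyfile",
    "eventhandler", "notifications", "services", "hosts",
)


def parse_config(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # last occurrence wins
    return settings


def compare(neb, worker, keys=SHARED_KEYS):
    """Return {key: (neb_value, worker_value)} for shared keys that differ."""
    return {
        key: (neb.get(key), worker.get(key))
        for key in keys
        if neb.get(key) != worker.get(key)
    }
```

For the two configs posted here, this would report only `server` (`localhost:4730` versus `hostname:4730`). That is not necessarily a problem if both names resolve to the same gearmand instance, but it is worth confirming.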

Here in our environment we are using naemon core, gearmand and live status as well.

Can you help us to fix this issue? Appreciate your help on this.

sni commented 4 years ago

No idea tbh, and I don't think it makes sense to debug gearman version 0.33 from over 8 years ago. We updated gearman to the latest 1.1.x in OMD last year and have had good results so far, but I never had the chance to update the standalone packages. Ideally that would happen while migrating them to the openSUSE Build Service, to make the builds more public and open source. My best guess for now: if you have the chance, use the latest gearmand and compile mod-gearman against it.

MohanGan commented 4 years ago

Thanks very much for your reply, @sni. Could you also provide the RPM download link for the latest stable gearmand 1.1.x? I'm not able to find a downloadable RPM package on labsconsole. Could you also let us know the exact version that is working well now? Also, please point us to any documentation on how to compile mod-gearman against the gearmand package, and on what modifications are needed to build the new gearmand package.

The tarball for gearmand 1.1.19.1 seems to be broken, and I'm not able to create an RPM from the source code.

I have tried building an RPM from the 1.1.19.1 source code and from the 1.1.18 tar.gz file, and neither works for me. Can you let me know which recent version of gearmand is working for you?

One more thing, FYI: we are setting up this new environment in the Azure cloud.

Thanks in advance, @sni

sni commented 4 years ago

We are mostly using 1.1.19.1 now. No modifications were made except for changes to the SSL paths, which should be unrelated. See https://github.com/ConSol/omd/tree/labs/packages/gearmand