sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

No Workers are running after ca. 6 days uptime #41

Closed joerg16 closed 10 years ago

joerg16 commented 11 years ago

After the upgrade from mod_gearman version 1.3.8 to 1.4.6, all mod_gearman workers are gone every 6 days without any errors. I can start them again and everything works fine.

sni commented 11 years ago

Are there any worker processes left? Is even the status worker down? What version of libgearman do you use?

joerg16 commented 11 years ago

I hope you have received my mail. I have enabled debug mode, and after the reload of mod_gearman_worker started by logrotate it looks like mod_gearman doesn't get any jobs.

[2013-08-25 11:59:47][23523][DEBUG] got service job: * - Service Proxi-Check *
[2013-08-25 11:59:47][23531][DEBUG] got service job: * - Service Proxi-Check http://www.google.de
[2013-08-25 11:59:48][23505][DEBUG] got service job: * - Service Proxi-Check *
[2013-08-25 11:59:48][23523][DEBUG] got service job: * - Service Proxi-Check *
[2013-08-25 11:59:48][23531][DEBUG] got service job: * - Service Proxi-Check *
[2013-08-25 11:59:48][23671][DEBUG] child started with pid: 23671
[2013-08-25 11:59:50][23671][DEBUG] got service job: * - Service Webinject *
[2013-08-25 12:00:00][23671][DEBUG] got service job: * - Service Proxi-Check *
[2013-08-25 12:00:00][23523][DEBUG] got service job: * - Service Proxi-Check *****

From this point on, it only starts child processes, which never receive any jobs.

[2013-08-25 12:00:01][5377][DEBUG] --------------------------------
[2013-08-25 12:00:01][5377][DEBUG] configuration:
[2013-08-25 12:00:01][5377][DEBUG] log level: 1
[2013-08-25 12:00:01][5377][DEBUG] log mode: file (1)
[2013-08-25 12:00:01][5377][DEBUG] identifier: **
[2013-08-25 12:00:01][5377][DEBUG] pidfile: /var/mod_gearman/mod_gearman_worker.pid
[2013-08-25 12:00:01][5377][DEBUG] logfile: /var/log/mod_gearman/mod_gearman_worker.log
[2013-08-25 12:00:01][5377][DEBUG] job max num: 1000
[2013-08-25 12:00:01][5377][DEBUG] job max age: 0
[2013-08-25 12:00:01][5377][DEBUG] job timeout: 60
[2013-08-25 12:00:01][5377][DEBUG] min worker: 3
[2013-08-25 12:00:01][5377][DEBUG] max worker: 15
[2013-08-25 12:00:01][5377][DEBUG] spawn rate: 1
[2013-08-25 12:00:01][5377][DEBUG] fork on exec: yes
[2013-08-25 12:00:01][5377][DEBUG]
[2013-08-25 12:00:01][5377][DEBUG] embedded perl: yes
[2013-08-25 12:00:01][5377][DEBUG] use_epn_implicitly: no
[2013-08-25 12:00:01][5377][DEBUG] use_perl_cache: yes
[2013-08-25 12:00:01][5377][DEBUG] p1_file: /usr/share/mod_gearman/mod_gearman_p1.pl
[2013-08-25 12:00:01][5377][DEBUG]
[2013-08-25 12:00:01][5377][DEBUG] server: localhost:4730
[2013-08-25 12:00:01][5377][DEBUG]
[2013-08-25 12:00:01][5377][DEBUG]
[2013-08-25 12:00:01][5377][DEBUG] hosts: yes
[2013-08-25 12:00:01][5377][DEBUG] services: yes
[2013-08-25 12:00:01][5377][DEBUG] eventhandler: yes
[2013-08-25 12:00:01][5377][DEBUG]
[2013-08-25 12:00:01][5377][DEBUG] encryption: no
[2013-08-25 12:00:01][5377][DEBUG] transport mode: base64 only
[2013-08-25 12:00:01][5377][DEBUG] use uniq jobs: overwrite
[2013-08-25 12:00:01][5377][DEBUG] --------------------------------
[2013-08-25 12:00:01][5377][INFO ] reloading config was successful
[2013-08-25 12:00:01][23791][DEBUG] child started with pid: 23791
[2013-08-25 12:00:02][23802][DEBUG] child started with pid: 23802
[2013-08-25 12:00:02][23803][DEBUG] child started with pid: 23803
[2013-08-25 12:00:02][23804][DEBUG] child started with pid: 23804
[2013-08-25 12:00:02][23805][DEBUG] child started with pid: 23805
[2013-08-25 12:00:02][23806][DEBUG] child started with pid: 23806
[2013-08-25 12:00:03][23878][DEBUG] child started with pid: 23878
[2013-08-25 12:00:03][23879][DEBUG] child started with pid: 23879
[2013-08-25 12:00:03][23880][DEBUG] child started with pid: 23880
[2013-08-25 12:00:03][23881][DEBUG] child started with pid: 23881
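To spot this pattern without reading the log by eye, a small script can count what happens after the last reload. The following is only a sketch: the regular expression is derived from the log lines quoted above, and the function name `summarize_after_reload` and the sample data are made up for illustration. It is not part of mod_gearman.

```python
import re

# Matches lines such as:
# [2013-08-25 12:00:01][23791][DEBUG] child started with pid: 23791
# The format is assumed from the excerpts in this thread.
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s?(.*)"
)


def summarize_after_reload(lines):
    """Count child starts and received jobs logged after the last reload."""
    children = jobs = 0
    reload_seen = False
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        message = match.group(4)
        if "reloading config was successful" in message:
            # Reset at every reload so only the most recent one is reported.
            reload_seen = True
            children = jobs = 0
        elif reload_seen and message.startswith("child started with pid"):
            children += 1
        elif reload_seen and re.match(r"got \w+ job", message):
            jobs += 1
    return {
        "reload_seen": reload_seen,
        "children_started": children,
        "jobs_received": jobs,
    }


# Shortened sample taken from the log excerpt above.
sample = [
    "[2013-08-25 12:00:00][23523][DEBUG] got service job: * - Service Proxi-Check *",
    "[2013-08-25 12:00:01][5377][INFO ] reloading config was successful",
    "[2013-08-25 12:00:01][23791][DEBUG] child started with pid: 23791",
    "[2013-08-25 12:00:02][23802][DEBUG] child started with pid: 23802",
]
summary = summarize_after_reload(sample)
print(summary)
```

A result with `children_started` above zero and `jobs_received` at zero some minutes after a reload matches the symptom described here.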

sni commented 11 years ago

Can you run "strace -fp <pid>" on the worker for 2-3 minutes when that happens? It seems like the worker cannot connect to gearmand anymore. Could you also run strace on gearmand during the error?

You can send me the output by mail if you don't want to post them in public.

smetj commented 10 years ago

It seems to me that the "worker controller process" does not always process a SIGHUP signal correctly. I have observed the same on a development setup I have running.

As a workaround I'm now logging to syslog, which makes the problem disappear (since logrotate no longer has to SIGHUP the worker process).
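For readers unfamiliar with why logrotate sends SIGHUP at all: after the log file is renamed, a long-running process keeps writing to the old file until it reopens its log path. The sketch below illustrates that general Unix pattern only. mod_gearman is written in C and this is not its code; `ReopenableLog`, `on_sighup` and `handle_pending_reopen` are hypothetical names.

```python
import os
import signal
import tempfile

reopen_requested = False


def on_sighup(signum, frame):
    # Keep the handler minimal: only record that a reopen was requested.
    global reopen_requested
    reopen_requested = True


class ReopenableLog:
    """Append-only log file that can be reopened after rotation."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def reopen(self):
        # Close the (possibly renamed) file and open the configured path again.
        self.handle.close()
        self.handle = open(self.path, "a")

    def write(self, message):
        self.handle.write(message + "\n")
        self.handle.flush()


def handle_pending_reopen(log):
    """Called from the main loop; reopens the log if SIGHUP arrived."""
    global reopen_requested
    if reopen_requested:
        reopen_requested = False
        log.reopen()
        return True
    return False


# Demo (Unix only): simulate a rotation followed by SIGHUP.
signal.signal(signal.SIGHUP, on_sighup)
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "worker.log")
log = ReopenableLog(path)
log.write("before rotation")
os.rename(path, path + ".1")         # the rename step of a log rotation
os.kill(os.getpid(), signal.SIGHUP)  # the signal sent after rotation
reopened = handle_pending_reopen(log)
log.write("after rotation")
log.handle.close()
```

Logging to syslog sidesteps this whole sequence, because the process no longer owns a log file that has to be reopened.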

sni commented 10 years ago

Thanks for the update, I will have a look.