Closed markkawika closed 2 years ago
not that it matters, but why do need 2000 worker on one host? Even on huge machines i never utilized more than 200 workers.
We have a gigantic installation with thousands of servers, and the worker boxes were getting way backed up. We were trying to see if increasing max-worker would help.
Just as a data point, we find that one worker process per check per second is plenty, even on our VMs. If you're running 2000 per second on a single worker box, you need more worker boxes.
Whether or not we need 2000 workers is irrelevant to this bug. The bug is real.
Bump. Did anyone get a chance to look into this ?
I've just seen similar behaviors with the latest mod_gearman2. i.e. lots of [ERROR] double used worker slot: 29368 != 31891 in the log. then : [INFO ] no checks in 2minutes, restarting all workers
Every worker uses 5 slots of shared memory counters, the maximum number is set here: https://github.com/sni/mod_gearman/blob/master/include/common.h#L149 Which means the maximum number of workers is ~800. Could you try to increase that constant and see if that helps?
Hi,
Neither Mark nor myself are in that role any more. Seven years is a long time.
Thanks,
Alan
yeah, saw that right after i hit the comment button.
When we increase the max-worker value on our worker nodes to around 1500-2000, we start seeing anomalous behavior by the workers: slow, unrelable, etc.
In our investigation, we've concluded that some part of mod-gearman is overwriting a buffer in the shared memory.
We have "debug=2" set, trying to track down the cause of our slowness, and we see several indications of corrupt shared memory. Here are a couple of examples:
First, we noticed a bunch of the message "double used worker slot". Here's an example output from the log file:
[2014-09-18 00:53:25][11916][ERROR] double used worker slot: 11916 != 536937252 [2014-09-18 00:53:26][11929][ERROR] double used worker slot: 11929 != 536937252 [2014-09-18 00:53:39][12090][ERROR] double used worker slot: 12090 != 536937252 [2014-09-18 00:53:40][12100][ERROR] double used worker slot: 12100 != 536937252 [2014-09-18 00:54:46][11917][ERROR] double used worker slot: 11917 != 958791696 [2014-09-18 00:54:46][11939][ERROR] double used worker slot: 11939 != 536937252 [2014-09-18 00:54:50][12991][ERROR] double used worker slot: 12991 != 536937252 [2014-09-18 01:08:11][12091][ERROR] double used worker slot: 12091 != 958791696 [2014-09-18 01:08:35][13079][ERROR] double used worker slot: 13079 != 536937252 [2014-09-18 01:08:35][12109][ERROR] double used worker slot: 12109 != 536937252 [2014-09-18 01:08:35][11844][ERROR] double used worker slot: 11844 != 536937252 [2014-09-18 01:09:02][12992][ERROR] double used worker slot: 12992 != 958791696 [2014-09-18 01:32:28][6894][ERROR] double used worker slot: 6894 != 536937252 [2014-09-18 01:32:52][7040][ERROR] double used worker slot: 7040 != 536937252
Next, we noticed that a piece of code that was closing stale workers was logging impossible values for the pid:
[2014-09-19 10:13:33][28316][TRACE] removed stale worker 1024, old pid: 825242203 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1025, old pid: 959458612 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1026, old pid: 540619053 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1027, old pid: 825897009 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1028, old pid: 858995251 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1029, old pid: 942824285 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1030, old pid: 1563832627 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1031, old pid: 1095914587 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1032, old pid: 542983491
I wondered about the pid being reported, so I converted the decimal values to hex, and then printed out the ASCII representation:
atla-bfp-29-sr4(nagios.prod.slave[production]):/var/log/mod_gearman$ (grep -E 'pid: -?[0-9][0-9][0-9][0-9][0-9][0-9]' mod_gearman_worker.log.1|awk '{print $NF}' | while read v; do hexstr=$(printf '%x' "$v"); escstr=$(echo $hexstr | sed -e 's/../\x&/g'); ascstr=$(printf "$escstr"); echo "$ascstr"; done)|head -15 102[ 90-4 91- 1:01 33:3 82[] ]613 ART[ ]EC teg txen mhs dni_ )(xe
If you account for big-endian vs. little-endian, that ASCII string is:
[2014-09-19 10:13:33][28316][TRACE] get_next_shm_index() ->
That's very clearly a log message that's over-writing the pid values in shared memory.
We also noticed this:
[2014-09-19 10:55:21][18957][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:55:31][19295][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:55:52][22272][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:02][22617][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:33][25504][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:43][25787][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:57:04][28722][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:57:14][28936][INFO ] no checks in 2minutes, restarting all workers
Notice the time stamps for all of those indicate that there hasn't been anywhere close to 2 minutes between restarts.
I noticed that the value being checked is stored in shm, so it makes sense that this value is being corrupted too.