Increasing the max-worker causes buffer overruns

markkawika commented 10 years ago

When we increase the max-worker value on our worker nodes to around 1500-2000, we start seeing anomalous behavior by the workers: slow, unrelable, etc.

In our investigation, we've concluded that some part of mod-gearman is overwriting a buffer in the shared memory.

We have "debug=2" set, trying to track down the cause of our slowness, and we see several indications of corrupt shared memory. Here are a couple of examples:

First, we noticed a bunch of the message "double used worker slot". Here's an example output from the log file:

[2014-09-18 00:53:25][11916][ERROR] double used worker slot: 11916 != 536937252 [2014-09-18 00:53:26][11929][ERROR] double used worker slot: 11929 != 536937252 [2014-09-18 00:53:39][12090][ERROR] double used worker slot: 12090 != 536937252 [2014-09-18 00:53:40][12100][ERROR] double used worker slot: 12100 != 536937252 [2014-09-18 00:54:46][11917][ERROR] double used worker slot: 11917 != 958791696 [2014-09-18 00:54:46][11939][ERROR] double used worker slot: 11939 != 536937252 [2014-09-18 00:54:50][12991][ERROR] double used worker slot: 12991 != 536937252 [2014-09-18 01:08:11][12091][ERROR] double used worker slot: 12091 != 958791696 [2014-09-18 01:08:35][13079][ERROR] double used worker slot: 13079 != 536937252 [2014-09-18 01:08:35][12109][ERROR] double used worker slot: 12109 != 536937252 [2014-09-18 01:08:35][11844][ERROR] double used worker slot: 11844 != 536937252 [2014-09-18 01:09:02][12992][ERROR] double used worker slot: 12992 != 958791696 [2014-09-18 01:32:28][6894][ERROR] double used worker slot: 6894 != 536937252 [2014-09-18 01:32:52][7040][ERROR] double used worker slot: 7040 != 536937252

Next, we noticed that a piece of code that was closing stale workers was logging impossible values for the pid:

[2014-09-19 10:13:33][28316][TRACE] removed stale worker 1024, old pid: 825242203 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1025, old pid: 959458612 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1026, old pid: 540619053 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1027, old pid: 825897009 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1028, old pid: 858995251 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1029, old pid: 942824285 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1030, old pid: 1563832627 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1031, old pid: 1095914587 [2014-09-19 10:13:33][28316][TRACE] removed stale worker 1032, old pid: 542983491

I wondered about the pid being reported, so I converted the decimal values to hex, and then printed out the ASCII representation:

atla-bfp-29-sr4(nagios.prod.slave[production]):/var/log/mod_gearman$ (grep -E 'pid: -?[0-9][0-9][0-9][0-9][0-9][0-9]' mod_gearman_worker.log.1|awk '{print $NF}' | while read v; do hexstr=$(printf '%x' "$v"); escstr=$(echo $hexstr | sed -e 's/../\x&/g'); ascstr=$(printf "$escstr"); echo "$ascstr"; done)|head -15 102[ 90-4 91- 1:01 33:3 82[] ]613 ART[ ]EC teg txen mhs dni_ )(xe

If you account for big-endian vs. little-endian, that ASCII string is:

[2014-09-19 10:13:33][28316][TRACE] get_next_shm_index() ->

That's very clearly a log message that's over-writing the pid values in shared memory.

We also noticed this:

[2014-09-19 10:55:21][18957][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:55:31][19295][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:55:52][22272][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:02][22617][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:33][25504][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:56:43][25787][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:57:04][28722][INFO ] no checks in 2minutes, restarting all workers [2014-09-19 10:57:14][28936][INFO ] no checks in 2minutes, restarting all workers

Notice the time stamps for all of those indicate that there hasn't been anywhere close to 2 minutes between restarts.

I noticed that the value being checked is stored in shm, so it makes sense that this value is being corrupted too.

sni commented 10 years ago

not that it matters, but why do need 2000 worker on one host? Even on huge machines i never utilized more than 200 workers.

markkawika commented 10 years ago

We have a gigantic installation with thousands of servers, and the worker boxes were getting way backed up. We were trying to see if increasing max-worker would help.

jmcook commented 9 years ago

Just as a data point, we find that one worker process per check per second is plenty, even on our VMs. If you're running 2000 per second on a single worker box, you need more worker boxes.

markkawika commented 9 years ago

Whether or not we need 2000 workers is irrelevant to this bug. The bug is real.

deorth commented 9 years ago

Bump. Did anyone get a chance to look into this ?

I've just seen similar behaviors with the latest mod_gearman2. i.e. lots of [ERROR] double used worker slot: 29368 != 31891 in the log. then : [INFO ] no checks in 2minutes, restarting all workers

sni commented 2 years ago

Every worker uses 5 slots of shared memory counters, the maximum number is set here: https://github.com/sni/mod_gearman/blob/master/include/common.h#L149 Which means the maximum number of workers is ~800. Could you try to increase that constant and see if that helps?

deorth commented 2 years ago

Hi,

Neither Mark nor myself are in that role any more. Seven years is a long time.

Thanks,

Alan

sni commented 2 years ago

yeah, saw that right after i hit the comment button.

sni / mod_gearman

Increasing the max-worker causes buffer overruns #65