Gearman workers intensively hog CPU while idling (instead of sleeping)

silent-at-gh commented 7 years ago

Hi, this issue has been observed since approximately mid-October 2017 (I assume since the module was updated on CPAN from 2.004.008 to 2.004.009).

System info:

[silent@vbox ~]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
[silent@vbox ~]$ uname -a
Linux vbox.intra 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@dev_gearman-worker:~]# perl -MGearman::Worker -we 'print($Gearman::Worker::VERSION, $/);'
2.004.009

Steps to reproduce:

run a Perl application that spawns approx. 10 workers (the issue is in fact observed even with 1-2 workers);
do not submit any tasks to the job server, so that the workers idle after start;

Observed result:

within 5-10 seconds the worker processes push the load average up to 25-30;
all worker processes are in the running state;
the workers and the job server communicate at a very high rate (see the strace captures under "Additional info")

Expected result:

workers should sleep, and workers waiting for jobs should contribute almost nothing to the system load;
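
For illustration, the expected idle behaviour can be sketched as follows. This is a minimal Python sketch, not the Perl module's code: the worker asks for a job (GRAB_JOB), receives NO_JOB, announces PRE_SLEEP, and then blocks in select() until the server sends NOOP or a timeout expires. The helper names (`packet`, `read_type`, `idle_once`, `fake_server`, `run_demo`) are hypothetical, and the "server" is a local stand-in running over a socket pair, not gearmand.

```python
import select
import socket
import struct
import threading
import time

# Gearman packet type numbers used in this sketch.
PRE_SLEEP, NOOP, GRAB_JOB, NO_JOB = 4, 6, 9, 10

def packet(magic: bytes, ptype: int) -> bytes:
    """Build a 12-byte header: magic, type, payload length (always 0 here)."""
    return struct.pack(">4sII", magic, ptype, 0)

def read_type(sock: socket.socket) -> int:
    """Read exactly one 12-byte header and return its packet type."""
    data = b""
    while len(data) < 12:
        chunk = sock.recv(12 - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return struct.unpack(">4sII", data)[1]

def idle_once(sock: socket.socket, timeout: float) -> str:
    """One idle cycle: grab, announce sleep, then block until woken."""
    sock.sendall(packet(b"\0REQ", GRAB_JOB))
    if read_type(sock) != NO_JOB:
        return "job"
    sock.sendall(packet(b"\0REQ", PRE_SLEEP))
    # The process sleeps here; it uses no CPU until the socket is readable.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return "timeout"
    read_type(sock)  # consume the wake-up packet (NOOP in this sketch)
    return "woken"

def fake_server(sock: socket.socket, wake_after):
    """Hypothetical stand-in for a job server with an empty queue."""
    read_type(sock)                          # GRAB_JOB from the worker
    sock.sendall(packet(b"\0RES", NO_JOB))
    read_type(sock)                          # PRE_SLEEP from the worker
    if wake_after is not None:
        time.sleep(wake_after)               # pretend a job arrives later
        sock.sendall(packet(b"\0RES", NOOP))

def run_demo(wake_after, timeout: float = 1.0) -> str:
    worker_sock, server_sock = socket.socketpair()
    thread = threading.Thread(target=fake_server, args=(server_sock, wake_after))
    thread.start()
    try:
        return idle_once(worker_sock, timeout)
    finally:
        thread.join()
        worker_sock.close()
        server_sock.close()

if __name__ == "__main__":
    print(run_demo(wake_after=0.2))
```

The point of the sketch is that a worker sends one GRAB_JOB per wake-up rather than polling continuously; how the real module schedules its retries is not shown here.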

Workaround:

downgrade perl-Gearman to v. 2.004.008

Additional info

process list:

1  [###########################################*******************100.0%]   Tasks: 111, 292 thr; 35 running
2  [#############################################*****************100.0%]   Load average: 28.94 10.57 5.22 
3  [#############################################*****************100.0%]   Uptime: 48 days, 10:34:23
Mem[|||||||||||###******************                         2.08G/13.2G]
Swp[||                                                       51.6M/3.00G]

PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                                                       
14963 unbound    20   0  169M 55300  2424 R  9.9  0.4  0:07.30 perl /opt/selc/script/selc gearman worker start --foreground
14933 unbound    20   0  186M 68068  2612 R  9.3  0.5  0:07.26 perl /opt/selc/script/selc gearman worker start --foreground
14962 unbound    20   0  169M 55304  2424 R  9.3  0.4  0:07.27 perl /opt/selc/script/selc gearman worker start --foreground
14955 unbound    20   0  165M 53324  2352 R  9.3  0.4  0:07.29 perl /opt/selc/script/selc gearman worker start --foreground                                  
14942 unbound    20   0  186M 68092  2616 R  9.3  0.5  0:07.26 perl /opt/selc/script/selc gearman worker start --foreground
14951 unbound    20   0  177M 61668  2544 R  8.6  0.4  0:06.31 perl /opt/selc/script/selc gearman worker start --foreground
14947 unbound    20   0  169M 55224  2420 R  8.6  0.4  0:06.31 perl /opt/selc/script/selc gearman worker start --foreground
14965 unbound    20   0  169M 55304  2424 R  8.6  0.4  0:06.31 perl /opt/selc/script/selc gearman worker start --foreground
14956 unbound    20   0  163M 50548  2292 R  8.6  0.4  0:06.31 perl /opt/selc/script/selc gearman worker start --foreground
14938 unbound    20   0  163M 50440  2272 R  8.6  0.4  0:07.28 perl /opt/selc/script/selc gearman worker start --foreground
14940 unbound    20   0  163M 50440  2272 R  8.6  0.4  0:07.29 perl /opt/selc/script/selc gearman worker start --foreground
14936 unbound    20   0  186M 68080  2616 R  8.6  0.5  0:07.26 perl /opt/selc/script/selc gearman worker start --foreground
14953 unbound    20   0  228M 99380  2680 R  7.9  0.7  0:06.59 perl /opt/selc/script/selc gearman worker start --foreground
14946 unbound    20   0  186M 68096  2616 R  7.9  0.5  0:06.29 perl /opt/selc/script/selc gearman worker start --foreground
14943 unbound    20   0  186M 68088  2616 R  7.9  0.5  0:06.28 perl /opt/selc/script/selc gearman worker start --foreground
14960 unbound    20   0  163M 50552  2296 R  7.9  0.4  0:06.29 perl /opt/selc/script/selc gearman worker start --foreground
14939 unbound    20   0  163M 50440  2272 R  7.9  0.4  0:06.29 perl /opt/selc/script/selc gearman worker start --foreground
14961 unbound    20   0  169M 55300  2424 R  4.6  0.4  0:06.81 perl /opt/selc/script/selc gearman worker start --foreground
14954 unbound    20   0  165M 53324  2352 R  4.6  0.4  0:06.80 perl /opt/selc/script/selc gearman worker start --foreground
14932 unbound    20   0  186M 68072  2612 R  4.6  0.5  0:06.81 perl /opt/selc/script/selc gearman worker start --foreground
14959 unbound    20   0  163M 50548  2292 R  4.6  0.4  0:06.80 perl /opt/selc/script/selc gearman worker start --foreground
14964 unbound    20   0  169M 55304  2424 R  4.6  0.4  0:06.79 perl /opt/selc/script/selc gearman worker start --foreground
14941 unbound    20   0  163M 50448  2280 R  4.6  0.4  0:06.79 perl /opt/selc/script/selc gearman worker start --foreground
14945 unbound    20   0  186M 68096  2616 R  4.6  0.5  0:06.79 perl /opt/selc/script/selc gearman worker start --foreground
14934 unbound    20   0  186M 68080  2616 R  4.6  0.5  0:06.54 perl /opt/selc/script/selc gearman worker start --foreground
14949 unbound    20   0  169M 55224  2420 R  4.6  0.4  0:06.49 perl /opt/selc/script/selc gearman worker start --foreground
14950 unbound    20   0  177M 61660  2544 R  4.6  0.4  0:06.55 perl /opt/selc/script/selc gearman worker start --foreground
14935 unbound    20   0  186M 68068  2612 R  4.0  0.5  0:06.79 perl /opt/selc/script/selc gearman worker start --foreground
14957 unbound    20   0  163M 50548  2292 R  4.0  0.4  0:06.48 perl /opt/selc/script/selc gearman worker start --foreground
14948 unbound    20   0  169M 55224  2420 R  4.0  0.4  0:06.50 perl /opt/selc/script/selc gearman worker start --foreground
14958 unbound    20   0  163M 50552  2296 R  4.0  0.4  0:06.51 perl /opt/selc/script/selc gearman worker start --foreground
14952 unbound    20   0  228M 99380  2680 R  4.0  0.7  0:06.89 perl /opt/selc/script/selc gearman worker start --foreground
14937 unbound    20   0  163M 50448  2280 R  4.0  0.4  0:06.50 perl /opt/selc/script/selc gearman worker start --foreground
14944 unbound    20   0  186M 68080  2612 R  4.0  0.5  0:06.52 perl /opt/selc/script/selc gearman worker start --foreground
14841 root       20   0 79804  2020  1480 S  0.0  0.0  0:00.01 gearman-worker -s /bin/bash - gearmand -c /opt/selc/script/selc gearman worker start --foregro
14860 unbound    20   0  158M 52448  4572 S  0.0  0.4  0:00.66 perl /opt/selc/script/selc gearman worker start --foreground

strace of worker process:

Trace of process 14951 - perl /opt/selc/script/selc gearman worker start --foreground                                                                        
strace: Process 14951 attached
select(72, [64], NULL, NULL, {0, 500000}) = 1 (in [64], left {0, 499997})
fcntl(64, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(64, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
select(72, [64], NULL, NULL, NULL)      = 1 (in [64])
read(64, "\0RES\0\0\0\n\0\0\0\0", 12)   = 12
fcntl(64, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(64, F_SETFL, O_RDWR)              = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0                                                                                                                 
write(64, "\0REQ\0\0\0\4\0\0\0\0", 12)  = 12
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getpeername(64, {sa_family=AF_INET, sin_port=htons(4730), sin_addr=inet_addr("10.0.0.12")}, [16]) = 0
select(72, NULL, [64], NULL, {10, 0})   = 1 (out [64], left {9, 999855})
getpeername(64, {sa_family=AF_INET, sin_port=htons(4730), sin_addr=inet_addr("10.0.0.12")}, [16]) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(64, "\0REQ\0\0\0\t\0\0\0\0", 12)  = 12
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(72, [64], NULL, NULL, {0, 500000}) = 1 (in [64], left {0, 499997})
fcntl(64, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(64, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
select(72, [64], NULL, NULL, NULL)      = 1 (in [64])
read(64, "\0RES\0\0\0\n\0\0\0\0", 12)   = 12
fcntl(64, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(64, F_SETFL, O_RDWR)              = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(64, "\0REQ\0\0\0\4\0\0\0\0", 12)  = 12
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
rt_sigaction(SIGPIPE, {0x49c2c0, [], SA_RESTORER, 0x7fd8e8f455e0}, {SIG_IGN, [], SA_RESTORER, 0x7fd8e8f455e0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getpeername(64, {sa_family=AF_INET, sin_port=htons(4730), sin_addr=inet_addr("10.0.0.12")}, [16]) = 0
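
The 12-byte strings in the trace are Gearman binary protocol headers: a 4-byte magic (`\0REQ` or `\0RES`), a 4-byte big-endian packet type, and a 4-byte big-endian payload length. The following Python sketch (illustrative only; `decode_header` and the `PACKET_TYPES` subset are names chosen for this example) decodes the three packets seen above:

```python
import struct

# Subset of Gearman packet type numbers relevant to an idle worker.
PACKET_TYPES = {
    4: "PRE_SLEEP",
    6: "NOOP",
    9: "GRAB_JOB",
    10: "NO_JOB",
    11: "JOB_ASSIGN",
}

def decode_header(data: bytes):
    """Decode a 12-byte Gearman header into (direction, type name, payload size)."""
    if len(data) < 12:
        raise ValueError("a Gearman header is 12 bytes long")
    magic, ptype, size = struct.unpack(">4sII", data[:12])
    if magic not in (b"\0REQ", b"\0RES"):
        raise ValueError(f"unexpected magic: {magic!r}")
    name = PACKET_TYPES.get(ptype, f"UNKNOWN({ptype})")
    return magic[1:].decode("ascii"), name, size

if __name__ == "__main__":
    # Packets copied from the worker trace, in the order they occur.
    trace = [
        b"\0RES\0\0\0\n\0\0\0\0",  # read()
        b"\0REQ\0\0\0\4\0\0\0\0",  # write()
        b"\0REQ\0\0\0\t\0\0\0\0",  # write()
    ]
    for raw in trace:
        direction, name, size = decode_header(raw)
        print(direction, name, f"payload={size}")
```

Read this way, the worker trace shows NO_JOB from the server, then PRE_SLEEP followed immediately by another GRAB_JOB from the worker, which the server again answers with NO_JOB. The worker therefore never blocks waiting for a NOOP wake-up, which is consistent with the busy loop described in this report.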

strace of job server process:

Trace of process 7681 - gearmand -u gearmand --verbose INFO --log-file /var/log/selc/gearmand.log --port 4730 --round-robin                                  
strace: Process 7681 attached                                                                                                                                
epoll_wait(13, [{EPOLLIN, {u32=42, u64=42}}], 32, -1) = 1
recvfrom(42, "\0REQ\0\0\0\t\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(42, 0x7f042003e0e8, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=16, u64=16}}], 32, -1) = 1
read(16, "\4", 256)                     = 1
sendto(42, "\0RES\0\0\0\n\0\0\0\0", 12, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 12
read(16, 0x7f0425666bc0, 256)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=34, u64=34}}], 32, -1) = 1
recvfrom(34, "\0REQ\0\0\0\4\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
recvfrom(34, 0x7f0420020188, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=66, u64=66}}], 32, -1) = 1
recvfrom(66, "\0REQ\0\0\0\t\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(66, 0x7f0420039b08, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=16, u64=16}}], 32, -1) = 1
read(16, "\4", 256)                     = 1
sendto(66, "\0RES\0\0\0\n\0\0\0\0", 12, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 12
read(16, 0x7f0425666bc0, 256)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=50, u64=50}}], 32, -1) = 1
recvfrom(50, "\0REQ\0\0\0\t\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(50, 0x7f0420002ac8, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=16, u64=16}}], 32, -1) = 1
read(16, "\4", 256)                     = 1
sendto(50, "\0RES\0\0\0\n\0\0\0\0", 12, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 12
read(16, 0x7f0425666bc0, 256)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=62, u64=62}}], 32, -1) = 1
recvfrom(62, "\0REQ\0\0\0\t\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(62, 0x7f0420046df8, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=16, u64=16}}], 32, -1) = 1
read(16, "\4", 256)                     = 1
sendto(62, "\0RES\0\0\0\n\0\0\0\0", 12, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 12
read(16, 0x7f0425666bc0, 256)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=54, u64=54}}], 32, -1) = 1
recvfrom(54, "\0REQ\0\0\0\4\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(54, 0x7f04200357a8, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=58, u64=58}}], 32, -1) = 1
recvfrom(58, "\0REQ\0\0\0\t\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12
futex(0x7f042d042f84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f042d042f80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f042d042f58, FUTEX_WAKE_PRIVATE, 1) = 0
recvfrom(58, 0x7f042002f0d8, 8192, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLIN, {u32=16, u64=16}}], 32, -1) = 1

esabol commented 7 years ago

I can confirm this issue. With 2.004.009:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
workmgr   13440  2.1  0.5 223008 22608 ?        S    18:27   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13441 59.8  0.5 227316 21372 ?        S    18:27   0:16 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13442 43.0  0.5 227316 21372 ?        S    18:27   0:11 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13443 43.6  0.5 227316 21372 ?        S    18:27   0:11 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13444 62.0  0.5 227316 21372 ?        S    18:27   0:16 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13445 63.5  0.5 227316 21376 ?        R    18:27   0:17 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl

And with 2.004.008:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
workmgr   13473  4.2  0.5 223032 22628 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13474  0.1  0.5 227332 21388 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13475  0.0  0.5 227332 21392 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13476  0.0  0.5 227332 21392 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13477  0.0  0.5 227332 21396 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13478  0.0  0.5 227332 21396 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl

esabol commented 7 years ago

@p-alik added the wontfix label

Was that intentional? I hope that's a mistake.

p-alik commented 7 years ago

I beg your pardon, @esabol. It was indeed a mistake. The issue is fixed in the upstream branch.

esabol commented 7 years ago

Cool! I'll try to test the upstream branch tomorrow.

esabol commented 7 years ago

Well, I have tested the upstream branch, and I have good news and bad news.

The good news is that it solves the problem with the CPU usage. 10 seconds after starting the workers, ps auxgww shows 0.0% CPU usage.

The bad news is that the workers' response times are at least 25 times slower in my benchmarks. :( This is in comparison to version 2.004.008. (2.004.009 has even worse problems.)

I am using a simple SSL worker that just responds with "pong" when it receives the task "ping". Running 100 iterations of the "ping" task using identical SSL client code:

Benchmark: timing 100 iterations of 47300, 47301...
     47300:  1 wallclock secs ( 0.14 usr +  0.02 sys =  0.16 CPU) @ 625.00/s (n=100)
            (warning: too few iterations for a reliable count)
     47301: 30 wallclock secs ( 0.17 usr +  0.02 sys =  0.19 CPU) @ 526.32/s (n=100)
            (warning: too few iterations for a reliable count)
Done.

Port 47300 is running 2.004.008 SSL. Port 47301 is running github/upstream SSL.
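
A note on reading these numbers: the "@ N/s" rate printed by Benchmark is computed from CPU time (usr + sys), not from wallclock time, which is why the two ports show similar rates despite the large difference in elapsed time. The short Python sketch below simply redoes that arithmetic from the figures above (the variable names are illustrative, and wallclock is only reported in whole seconds, so the per-request latency is approximate):

```python
iterations = 100

# Figures copied from the first benchmark run above.
runs = {
    "47300": {"wallclock_s": 1, "cpu_s": 0.16},   # 2.004.008
    "47301": {"wallclock_s": 30, "cpu_s": 0.19},  # github/upstream
}

for port, run in runs.items():
    cpu_rate = iterations / run["cpu_s"]                    # what Benchmark prints as "@ N/s"
    latency_ms = 1000 * run["wallclock_s"] / iterations     # elapsed time per request
    print(f"{port}: {cpu_rate:.2f}/s by CPU time, ~{latency_ms:.0f} ms per request by wallclock")

slowdown = runs["47301"]["wallclock_s"] / runs["47300"]["wallclock_s"]
print(f"wallclock ratio: ~{slowdown:.0f}x")
```

The wallclock column is the one that reflects worker response time; the CPU-based rate mostly measures the client's own overhead.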

If I revert the workers connected to the gearmand on port 47301 to 2.004.008, and re-run my benchmarks:

Benchmark: timing 100 iterations of 47300, 47301...
     47300:  0 wallclock secs ( 0.06 usr +  0.01 sys =  0.07 CPU) @ 1428.57/s (n=100)
            (warning: too few iterations for a reliable count)
     47301:  0 wallclock secs ( 0.05 usr +  0.01 sys =  0.06 CPU) @ 1666.67/s (n=100)
            (warning: too few iterations for a reliable count)
Done.

I would call this an unacceptable performance regression.

p-alik commented 7 years ago

Thank you, @esabol. I moved your benchmark results into a separate issue: #29

p-alik commented 6 years ago

@silent-at-gh, the bug fix will be released to CPAN with v2.004.010 soon.

p-alik commented 6 years ago

v2.004.010 was uploaded to CPAN.

p-alik / perl-Gearman

Gearman workers intensively hog CPU while idling (instead of sleeping) #28