p-alik / perl-Gearman

9 stars 10 forks source link

Gearman workers intensively hog CPU while idling (instead of sleeping) #28

Closed silent-at-gh closed 6 years ago

silent-at-gh commented 7 years ago

Hi, the issue has been observed aprox. starting from the middle of Oct 2017 (I assume after update of module on CPAN from 2.004.008->2.004.009).

System info:

[silent@vbox ~]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
[silent@vbox ~]$ uname -a
Linux vbox.intra 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@dev_gearman-worker:~]# perl -MGearman::Worker -we 'print($Gearman::Worker::VERSION, $/);'
2.004.009

STR:

Observed result:

Expected result:

Workaround:

Additional info

esabol commented 7 years ago

I can confirm this issue. With 2.004.009:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
workmgr   13440  2.1  0.5 223008 22608 ?        S    18:27   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13441 59.8  0.5 227316 21372 ?        S    18:27   0:16 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13442 43.0  0.5 227316 21372 ?        S    18:27   0:11 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13443 43.6  0.5 227316 21372 ?        S    18:27   0:11 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13444 62.0  0.5 227316 21372 ?        S    18:27   0:16 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13445 63.5  0.5 227316 21376 ?        R    18:27   0:17 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl

And with 2.004.008:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
workmgr   13473  4.2  0.5 223032 22628 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13474  0.1  0.5 227332 21388 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13475  0.0  0.5 227332 21392 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13476  0.0  0.5 227332 21392 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13477  0.0  0.5 227332 21396 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
workmgr   13478  0.0  0.5 227332 21396 ?        S    18:28   0:00 /usr/bin/perl -T -w /path/to/my_worker_ssl_dev.pl
esabol commented 7 years ago

@p-alik added the wontfix label

Was that intentional? I hope that's a mistake.

p-alik commented 7 years ago

I beg your pardon, @esabol. Indeed it was a mistake. The issue is done in upstream.

esabol commented 7 years ago

Cool! I'll try to test the upstream branch tomorrow.

esabol commented 7 years ago

Well, I have tested the upstream branch, and I have good news and bad news.

The good news is that it solves the problem with the CPU usage. 10 seconds after starting the workers, ps auxgww shows 0.0% CPU usage.

The bad news is that the workers' response times are at least 25 times slower in my benchmarks. :( This is in comparison to version 2.004.008. (2.004.009 has even worse problems.)

I am using a simple SSL worker that just responds with "pong" when it receives the task "ping". Doing 100 iterations of running "ping" task using the identical SSL client code:

Benchmark: timing 100 iterations of 47300, 47301...
     47300:  1 wallclock secs ( 0.14 usr +  0.02 sys =  0.16 CPU) @ 625.00/s (n=100)
            (warning: too few iterations for a reliable count)
     47301: 30 wallclock secs ( 0.17 usr +  0.02 sys =  0.19 CPU) @ 526.32/s (n=100)
            (warning: too few iterations for a reliable count)
Done.

Port 47300 is running 2.004.008 SSL. Port running 47301 is github/upstream SSL.

If I revert the workers connected to the gearmand on port 47301 to 2.004.008, and re-run my benchmarks:

Benchmark: timing 100 iterations of 47300, 47301...
     47300:  0 wallclock secs ( 0.06 usr +  0.01 sys =  0.07 CPU) @ 1428.57/s (n=100)
            (warning: too few iterations for a reliable count)
     47301:  0 wallclock secs ( 0.05 usr +  0.01 sys =  0.06 CPU) @ 1666.67/s (n=100)
            (warning: too few iterations for a reliable count)
Done.

I would call this an unacceptable performance regression.

p-alik commented 7 years ago

Thank you, @esabol. I separated result of your benchmark in an other issue: #29

p-alik commented 6 years ago

@silent-at-gh, bug fixing will be released to cpan with v2.004.010 soon.

p-alik commented 6 years ago

v2.004.010 was uploaded to CPAN.