100% CPU usage on Linux (epoll selector bug)

GoogleCodeExporter commented 9 years ago

What version of the product are you using? On what operating system?
spymemcached-2.8.4 on Ubuntu 12.04 Linux (64bit).

We are running a web application in a Tomcat environment and use the 
memcached-session-manager.
Since switching to the memcached-session-manager we observe a high CPU load 
on a totally idle system (no requests, no jobs, nada).
A thread dump told me that this was the only active thread:
"Memcached IO over {MemcachedConnection to /xxx.xxx.xxx.xxx:11211/xxx.xxx.xxx.xxx:11211}" prio=10 tid=0x00007fc7dd095000 nid=0x491b runnable [0x00007fc6dc080000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x0000000750928190> (a sun.nio.ch.Util$2)
        - locked <0x00000007509281a8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x0000000750946098> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:217)
        at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:836)

Then I started searching the internet for "java selector epollwait" and found 
some interesting articles about the epoll selector bug.

The next step, since a lot of hints say "update your JVM", was updating the JVM. 
But neither 1.7.0_21 nor 1.6.0_43 changed the situation; the CPU load stayed high.

This article I think explains best what needs to be done in 
net.spy.memcached.MemcachedConnection.handleIO
to fix the issue:
https://issues.apache.org/jira/browse/DIRMINA-678
especially this comment:
"FYI, the problem was that the select() method could return immediately, but 
with a 0 value, in some specific cases. Sadly, in this case, we just loop and 
do a select() again, which returns 0 immediately, etc (so the 100% CPU). The 
workaround was to compute the time we spent on the select(), and if under 
100ms, when it returns 0, we consider that the selector is FU, so we kill it, 
create a new one, and copy all the selectionKeys in the new selector."
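
The workaround described in that comment could be sketched roughly as follows. This is a minimal illustration, not the MINA patch and not spymemcached code: the class `SelectorSpinGuard` and its method names are hypothetical, and the 100 ms threshold is simply taken from the quote. Note that `select()` can also legitimately return 0 early after `wakeup()` or an interrupt, so a real fix would probably want to see several early returns in a row before rebuilding.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of spymemcached or MINA.
final class SelectorSpinGuard {

    // Assumption: threshold taken from the quoted comment (100 ms).
    static final long SPIN_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(100);

    // True if select() returned no keys well before it could have timed out.
    static boolean isPrematureReturn(int selectedKeys, long elapsedNanos, long timeoutMillis) {
        long limit = SPIN_THRESHOLD_NANOS;
        if (timeoutMillis > 0) {
            // A timeout shorter than the threshold is a legitimate reason to return early.
            limit = Math.min(limit, TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
        }
        return selectedKeys == 0 && elapsedNanos < limit;
    }

    // Opens a new selector, moves every valid registration to it, closes the old one.
    static Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (!key.isValid()) {
                continue; // already cancelled or channel closed
            }
            SelectableChannel channel = key.channel();
            int ops = key.interestOps();
            Object attachment = key.attachment();
            key.cancel();
            channel.register(fresh, ops, attachment);
        }
        old.close();
        return fresh;
    }

    // One guarded select; returns the selector the caller should keep using.
    static Selector selectOnce(Selector selector, long timeoutMillis) throws IOException {
        long start = System.nanoTime();
        int selected = selector.select(timeoutMillis);
        long elapsed = System.nanoTime() - start;
        if (isPrematureReturn(selected, elapsed, timeoutMillis)) {
            return rebuild(selector);
        }
        return selector;
    }
}
```

An IO loop would call `selector = SelectorSpinGuard.selectOnce(selector, timeout)` in place of a plain `select(timeout)` and then process `selector.selectedKeys()` as before.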

Does anybody observe the same behaviour?
And would it be possible to improve the handleIO method to handle that 
situation?

Sadly, providing a patch myself is beyond my skills.

Original issue reported on code.google.com by phoe...@gmail.com on 7 Jun 2013 at 1:25

GoogleCodeExporter commented 9 years ago

Okay, so being in epollWait is not an issue per se, because it's waiting there 
to get notification. We expect the IO thread to be alive, but it shouldn't 
consume 100% CPU.

Can you do a profile of the system while running and give us a dump on what 
actually causes 100% CPU? Just because the thread is alive doesn't mean it's 
consuming it completely.

Let me know,
thanks
Michael

Original comment by michael....@gmail.com on 7 Jun 2013 at 1:31

GoogleCodeExporter commented 9 years ago

Which tool would be the easiest way to do that?
Right now I just peek into the VM with jstatd and jvisualvm. There is no console 
output, but in the threads table the memcached threads never sleep or wait:

Thread                                                              run                 sleeping     wait         monitor      total
Memcached IO over {MemcachedConnection to /xxx.xxx.xxx.xxx:11211}   3:34.215 (100.0%)   0.0 (0.0%)   0.0 (0.0%)   0.0 (0.0%)   3:34.215
Memcached IO over {MemcachedConnection to /xxx.xxx.xxx.xxx:11211}   3:34.215 (100.0%)   0.0 (0.0%)   0.0 (0.0%)   0.0 (0.0%)   3:34.215
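
For reference, a thread blocked in native IO such as epollWait is still reported in state RUNNABLE, so a "running" time of 100% in a thread-state table does not by itself show CPU consumption. Actual per-thread CPU time can be read through the standard `ThreadMXBean`. The sketch below is only an illustration: the class name `ThreadCpuSampler` is made up, and it reports on the JVM it runs in, so it would have to be run inside the Tomcat JVM (or adapted to a remote JMX connection) to say anything about the memcached IO threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class ThreadCpuSampler {

    // Maps "thread name #id" to the CPU time that thread has consumed, in milliseconds.
    static Map<String, Long> cpuMillisByThread() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Long> result = new LinkedHashMap<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return result; // this JVM cannot report per-thread CPU time
        }
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id);   // -1 if the thread has died
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread has died
            if (nanos >= 0 && info != null) {
                result.put(info.getThreadName() + " #" + id, nanos / 1_000_000L);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> entry : cpuMillisByThread().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue() + " ms CPU");
        }
    }
}
```

Taking two samples a few seconds apart and comparing them shows which threads are really burning CPU: a thread that is merely parked in epollWait should show almost no increase.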

Original comment by phoe...@gmail.com on 7 Jun 2013 at 2:45

GoogleCodeExporter commented 9 years ago

One more thing:
When I switch the web application to the standard session manager (no memcached 
access), CPU load is, as expected, almost zero when the application is idle.

And I do not observe the behavior on my Windows machine, so the bug seems to 
depend on the OS-specific native code of the Oracle VM.

Original comment by phoe...@gmail.com on 10 Jun 2013 at 7:47

GoogleCodeExporter commented 9 years ago

OK, I was able to solve the problem.
The CPU load is caused by the VM itself. After further profiling I found out 
that epollWait caused the high CPU load due to a "bad" garbage collection 
setting for the VM. -XX:CMSInitiatingOccupancyFraction was set to 50, which led 
to many GC calls, which in fact caused a high CPU load in epollWait. (Why this 
happens I don't know; it could have many causes.) Setting the value to 80 
made the high CPU load disappear. I was misled by the fact that when I 
disabled the memcached client the CPU load dropped too.
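
One way to check whether garbage collection is running unusually often is to compare the standard `GarbageCollectorMXBean` counters over a short interval. The following is a rough sketch under assumptions: the class name `GcActivityProbe` and the 5 second interval are arbitrary, and the code measures the JVM it runs in, so it would need to run inside the affected VM to be meaningful.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only.
final class GcActivityProbe {

    // Returns {total collection count, total collection time in ms} over all collectors.
    static long[] snapshot() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, millis};
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 5000; // arbitrary sampling interval
        long[] before = snapshot();
        Thread.sleep(intervalMillis);
        long[] after = snapshot();
        System.out.println("collections: " + (after[0] - before[0])
                + ", time in GC: " + (after[1] - before[1]) + " ms"
                + " during " + intervalMillis + " ms");
    }
}
```

On an idle system, a steadily growing collection count would point at the GC settings rather than at the selector loop.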

Please close the issue.

Original comment by phoe...@gmail.com on 17 Jun 2013 at 1:36

GoogleCodeExporter commented 9 years ago

Original comment by ingen...@gmail.com on 17 Jun 2013 at 1:58

Changed state: Invalid
