IO processing thread dies

GoogleCodeExporter commented 9 years ago

What version of the product are you using? On what operating system?
2.7

Tell me more...
IO Processing thread dies, and nothing else ever happens again, all threads 
that use spy end up stuck on 
net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:
290)

Not seeing any exceptions in the logs, but the MemcachedClient.run() method 
must be getting an unhandled exception that's either a RuntimeException or 
Throwable.

Best practice for Threads is to catch everything and anything and log it in the 
run() method, or to set an UncaughtException handler and have it do the 
logging. I think in this situation, that the run() method should try to 
continue no matter what until shutdown, and log the exceptions.

Also should put the getLogger().info("Shut down memcached client"); message in 
a finally{} block as that would have provided some logging insight as to what's 
going on.

Original issue reported on code.google.com by jonat...@gmail.com on 29 Jul 2011 at 4:26

GoogleCodeExporter commented 9 years ago

Looks like the Thread logic that was in MemcachedClient is now part of 
MemcachedConnection, and the same flaw exists there.

Original comment by jonat...@gmail.com on 29 Jul 2011 at 4:48

GoogleCodeExporter commented 9 years ago

workaround: 
client = new MemcachedClient(cf, AddrUtil.getAddresses(hostList)) {
   private volatile boolean running = false;
   @Override public void run() {
      while (running) {
         try {
            super.run();
         }
         catch (Throwable t) {
            getLogger().warn(t);
         }
      }
   }
   @Override public boolean shutdown(long timeout, TimeUnit unit) {
      try {
         return super.shutdown(timeout, unit);
      }
      finally {
         running = false;
      }
   }
};

Original comment by jonat...@gmail.com on 29 Jul 2011 at 7:10

GoogleCodeExporter commented 9 years ago

should be running = true; on line 2 of the workaround :)

Original comment by jonat...@gmail.com on 29 Jul 2011 at 10:20

GoogleCodeExporter commented 9 years ago

I think this may have been addressed in some recent contributed patches.

In particular, these two:
https://github.com/dustin/java-memcached-client/commit/02d2f3e05f7100b09c99f58c8
5cf3d41ee80bbb3
https://github.com/dustin/java-memcached-client/commit/a9035c9e8243aec93d9516c88
f33f6e62d4020c7

... but there's not enough info to say for sure.  Note these fixes are in the 
just released 2.7.3 and will be part of the upcoming 2.8.

The point about trying to keep going is a good one.  Going to mark this 
accepted for that reason.  May not get to it for a while, but if you'd like to 
flesh it out and contribute a patch, we'd be glad to take it.

Original comment by ingen...@gmail.com on 16 Oct 2011 at 2:11

Changed state: Accepted
Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

I met the same problem when OutOfMemoryError was seen in logs. Some events 
happened in sequence:
1) Server load became very heavy and memory sharply shrunk.
2) Memcached thread throws OutOfMemoryError and died.
3) More operations are added to input queue but are never consumed.

The risk is catastrophic, isn't it? Note that worker threads are unaware of 
memcached dead event. Only one symptom is, application becomes very slow 
because get/put are full sooner or later. So could we put this issue into 
higher category and fix it in V2.8?

Original comment by smilingai2004@gmail.com on 8 Dec 2011 at 7:02

GoogleCodeExporter commented 9 years ago

Just to provide an update on this (sorry for the long delay), my current 
thought is that the best thing to do is to drop all connections, reestablish 
and restart the IO thread.  The concern is that we don't want to end up 
corrupting data, and getting back to a safe state is important. 

w.r.t. the comment from smilingai2004, the default behavior with the input 
queue is to immediately timeout operations if it's full.  This does mean there 
can be a lot of memory consumed.  This is still bad, of course.

Original comment by ingen...@gmail.com on 20 Jul 2012 at 7:00

GoogleCodeExporter commented 9 years ago

A proposed fix for this is posted here:
http://review.couchbase.org/#change,18693

We'd appreciate any comments on the change set.

Original comment by ingen...@gmail.com on 30 Jul 2012 at 9:28

GoogleCodeExporter commented 9 years ago

I added this fix (http://review.couchbase.org/#change,18693) to the code, and 
see the exception in our log (the magic number in response is 83-0x53, instead 
of 0x81). 

net.spy.memcached.MemcachedConnection       Problem handling memcached IO   
java.lang.AssertionError: Invalid magic:  83
        at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:134)
        at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:458)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:376)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:236)
        at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:830)

Original comment by sunny201...@gmail.com on 27 Aug 2012 at 6:33

GoogleCodeExporter commented 9 years ago

Invalid magic would be rather bad. That indicates some kind of partial read or 
that something went wrong part way through a read.  Or something bad on the 
server side.  What was the server in question?

We'd like to get that fix reviewed in, since it may make us much more reliable.

Original comment by ingen...@gmail.com on 27 Aug 2012 at 7:07

GoogleCodeExporter commented 9 years ago

We are using memcached 1.4.14.
Client application is running on ubuntu.

Original comment by sunny201...@gmail.com on 27 Aug 2012 at 11:26

roc230 / spymemcached

IO processing thread dies #191