stevenkang / spymemcached

Automatically exported from code.google.com/p/spymemcached
0 stars 0 forks source link

NPE preventing countdown latch from being counted down #166

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I'm using memcached 2.6rc1 on OsX. While load testing the system, I encountered 
a bunch of stack traces:

java.lang.NullPointerException
        at net.spy.memcached.MemcachedClient$6.receivedStatus(MemcachedClient.java:1007)
        at net.spy.memcached.ops.MultiOperationCallback.complete(MultiOperationCallback.java:32)
        at net.spy.memcached.protocol.BaseOperationImpl.transitionState(BaseOperationImpl.java:111)
        at net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:172)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:163)
        at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:365)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:327)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
        at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)

This prevented the latch from being counted down, and therefore my threads were 
just hanging forever (we were using the default BulkGetFuture#get, which is 
using Long.MAX_VALUE as its latch wait time).

Thanks,
Jessica

Original issue reported on code.google.com by mewmewb...@gmail.com on 11 Mar 2011 at 10:54

GoogleCodeExporter commented 9 years ago
In 2.6rc1 there was an issue that would cause cancelled operations to block 
things.  This could be related to that.  Will investigate.

Original comment by ingen...@gmail.com on 4 May 2011 at 10:56

GoogleCodeExporter commented 9 years ago

Original comment by ingen...@gmail.com on 4 May 2011 at 10:57

GoogleCodeExporter commented 9 years ago
an NPE thrown out of the MemcachedClient.run() method will cause the memcached 
IO thread to die and spy will never succesfully perform another operation 
again. I logged issue 191 to get that exception handling cleaned up. I will 
post my workaround there.

Original comment by jonat...@gmail.com on 29 Jul 2011 at 7:06

GoogleCodeExporter commented 9 years ago

Original comment by mikewie...@gmail.com on 7 Sep 2011 at 12:49

GoogleCodeExporter commented 9 years ago
I just took a look at this issue and it was fixed by my change to add user 
level access to OperationStatus. The issue here was that when operations were 
cancelled or timed out that the complete() function could be called without 
calling the recievedStatus() function. This cause the status to be null in some 
cases.

Please re-open this issue if the problem still persists when using a release 
greater than 2.7.

Original comment by mikewie...@gmail.com on 13 Sep 2011 at 11:25

GoogleCodeExporter commented 9 years ago
using 2.8.0,the problem still persists

[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 INFO 
net.spy.memcached.MemcachedConnection:  Reconnecting due to exception on {QA 
sa=MC1/memcache_host:11211, #Rops=0, #Wops=331
, #iq=0, topRop=null, 
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@196c1cb4, toWrite=0, 
interested=4}
[2012-02-23 12:37:06.848]java.lang.NullPointerException
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.MemcachedClient$6.receivedStatus(MemcachedClient.java:1007)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.ops.MultiOperationCallback.complete(MultiOperationCallback.jav
a:32)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.protocol.BaseOperationImpl.transitionState(BaseOperationImpl.j
ava:112)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:1
73)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNode
Impl.java:165)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:365)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:327)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
[2012-02-23 12:37:06.848]   at 
net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)
[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 WARN 
net.spy.memcached.MemcachedConnection:  Closing, and reopening {QA 
sa=MC1/10.10.32.163:11211, #Rops=0, #Wops=331, #iq=0, t
opRop=null, topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@196c1cb4, 
toWrite=0, interested=4}, attempt 0.
[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 INFO 
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl:  No buffer for current 
write op, removing

there is the other error before this  error

[2012-02-23 12:36:43.980]java.lang.IllegalArgumentException: Operation has 
already timed out; ttl specified would allow it to be valid.
[2012-02-23 12:36:43.981]   at 
net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:1
79)
[2012-02-23 12:36:43.981]   at 
net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNode
Impl.java:165)
[2012-02-23 12:36:43.981]   at 
net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:365)
[2012-02-23 12:36:43.995]   at 
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:327)
[2012-02-23 12:36:43.996]   at 
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
[2012-02-23 12:36:43.996]   at 
net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)
[2012-02-23 12:36:43.996]2012-02-23 12:36:43.996 WARN 
net.spy.memcached.MemcachedConnection:  Closing, and reopening {QA 
sa=MC1/memcache_host:11211, #Rops=0, #Wops=1, #iq=2, top
Rop=null, topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@54167222, 
toWrite=0, interested=4}, attempt 0.

Original comment by oyh.da...@gmail.com on 23 Feb 2012 at 6:23

GoogleCodeExporter commented 9 years ago
I have got this problem for a few days with spymemcached 2.8.0 too. And version 
2.5 may be the best choice now. 

Original comment by matrix3...@gmail.com on 27 Feb 2012 at 4:08

GoogleCodeExporter commented 9 years ago
I think I have figured out what is happening here.
Three things need to be true
1. Client needs to call OperationFuture.get(long time, TimeUnit timeunit) on an 
operation.
2. The get(time,timeunit) call needs to "timeout" before the I/O thread has a 
chance to call getNextWritableOp() on TCPMemcachedNodeImpl.
3. The global default operation timeout needs to be greater than the timeout 
specified in the OperationFuture.ge(time, timeunit) call.

This will basically kill the MemcachedClient because the operation gets "stuck" 
in the operation queue and basically kills the connection.

The problematic code is in 
TCPMemcachedNodeImpl.getNextWritableOp()

Assumption:
 A Operation has been added to the operation queue, but the OperationFuture.get(time, timeunit) has timed out and called timeOut() on the Operation before the operation is processed by getNextWriteableOp().

The code gets the currentWriteOp()
The code check to see if it is WRITE_QUEUED. (Which it will be even if 
.timeOut() has been called on the operation)
Then it checks to see if it was cancelled()  (It wasn't  it just timed out)
Then it calls Operation.isTimeOut(long timeout) which checks to see if the 
"global timeout" has been reached, which is hasn't, but the Operation is timed 
out so we throw a IllegalStateException. Which causes the I/O thread to 
re-connect!  But since the operation is already in the writeQ it's TTL is never 
reset so no matter how many times we reconnect the bogus operation is still 
there and makes the MemcachedNode useless.

I think a work around is to call .cancel on the OperationFuture if it times 
out. This should cause at most one reconnect.

I think the real solution is to not have Operation.isTimeOut(long timeout) 
throw the illegal state exception or
have getNextWriteableOp() check to see if the operation isTimedOut() before 
calling .isTimeOut(long timeout) (I think these are basically the same thing)

Original comment by tedoc2...@gmail.com on 2 May 2012 at 12:50

GoogleCodeExporter commented 9 years ago

Original comment by ingen...@gmail.com on 2 May 2012 at 12:53

GoogleCodeExporter commented 9 years ago
I've looked through this, and it seems in the current codebase the 'real 
solution' you recommend is in there, or I've misread what you're recommending.  
Is there any chance you can publish a diff of what you're thinking?

Original comment by ingen...@gmail.com on 19 Aug 2012 at 8:25

GoogleCodeExporter commented 9 years ago
Hrm.. I don't see that solution.. The code looks the same to me as when I filed 
my comment..

There is still no check AFAICT to make sure that the IllegalStateException is 
not thrown when the operation has been timed-out via OperationFuture.get(time. 
timeunit), but before the operation queue can get to it.

Here is the diff I propose:

diff --git a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java 
b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
index 6733c9d..f4f24c8 100644
--- a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
+++ b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
@@ -234,7 +234,7 @@ public abstract class TCPMemcachedNodeImpl extends 
SpyObject implements
           getLogger().debug("Not writing cancelled op.");
           Operation cancelledOp = removeCurrentWriteOp();
           assert o == cancelledOp;
-        } else if (o.isTimedOut(defaultOpTimeout)) {
+        } else if (o.isTimedOut() || o.isTimedOut(defaultOpTimeout)) {
           getLogger().debug("Not writing timed out op.");
           Operation timedOutOp = removeCurrentWriteOp();
           assert o == timedOutOp;

Original comment by tedoc2...@gmail.com on 20 Aug 2012 at 2:55

GoogleCodeExporter commented 9 years ago
We just ran into this and it was nasty. Can tedoc2...'s diff please be merged 
in?

Original comment by shik...@schmizz.net on 4 Jan 2013 at 8:06

GoogleCodeExporter commented 9 years ago
I'm using couchbase-client-1.1.0 on top of spymemcached-2.8.9 and think I just 
ran into a similar issue.  It's not caused by the NPE, but by an 
IllegalArgumentException being thrown. It's the same IllegalArgumentException 
referenced by oyh.da.  

From the logs, it seems like first this exception is being thrown:

Exception in thread "Memcached IO over {MemcachedConnection to 
server1.blah.com/10.1.128.1:11210 server2.blah.com/10.1.128.1:11210 
server3.blah.com/10.1.128.1:11210}" java.lang.IllegalArgumentException: 
Operation has already timed out; ttl specified would allow it to be valid.
        at net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:215)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.getNextWritableOp(TCPMemcachedNodeImpl.java:235)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:211)
        at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:424)
        at net.spy.memcached.MemcachedConnection.handleInputQueue(MemcachedConnection.java:312)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:207)
        at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:229)

and many IllegalStateException's follow that, seemingly every call to the 
MemcachedClient. 

java.lang.IllegalStateException: Timed out waiting to add Cmd: 1 Opaque: 
23782133 Key: blah1234 Cas: 0 Exp: 3600 Flags: 0 Data Length: 47(max 
wait=250ms)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:344)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:700)
        at com.couchbase.client.CouchbaseConnection.addOperation(CouchbaseConnection.java:188)
        at net.spy.memcached.MemcachedConnection.enqueueOperation(MemcachedConnection.java:642)
        at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:301)
        at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:704)

I'm using a couchbase cluster with 12 nodes and approximately 12 clients and 
peak around 10,000 requests/sec on the cluster. The connection info in the 
error above is changed to be simplified. 

From the comments above it was suggested that this may be related to the values 
of the 'global' timeout value as compared to the values passed on the 
OperationFuture.  When creating the CouchbaseConnection, I specify a value of 
250 millis for the "OpQueueMaxBlockTime" value.  By default it was 10-seconds 
and this was causing our threads to hang for too long when the client got 
backed up, so I changed it to 250 millis.  On some of the OperationFuture's 
returned, 50 millis are being passed as the timeout value.  I haven't specified 
a value for the OpTimeout, so it should be the default 2500 millis. 

Is the problem expected to be solved already? Unrelated? Issues 191,192, and 
258 also seemed related to this issue, but I chose this thread because the root 
exception was referenced here.  The general problem seems to be that exceptions 
are being thrown that are causing the I/O processing thread to die. Would 
changing the timeout values around so that the value passed to the 
OperationFuture be at least as high as the global timeout value be expected to 
stop this issue from happening?

Original comment by mvand...@gmail.com on 15 Aug 2013 at 2:06