Open GoogleCodeExporter opened 9 years ago
In 2.6rc1 there was an issue that would cause cancelled operations to block
things. This could be related to that. Will investigate.
Original comment by ingen...@gmail.com
on 4 May 2011 at 10:56
Original comment by ingen...@gmail.com
on 4 May 2011 at 10:57
an NPE thrown out of the MemcachedClient.run() method will cause the memcached
IO thread to die and spy will never succesfully perform another operation
again. I logged issue 191 to get that exception handling cleaned up. I will
post my workaround there.
Original comment by jonat...@gmail.com
on 29 Jul 2011 at 7:06
Original comment by mikewie...@gmail.com
on 7 Sep 2011 at 12:49
I just took a look at this issue and it was fixed by my change to add user
level access to OperationStatus. The issue here was that when operations were
cancelled or timed out that the complete() function could be called without
calling the recievedStatus() function. This cause the status to be null in some
cases.
Please re-open this issue if the problem still persists when using a release
greater than 2.7.
Original comment by mikewie...@gmail.com
on 13 Sep 2011 at 11:25
using 2.8.0,the problem still persists
[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 INFO
net.spy.memcached.MemcachedConnection: Reconnecting due to exception on {QA
sa=MC1/memcache_host:11211, #Rops=0, #Wops=331
, #iq=0, topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@196c1cb4, toWrite=0,
interested=4}
[2012-02-23 12:37:06.848]java.lang.NullPointerException
[2012-02-23 12:37:06.848] at
net.spy.memcached.MemcachedClient$6.receivedStatus(MemcachedClient.java:1007)
[2012-02-23 12:37:06.848] at
net.spy.memcached.ops.MultiOperationCallback.complete(MultiOperationCallback.jav
a:32)
[2012-02-23 12:37:06.848] at
net.spy.memcached.protocol.BaseOperationImpl.transitionState(BaseOperationImpl.j
ava:112)
[2012-02-23 12:37:06.848] at
net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:1
73)
[2012-02-23 12:37:06.848] at
net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNode
Impl.java:165)
[2012-02-23 12:37:06.848] at
net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:365)
[2012-02-23 12:37:06.848] at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:327)
[2012-02-23 12:37:06.848] at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
[2012-02-23 12:37:06.848] at
net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)
[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 WARN
net.spy.memcached.MemcachedConnection: Closing, and reopening {QA
sa=MC1/10.10.32.163:11211, #Rops=0, #Wops=331, #iq=0, t
opRop=null, topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@196c1cb4,
toWrite=0, interested=4}, attempt 0.
[2012-02-23 12:37:06.848]2012-02-23 12:37:06.848 INFO
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: No buffer for current
write op, removing
there is the other error before this error
[2012-02-23 12:36:43.980]java.lang.IllegalArgumentException: Operation has
already timed out; ttl specified would allow it to be valid.
[2012-02-23 12:36:43.981] at
net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:1
79)
[2012-02-23 12:36:43.981] at
net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNode
Impl.java:165)
[2012-02-23 12:36:43.981] at
net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:365)
[2012-02-23 12:36:43.995] at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:327)
[2012-02-23 12:36:43.996] at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:199)
[2012-02-23 12:36:43.996] at
net.spy.memcached.MemcachedClient.run(MemcachedClient.java:1622)
[2012-02-23 12:36:43.996]2012-02-23 12:36:43.996 WARN
net.spy.memcached.MemcachedConnection: Closing, and reopening {QA
sa=MC1/memcache_host:11211, #Rops=0, #Wops=1, #iq=2, top
Rop=null, topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@54167222,
toWrite=0, interested=4}, attempt 0.
Original comment by oyh.da...@gmail.com
on 23 Feb 2012 at 6:23
I have got this problem for a few days with spymemcached 2.8.0 too. And version
2.5 may be the best choice now.
Original comment by matrix3...@gmail.com
on 27 Feb 2012 at 4:08
I think I have figured out what is happening here.
Three things need to be true
1. Client needs to call OperationFuture.get(long time, TimeUnit timeunit) on an
operation.
2. The get(time,timeunit) call needs to "timeout" before the I/O thread has a
chance to call getNextWritableOp() on TCPMemcachedNodeImpl.
3. The global default operation timeout needs to be greater than the timeout
specified in the OperationFuture.ge(time, timeunit) call.
This will basically kill the MemcachedClient because the operation gets "stuck"
in the operation queue and basically kills the connection.
The problematic code is in
TCPMemcachedNodeImpl.getNextWritableOp()
Assumption:
A Operation has been added to the operation queue, but the OperationFuture.get(time, timeunit) has timed out and called timeOut() on the Operation before the operation is processed by getNextWriteableOp().
The code gets the currentWriteOp()
The code check to see if it is WRITE_QUEUED. (Which it will be even if
.timeOut() has been called on the operation)
Then it checks to see if it was cancelled() (It wasn't it just timed out)
Then it calls Operation.isTimeOut(long timeout) which checks to see if the
"global timeout" has been reached, which is hasn't, but the Operation is timed
out so we throw a IllegalStateException. Which causes the I/O thread to
re-connect! But since the operation is already in the writeQ it's TTL is never
reset so no matter how many times we reconnect the bogus operation is still
there and makes the MemcachedNode useless.
I think a work around is to call .cancel on the OperationFuture if it times
out. This should cause at most one reconnect.
I think the real solution is to not have Operation.isTimeOut(long timeout)
throw the illegal state exception or
have getNextWriteableOp() check to see if the operation isTimedOut() before
calling .isTimeOut(long timeout) (I think these are basically the same thing)
Original comment by tedoc2...@gmail.com
on 2 May 2012 at 12:50
Original comment by ingen...@gmail.com
on 2 May 2012 at 12:53
I've looked through this, and it seems in the current codebase the 'real
solution' you recommend is in there, or I've misread what you're recommending.
Is there any chance you can publish a diff of what you're thinking?
Original comment by ingen...@gmail.com
on 19 Aug 2012 at 8:25
Hrm.. I don't see that solution.. The code looks the same to me as when I filed
my comment..
There is still no check AFAICT to make sure that the IllegalStateException is
not thrown when the operation has been timed-out via OperationFuture.get(time.
timeunit), but before the operation queue can get to it.
Here is the diff I propose:
diff --git a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
index 6733c9d..f4f24c8 100644
--- a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
+++ b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
@@ -234,7 +234,7 @@ public abstract class TCPMemcachedNodeImpl extends
SpyObject implements
getLogger().debug("Not writing cancelled op.");
Operation cancelledOp = removeCurrentWriteOp();
assert o == cancelledOp;
- } else if (o.isTimedOut(defaultOpTimeout)) {
+ } else if (o.isTimedOut() || o.isTimedOut(defaultOpTimeout)) {
getLogger().debug("Not writing timed out op.");
Operation timedOutOp = removeCurrentWriteOp();
assert o == timedOutOp;
Original comment by tedoc2...@gmail.com
on 20 Aug 2012 at 2:55
We just ran into this and it was nasty. Can tedoc2...'s diff please be merged
in?
Original comment by shik...@schmizz.net
on 4 Jan 2013 at 8:06
I'm using couchbase-client-1.1.0 on top of spymemcached-2.8.9 and think I just
ran into a similar issue. It's not caused by the NPE, but by an
IllegalArgumentException being thrown. It's the same IllegalArgumentException
referenced by oyh.da.
From the logs, it seems like first this exception is being thrown:
Exception in thread "Memcached IO over {MemcachedConnection to
server1.blah.com/10.1.128.1:11210 server2.blah.com/10.1.128.1:11210
server3.blah.com/10.1.128.1:11210}" java.lang.IllegalArgumentException:
Operation has already timed out; ttl specified would allow it to be valid.
at net.spy.memcached.protocol.BaseOperationImpl.isTimedOut(BaseOperationImpl.java:215)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.getNextWritableOp(TCPMemcachedNodeImpl.java:235)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:211)
at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:424)
at net.spy.memcached.MemcachedConnection.handleInputQueue(MemcachedConnection.java:312)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:207)
at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:229)
and many IllegalStateException's follow that, seemingly every call to the
MemcachedClient.
java.lang.IllegalStateException: Timed out waiting to add Cmd: 1 Opaque:
23782133 Key: blah1234 Cas: 0 Exp: 3600 Flags: 0 Data Length: 47(max
wait=250ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:344)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:700)
at com.couchbase.client.CouchbaseConnection.addOperation(CouchbaseConnection.java:188)
at net.spy.memcached.MemcachedConnection.enqueueOperation(MemcachedConnection.java:642)
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:301)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:704)
I'm using a couchbase cluster with 12 nodes and approximately 12 clients and
peak around 10,000 requests/sec on the cluster. The connection info in the
error above is changed to be simplified.
From the comments above it was suggested that this may be related to the values
of the 'global' timeout value as compared to the values passed on the
OperationFuture. When creating the CouchbaseConnection, I specify a value of
250 millis for the "OpQueueMaxBlockTime" value. By default it was 10-seconds
and this was causing our threads to hang for too long when the client got
backed up, so I changed it to 250 millis. On some of the OperationFuture's
returned, 50 millis are being passed as the timeout value. I haven't specified
a value for the OpTimeout, so it should be the default 2500 millis.
Is the problem expected to be solved already? Unrelated? Issues 191,192, and
258 also seemed related to this issue, but I chose this thread because the root
exception was referenced here. The general problem seems to be that exceptions
are being thrown that are causing the I/O processing thread to die. Would
changing the timeout values around so that the value passed to the
OperationFuture be at least as high as the global timeout value be expected to
stop this issue from happening?
Original comment by mvand...@gmail.com
on 15 Aug 2013 at 2:06
Original issue reported on code.google.com by
mewmewb...@gmail.com
on 11 Mar 2011 at 10:54