Open GoogleCodeExporter opened 9 years ago
[deleted comment]
I'm reliably observing the same thing in my load test, though I'm on version
2.6rc1. What I see is that both the write queue and input queues are filled up
(we have 2000 for write queue and 1000 for input queue), and then everything
just seem to stop (queues no longer draining, etc.).
Original comment by mewmewb...@gmail.com
on 19 Aug 2011 at 9:17
I saw this again in my load test. The queue sizes are stuck to:
input queue 20000
read queue 6
write queue 98
...and they are no longer moving at all, even when I removed load altogether.
Original comment by mewmewb...@gmail.com
on 25 Aug 2011 at 8:43
We saw this issue with 2.7.1 under heavy load test.
Original comment by vishy.ka...@gmail.com
on 19 Sep 2011 at 9:07
Can someone post a test that can be used to reproduce this issue?
Original comment by mikewie...@gmail.com
on 20 Sep 2011 at 11:00
I've got the same problem in my production application.
memcached server 1.4.4-3.el6 runs on a Red Hat Enterprise Linux Server release
6.1
and the client (v2.7) runs on a tomcat 6 in a CentOS release 4.8
java.lang.IllegalStateException: Interrupted while waiting to add
net.spy.memcached.protocol.ascii.GetOperationImpl@53cb966
at
net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:
298)
at
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:739)
at
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:719)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:410)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:974)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:988)
... our code after this
I fail to reprocude it in test case event while querying on a local or a
distant server..
Could i suggest to add the Exception root cause in the IllegalStateException
which is thrown in the TCPMemcachedNodeImpl.addOp(Operation) method to continue
investigation ?
Two calls in the addOp method could be interrupted, the authLatch.await() and
the inputQueue.offer().
Statistically, I think it's authLatch.await() call that is interrupted, but
cannot be sure without the InterruptedException initialy thrown.
i.e. :
try {
//.....
if(!inputQueue.offer(op, opQueueMaxBlockTime,
TimeUnit.MILLISECONDS)) {
throw new IllegalStateException("Timed out waiting to add "
+ op + "(max wait=" + opQueueMaxBlockTime + "ms)");
}
} catch(InterruptedException e) {
// Restore the interrupted status
Thread.currentThread().interrupt();
throw new IllegalStateException("Interrupted while waiting to add " + op, e);
}
Original comment by a...@one2team.com
on 27 Sep 2011 at 8:26
It's possible this has been fixed with some thread safety issues uncovered
recently. I think it makes sense to add the additional exception information
as requested.
Original comment by ingen...@gmail.com
on 16 Oct 2011 at 2:03
Is the 2.7.3 release should take advantage of these ?
Original comment by a...@one2team.com
on 24 Nov 2011 at 10:46
I haven't made the change yet. please post a changeset to review.couchbase.com
or send me a patch.
Original comment by ingen...@gmail.com
on 24 Nov 2011 at 5:02
I got the same problem in an application with spymemcached 2.5 and memcached
1.4.3. I can easily reproduce the issue by changing a few classes:
1) Reduce op queue length and queue block time (DefaultConnectionFactory.java).
2) Throw OutOfMemoryError in handleIO() (MemcachedConnection.java)
3) write a benchmark program to fetch and push data (InputQueueStressTest)
Run the benchmark program and A large amount of IllegalExceptions are seen in
console. In the meanwhile, memcached thread stops.
OutOfMemoryError may happen in some situations. The problem is, after system
recovers, how memcached thread is initiated again. So can MemcachedClient
ignore system error?
while(running) {
try {
conn.handleIO();
} catch(IOException e) {
logRunException(e);
} catch(CancelledKeyException e) {
logRunException(e);
} catch(ClosedSelectorException e) {
logRunException(e);
} catch(IllegalStateException e) {
logRunException(e);
} catch (Error e) { // ignore it. System may recover soon or later.
}
Original comment by smilingai2004@gmail.com
on 7 Dec 2011 at 8:03
Attachments:
The root cause is same with Issue 191, although symptom is different. I have
comment on issue 191 and request to escalate priority.
Original comment by smilingai2004@gmail.com
on 8 Dec 2011 at 1:46
Hmm, issue 191 said its fixes are in 2.7.3, and we've upgraded to 2.7.3 but are
still seeing these "total failures".
Original comment by mewmewb...@gmail.com
on 8 Dec 2011 at 2:00
It is not said "for sure". OutOfMemoryError may happen under rare conditions. I
have seen it in our production. In the meanwhile, memcached thread dies and
there is no recovery mechanism when it happens. So I think
MemcachedClient.run() should catch Throwable and log specific error information.
Original comment by smilingai2004@gmail.com
on 8 Dec 2011 at 7:13
Yes, jonathanz's recommendations there I would like to implement.
The thing is, the behavior wouldn't necessarily be that much different. If
we're in OOM, we'll just log a lot and not get real work done with jonathanz's
recommended change.
I'll probably get to this in a few weeks. If anyone has more time in the near
term to get a patch together, I'd gladly accept it.
Original comment by ingen...@gmail.com
on 8 Dec 2011 at 8:20
I have a patch based on 2.7.3. Can you get it into V2.8 preview?
Original comment by smilingai2004@gmail.com
on 8 Dec 2011 at 9:20
Attachments:
Can you post a patch by using git patch. It's difficult to see what you changed
here.
Original comment by mikewie...@gmail.com
on 8 Dec 2011 at 5:56
You can also post your code to our code review site at review.couchbase.org.
Instructions for doing this are here:
http://www.couchbase.org/wiki/display/membase/Contributing+changes
Original comment by mikewie...@gmail.com
on 8 Dec 2011 at 6:02
Post patch
Original comment by mountain...@gmail.com
on 12 Dec 2011 at 6:35
Attachments:
Did this fix make it into a release yet?
Original comment by joshg...@gmail.com
on 16 Feb 2012 at 11:16
It did not yet, will try to look into this soon and provide a snapshot. did
you hit this issue?
Original comment by ingen...@gmail.com
on 17 Feb 2012 at 7:03
I "think" we are encountering this issue (if it isn't it seems very close). We
have been struggling with a condition in which we run into a cascade of the
IllegalStateExceptions in our production environment. We haven't pieced
together the conditions that cause the cause the issue so are hoping someone
can pinpoint the issue better than we have so far.
The stacktrace we see repeated while this is occuring:
java.lang.IllegalStateException: Timed out waiting to add
net.spy.memcached.protocol.binary.GetOperationImpl@7cb95c23(max wait=10000ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:292)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:611)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:591)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:279)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:799)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:919)
Original comment by joshg...@gmail.com
on 17 Feb 2012 at 10:50
No, that's different, I believe.
The stack trace there indicates you're trying to do an additional asynchronous
get, but the client is overloaded with asynchronous gets. It looks like you've
done some tuning there.
I'd recommend taking this to the mailing list. If you can send over how you're
setting up the client and a description of the workload, I'll look out for it
there to try to help.
Original comment by ingen...@gmail.com
on 17 Feb 2012 at 10:57
Thanks Matt. I posted a question to the mailing list to see if I can get
guidance on how to avoid this condition. Thanks for the help!
Original comment by joshg...@gmail.com
on 23 Feb 2012 at 9:17
Hello. Are there any updates ? We are seeing this on 2.7.3:
Caused by: java.lang.IllegalStateException: Timed out waiting to add
net.spy.memcached.protocol.ascii.StoreOperationIm
pl@5da920ca(max wait=0ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:294)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:750)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:730)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:450)
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:487)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:908)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:114)
at net.spy.memcached.PooledMemcachedClient.set(PooledMemcachedClient.java:104)
Original comment by kitt...@gmail.com
on 13 May 2014 at 9:22
Once that happens, the client does not work properly anymore.
Original comment by kitt...@gmail.com
on 13 May 2014 at 9:24
Can you see if it works with the latest version? 2.11.2 .. your is quite dated
by now
http://search.maven.org/#artifactdetails%7Cnet.spy%7Cspymemcached%7C2.11.2%7Cjar
Original comment by michael....@gmail.com
on 13 May 2014 at 9:27
I believe I'm getting the same error in 2.11.2. One or two of our production
servers will get into this state every day or so resulting in a total lockup of
the client.
Error message
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque:
10073687 Key: v2.silo_map.silo_id_22(max wait=10000ms)
Stack trace
net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp (TCPMemcachedNodeImpl.java:362)
net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1228)
net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1194)
net.spy.memcached.MemcachedConnection.enqueueOperation (MemcachedConnection.java:1150)
net.spy.memcached.MemcachedClient.asyncGet (MemcachedClient.java:1045)
net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1230)
net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1257)
Original comment by js...@photobucket.com
on 1 Jul 2014 at 4:05
The attached unit test should show how to reproduce the issue. Basically, set
the queue sizes arbitrarily small (4 on my laptop) and then throw 5000 reads
into Callable objects and throw them at an executor service.
Sometimes, you'll see this exception:
java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.7.0_51]
at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:283) ~[na:1.7.0_51]
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.getNextWritableOp(TCPMemcachedNodeImpl.java:251) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:220) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:760) [spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:688) [spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:648) [spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) [spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) [spymemcached-2.11.2.jar:2.11.2]
Other times you'll see ours:
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque: 97
Key: test.key(max wait=10000ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1228) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.redistributeOperation(MemcachedConnection.java:1012) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.redistributeOperations(MemcachedConnection.java:984) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.queueReconnect(MemcachedConnection.java:956) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.lostConnection(MemcachedConnection.java:599) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:668) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) ~[spymemcached-2.11.2.jar:2.11.2]
at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) ~[spymemcached-2.11.2.jar:2.11.2]
I haven't dug any deeper to see if threads are hung or some sort of loop or
race is achieved. But in the event of my unit test the program never
terminates... Though it looks like infinite retries:
2014-07-01 21:53:31 pool-1-thread-2 [31mWARN [0;39m
[36mn.s.m.MemcachedConnection[0;39m - Could not redistribute to another node,
retrying primary node for test.key.
Original comment by js...@photobucket.com
on 2 Jul 2014 at 3:55
Attachments:
Could you try with 2.11.4? We made some improvements in that area
Original comment by michael....@gmail.com
on 2 Jul 2014 at 6:01
Confirmed the same behavior exists in 2.11.4.
Original comment by js...@photobucket.com
on 2 Jul 2014 at 2:49
Original comment by ingen...@gmail.com
on 2 Jul 2014 at 10:26
Any update?
Original comment by jeffrey....@gmail.com
on 17 Sep 2014 at 2:14
I found an excellent workaround to this issue: use xmemcached instead of
spymemcached.
It's truly embarrassing that this defect has been around for 3 years and you've
done nothing other than ask users to upgrade and retest. The more and more
traffic we've thrown at spymemcached the more often this issue comes up. Last
week, we got to the point where we saw this issue creep up several times
throughout the day across our production servers which required a complete
restart of our webapp.
As long as the defect exists, this project will never be appropriate for a high
availability production environment.
Original comment by js...@photobucket.com
on 26 Sep 2014 at 2:37
I am using xmemcached too because of this issue. Too bad, spymemcached looked
better initially, xmemcached is a documentation mess... but at least it works.
Original comment by floz...@gmail.com
on 26 Sep 2014 at 2:38
PD: I disagree with the tone of comment #33 . I am thankful for you guys
releasing spymemcached as opensource ... it's just not suitable for me until
this is fixed.
Original comment by floz...@gmail.com
on 26 Sep 2014 at 2:41
PD: I think the reason we think the memcached client is hung is because the
threads are waiting for 10seconds (default value) for item to add to the input
queue when the queue is full. By reducing the OpQueueMaxBlockTime to something
way smaller (say 100ms), your defect test program should terminate properly
with many time outs exception under high load but I think that is normal. The
following statement adds the call to set the OpQueueMaxBlockTime.
ConnectionFactory factory = new ConnectionFactoryBuilder()
.setProtocol(ConnectionFactoryBuilder.Protocol.BINARY)
.setDaemon(true)
.setOpTimeout(1000)
.setOpQueueFactory(new ArrayOperationQueueFactory(4))
.setReadOpQueueFactory(new ArrayOperationQueueFactory(4))
.setWriteOpQueueFactory(new ArrayOperationQueueFactory(4))
.setTranscoder(new SerializingTranscoder())
.setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
.setHashAlg(HashAlgorithmRegistry.lookupHashAlgorithm("Ketama"))
--> .setOpQueueMaxBlockTime(100) <--
.build();
Original comment by IAM...@gmail.com
on 9 Oct 2014 at 8:54
Original issue reported on code.google.com by
fiz...@gmail.com
on 1 Aug 2011 at 3:15