Rare total failure caused by input queue full

GoogleCodeExporter commented 8 years ago

What version of the product are you using? On what operating system?
Memcache 2.7 on Linux (EC2 instance using the Ubuntu Maverick AMI ami-08f40561)

Tell me more...

I think that there is a state the memcache client can get into where all 
network traffic ceases and all calls hang for ten seconds, then throw 
IllegalStateException.

I got our program into this state twice over the weekend by running some 
aggressive loadtests with some memcache servers that were ~ 0.5ms away from the 
client machine; I have not since then been able to reproduce this failure in an 
environment where I could hook up a debugger to see what's going on.

The observed exception is this:

java.lang.IllegalStateException: Timed out waiting to add 
net.spy.memcached.protocol.ascii.StoreOperationImpl@727b65b4(max wait=10000ms)
at 
net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:
292)
at 
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:739)
at 
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:719)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:410)
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:447)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:868)
... (our code after this)

Meanwhile, the IO thread is sitting waiting for the Selector to select 
something. (at 
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:252) 
The IO thread never wakes up.

Once the client enters this state, it's all over: every memcache client 
operation times out because the input queue is full. Because the add call in 
TCPMemcachedNodeImpl can't add to the queue, the selector is never woken up, 
and the client never recovers. The only thing to do in this case is shutdown() 
the current memcache client and throw it away.

It might be possible to fix this with a judicious application of finally inside 
MemcachedConnection.addOperation().

Original issue reported on code.google.com by fiz...@gmail.com on 1 Aug 2011 at 3:15

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

I'm reliably observing the same thing in my load test, though I'm on version 
2.6rc1. What I see is that both the write queue and input queues are filled up 
(we have 2000 for write queue and 1000 for input queue), and then everything 
just seem to stop (queues no longer draining, etc.).

Original comment by mewmewb...@gmail.com on 19 Aug 2011 at 9:17

GoogleCodeExporter commented 8 years ago

I saw this again in my load test. The queue sizes are stuck to:

input queue 20000
read queue 6
write queue 98

...and they are no longer moving at all, even when I removed load altogether.

Original comment by mewmewb...@gmail.com on 25 Aug 2011 at 8:43

GoogleCodeExporter commented 8 years ago

We saw this issue with 2.7.1 under heavy load test.

Original comment by vishy.ka...@gmail.com on 19 Sep 2011 at 9:07

GoogleCodeExporter commented 8 years ago

Can someone post a test that can be used to reproduce this issue?

Original comment by mikewie...@gmail.com on 20 Sep 2011 at 11:00

GoogleCodeExporter commented 8 years ago

I've got the same problem in my production application.

memcached server 1.4.4-3.el6 runs on a Red Hat Enterprise Linux Server release 
6.1
and the client (v2.7) runs on a tomcat 6 in a CentOS release 4.8

java.lang.IllegalStateException: Interrupted while waiting to add 
net.spy.memcached.protocol.ascii.GetOperationImpl@53cb966
at 
net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:
298)
at 
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:739)
at 
net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:719)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:410)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:974)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:988)
... our code after this

I fail to reprocude it in test case event while querying on a local or a 
distant server..

Could i suggest to add the Exception root cause in the IllegalStateException
which is thrown in the TCPMemcachedNodeImpl.addOp(Operation) method to continue 
investigation ?

Two calls in the addOp method could be interrupted, the authLatch.await() and 
the inputQueue.offer().
Statistically, I think it's authLatch.await() call that is interrupted, but 
cannot be sure without the InterruptedException initialy thrown.

i.e. :

    try {
    //.....
        if(!inputQueue.offer(op, opQueueMaxBlockTime,
                TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("Timed out waiting to add "
                    + op + "(max wait=" + opQueueMaxBlockTime + "ms)");
        }
    } catch(InterruptedException e) {
        // Restore the interrupted status
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting to add " + op, e);
    }

Original comment by a...@one2team.com on 27 Sep 2011 at 8:26

GoogleCodeExporter commented 8 years ago

It's possible this has been fixed with some thread safety issues uncovered 
recently.  I think it makes sense to add the additional exception information 
as requested.

Original comment by ingen...@gmail.com on 16 Oct 2011 at 2:03

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Is the 2.7.3 release should take advantage of these ?

Original comment by a...@one2team.com on 24 Nov 2011 at 10:46

GoogleCodeExporter commented 8 years ago

I haven't made the change yet.  please post a changeset to review.couchbase.com 
or send me a patch.

Original comment by ingen...@gmail.com on 24 Nov 2011 at 5:02

GoogleCodeExporter commented 8 years ago

I got the same problem in an application with spymemcached 2.5 and memcached 
1.4.3. I can easily reproduce the issue by changing a few classes:
1) Reduce op queue length and queue block time (DefaultConnectionFactory.java).
2) Throw OutOfMemoryError in handleIO() (MemcachedConnection.java)
3) write a benchmark program to fetch and push data (InputQueueStressTest)
Run the benchmark program and A large amount of IllegalExceptions are seen in 
console. In the meanwhile, memcached thread stops.

OutOfMemoryError may happen in some situations. The problem is, after system 
recovers, how memcached thread is initiated again. So can MemcachedClient 
ignore system error?
        while(running) {
            try {
                conn.handleIO();                
            } catch(IOException e) {
                logRunException(e);
            } catch(CancelledKeyException e) {
                logRunException(e);
            } catch(ClosedSelectorException e) {
                logRunException(e);
            } catch(IllegalStateException e) {
                logRunException(e);
            }  catch (Error e) { // ignore it. System may recover soon or later.

            }

Original comment by smilingai2004@gmail.com on 7 Dec 2011 at 8:03

Attachments:

GoogleCodeExporter commented 8 years ago

The root cause is same with Issue 191, although symptom is different. I have 
comment on issue 191 and request to escalate priority.

Original comment by smilingai2004@gmail.com on 8 Dec 2011 at 1:46

GoogleCodeExporter commented 8 years ago

Hmm, issue 191 said its fixes are in 2.7.3, and we've upgraded to 2.7.3 but are 
still seeing these "total failures".

Original comment by mewmewb...@gmail.com on 8 Dec 2011 at 2:00

GoogleCodeExporter commented 8 years ago

It is not said "for sure". OutOfMemoryError may happen under rare conditions. I 
have seen it in our production. In the meanwhile, memcached thread dies and 
there is no recovery mechanism when it happens. So I think 
MemcachedClient.run() should catch Throwable and log specific error information.

Original comment by smilingai2004@gmail.com on 8 Dec 2011 at 7:13

GoogleCodeExporter commented 8 years ago

Yes, jonathanz's recommendations there I would like to implement.

The thing is, the behavior wouldn't necessarily be that much different.  If 
we're in OOM, we'll just log a lot and not get real work done with jonathanz's 
recommended change.  

I'll probably get to this in a few weeks.  If anyone has more time in the near 
term to get a patch together, I'd gladly accept it.

Original comment by ingen...@gmail.com on 8 Dec 2011 at 8:20

GoogleCodeExporter commented 8 years ago

I have a patch based on 2.7.3. Can you get it into V2.8 preview?

Original comment by smilingai2004@gmail.com on 8 Dec 2011 at 9:20

Attachments:

MemcachedClient.java

GoogleCodeExporter commented 8 years ago

Can you post a patch by using git patch. It's difficult to see what you changed 
here.

Original comment by mikewie...@gmail.com on 8 Dec 2011 at 5:56

GoogleCodeExporter commented 8 years ago

You can also post your code to our code review site at review.couchbase.org. 
Instructions for doing this are here: 
http://www.couchbase.org/wiki/display/membase/Contributing+changes

Original comment by mikewie...@gmail.com on 8 Dec 2011 at 6:02

GoogleCodeExporter commented 8 years ago

Post patch

Original comment by mountain...@gmail.com on 12 Dec 2011 at 6:35

Attachments:

fix_full_inputqueue.patch

GoogleCodeExporter commented 8 years ago

Did this fix make it into a release yet?

Original comment by joshg...@gmail.com on 16 Feb 2012 at 11:16

GoogleCodeExporter commented 8 years ago

It did not yet, will try to look into this soon and provide a snapshot.  did 
you hit this issue?

Original comment by ingen...@gmail.com on 17 Feb 2012 at 7:03

GoogleCodeExporter commented 8 years ago

I "think" we are encountering this issue (if it isn't it seems very close).  We 
have been struggling with a condition in which we run into a cascade of the 
IllegalStateExceptions in our production environment.  We haven't pieced 
together the conditions that cause the cause the issue so are hoping someone 
can pinpoint the issue better than we have so far.

The stacktrace we see repeated while this is occuring:

java.lang.IllegalStateException: Timed out waiting to add
net.spy.memcached.protocol.binary.GetOperationImpl@7cb95c23(max wait=10000ms)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:292)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:611)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:591)
        at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:279)
        at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:799)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:919)

Original comment by joshg...@gmail.com on 17 Feb 2012 at 10:50

GoogleCodeExporter commented 8 years ago

No, that's different, I believe.

The stack trace there indicates you're trying to do an additional asynchronous 
get, but the client is overloaded with asynchronous gets.  It looks like you've 
done some tuning there.  

I'd recommend taking this to the mailing list.  If you can send over how you're 
setting up the client and a description of the workload, I'll look out for it 
there to try to help.

Original comment by ingen...@gmail.com on 17 Feb 2012 at 10:57

GoogleCodeExporter commented 8 years ago

Thanks Matt.  I posted a question to the mailing list to see if I can get 
guidance on how to avoid this condition.  Thanks for the help!

Original comment by joshg...@gmail.com on 23 Feb 2012 at 9:17

GoogleCodeExporter commented 8 years ago

Hello. Are there any updates ? We are seeing this on 2.7.3:

Caused by: java.lang.IllegalStateException: Timed out waiting to add 
net.spy.memcached.protocol.ascii.StoreOperationIm
pl@5da920ca(max wait=0ms)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:294)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:750)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:730)
        at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:450)
        at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:487)
        at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:908)
        at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:114)
        at net.spy.memcached.PooledMemcachedClient.set(PooledMemcachedClient.java:104)

Original comment by kitt...@gmail.com on 13 May 2014 at 9:22

GoogleCodeExporter commented 8 years ago

Once that happens, the client does not work properly anymore.

Original comment by kitt...@gmail.com on 13 May 2014 at 9:24

GoogleCodeExporter commented 8 years ago

Can you see if it works with the latest version? 2.11.2 .. your is quite dated 
by now

http://search.maven.org/#artifactdetails%7Cnet.spy%7Cspymemcached%7C2.11.2%7Cjar

Original comment by michael....@gmail.com on 13 May 2014 at 9:27

GoogleCodeExporter commented 8 years ago

I believe I'm getting the same error in 2.11.2. One or two of our production 
servers will get into this state every day or so resulting in a total lockup of 
the client.

Error message
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque: 
10073687 Key: v2.silo_map.silo_id_22(max wait=10000ms)

Stack trace
  net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp (TCPMemcachedNodeImpl.java:362)

     net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1228)

     net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1194)

 net.spy.memcached.MemcachedConnection.enqueueOperation (MemcachedConnection.java:1150)

             net.spy.memcached.MemcachedClient.asyncGet (MemcachedClient.java:1045)

                  net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1230)

                  net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1257)

Original comment by js...@photobucket.com on 1 Jul 2014 at 4:05

GoogleCodeExporter commented 8 years ago

The attached unit test should show how to reproduce the issue. Basically, set 
the queue sizes arbitrarily small (4 on my laptop) and then throw 5000 reads 
into Callable objects and throw them at an executor service.

Sometimes, you'll see this exception:
java.lang.IllegalStateException: Queue full
    at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.7.0_51]
    at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:283) ~[na:1.7.0_51]
    at net.spy.memcached.protocol.TCPMemcachedNodeImpl.getNextWritableOp(TCPMemcachedNodeImpl.java:251) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:220) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:760) [spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:688) [spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:648) [spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) [spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) [spymemcached-2.11.2.jar:2.11.2]

Other times you'll see ours:
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque: 97 
Key: test.key(max wait=10000ms)
    at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1228) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.redistributeOperation(MemcachedConnection.java:1012) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.redistributeOperations(MemcachedConnection.java:984) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.queueReconnect(MemcachedConnection.java:956) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.lostConnection(MemcachedConnection.java:599) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:668) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) ~[spymemcached-2.11.2.jar:2.11.2]
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) ~[spymemcached-2.11.2.jar:2.11.2]

I haven't dug any deeper to see if threads are hung or some sort of loop or 
race is achieved. But in the event of my unit test the program never 
terminates... Though it  looks like infinite retries:

2014-07-01 21:53:31 pool-1-thread-2 [31mWARN [0;39m 
[36mn.s.m.MemcachedConnection[0;39m - Could not redistribute to another node, 
retrying primary node for test.key.

Original comment by js...@photobucket.com on 2 Jul 2014 at 3:55

Attachments:

MemcachedDefectTest.java

GoogleCodeExporter commented 8 years ago

Could you try with 2.11.4? We made some improvements in that area

Original comment by michael....@gmail.com on 2 Jul 2014 at 6:01

GoogleCodeExporter commented 8 years ago

Confirmed the same behavior exists in 2.11.4.

Original comment by js...@photobucket.com on 2 Jul 2014 at 2:49

GoogleCodeExporter commented 8 years ago

Original comment by ingen...@gmail.com on 2 Jul 2014 at 10:26

GoogleCodeExporter commented 8 years ago

Any update?

Original comment by jeffrey....@gmail.com on 17 Sep 2014 at 2:14

GoogleCodeExporter commented 8 years ago

I found an excellent workaround to this issue: use xmemcached instead of 
spymemcached.

It's truly embarrassing that this defect has been around for 3 years and you've 
done nothing other than ask users to upgrade and retest. The more and more 
traffic we've thrown at spymemcached the more often this issue comes up. Last 
week, we got to the point where we saw this issue creep up several times 
throughout the day across our production servers which required a complete 
restart of our webapp. 

As long as the defect exists, this project will never be appropriate for a high 
availability production environment.

Original comment by js...@photobucket.com on 26 Sep 2014 at 2:37

GoogleCodeExporter commented 8 years ago

I am using xmemcached too because of this issue. Too bad, spymemcached looked 
better initially, xmemcached is a documentation mess... but at least it works.

Original comment by floz...@gmail.com on 26 Sep 2014 at 2:38

GoogleCodeExporter commented 8 years ago

PD: I disagree with the tone of comment #33 . I am thankful for you guys 
releasing spymemcached as opensource ... it's just not suitable for me until 
this is fixed.

Original comment by floz...@gmail.com on 26 Sep 2014 at 2:41

GoogleCodeExporter commented 8 years ago

PD: I think the reason we think the memcached client is hung is because the 
threads are waiting for 10seconds (default value) for item to add to the input 
queue when the queue is full. By reducing the OpQueueMaxBlockTime to something 
way smaller (say 100ms), your defect test program  should terminate properly 
with many time outs exception under high load but I think that is normal.  The 
following statement adds the call to set the OpQueueMaxBlockTime.  

        ConnectionFactory factory = new ConnectionFactoryBuilder()
                .setProtocol(ConnectionFactoryBuilder.Protocol.BINARY)
                .setDaemon(true)
                .setOpTimeout(1000)
                .setOpQueueFactory(new ArrayOperationQueueFactory(4))
                .setReadOpQueueFactory(new ArrayOperationQueueFactory(4))
                .setWriteOpQueueFactory(new ArrayOperationQueueFactory(4))
                .setTranscoder(new SerializingTranscoder())
                .setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
                .setHashAlg(HashAlgorithmRegistry.lookupHashAlgorithm("Ketama"))
          -->   .setOpQueueMaxBlockTime(100)  <--
                .build();

Original comment by IAM...@gmail.com on 9 Oct 2014 at 8:54

shixin42 / spymemcached

Rare total failure caused by input queue full #192