Fatal error: OperationTimeoutException-s followed by dreadful IO Thread is not running

GoogleCodeExporter commented 9 years ago

We're using 5 nodes of CouchBase 1.8.1 community edition (build-937) running on 
CentOS 6.4 (Intel E3-1245 Quad, 32 Gb RAM, 2 x 3 TB SATA 6 Gb/s HDD
7200 rpm) and connecting via Java client 1.1.9 from Linux nodes in the same 
subnetwork.

Client makes about 10000 requests per second to the cluster (Couchbase 
buckets), most of them are get-s and getbulk-s.

At some moment we started getting lots of messages in the logs:
net.spy.memcached.OperationTimeoutException: Timeout waiting for bulk values: 
waited 100 ms. Node status: Connection Status { x.x.x.210:11210 active: true, 
authed: true, last read: 894 ms ago x.x.x.198:11210 active: true, authed: true, 
last read: 1,602 ms ago x.x.x.205:11210 active: true, authed: true, last read: 
100 ms ago x.x.x.114:11210 active: true, authed: true, last read: 313 ms ago }

Then these messages stopped and different ones appeared:
java.lang.AssertionError: IO Thread is not running.
 at net.spy.memcached.MemcachedConnection.checkState(MemcachedConnection.java:954)
 at net.spy.memcached.MemcachedClient.asyncGetBulk(MemcachedClient.java:1244)
 at net.spy.memcached.MemcachedClient.asyncGetBulk(MemcachedClient.java:1280)
 at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1425)
 at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1469)

Since that time no single operation was successful so we had to restart Java 
application to fix the problem.

Original issue reported on code.google.com by ksafo...@rutarget.ru on 18 Oct 2013 at 10:14

GoogleCodeExporter commented 9 years ago

Update: yesterday we might have some problems with connectivity between the 
nodes. But in such case client IO thread should not be affected so much?

Original comment by ksafo...@rutarget.ru on 18 Oct 2013 at 10:15

GoogleCodeExporter commented 9 years ago

Hi, just to check if there's any progress on this? This is very critical for us 
as it makes whole application inoperable until restart.

Original comment by ksafo...@rutarget.ru on 7 Nov 2013 at 11:07

GoogleCodeExporter commented 9 years ago

Can you supply a full log of messages from spymemcached?  There must be a 
missing one since it appears none of these is from IO thread processing.  This 
is all from application threads.  The one that shows what happened to the IO 
thread is the interesting one.

Also, you should probably be using the CouchbaseClient.  It's by the same 
authors (Michael, Mike, myself) and is how we test against Couchbase.

Original comment by ingen...@gmail.com on 7 Nov 2013 at 3:46

Changed state: NeedInfo

GoogleCodeExporter commented 9 years ago

I've sent the logs in a separate e-mail.

Original comment by ksafo...@rutarget.ru on 7 Nov 2013 at 4:18

GoogleCodeExporter commented 9 years ago

Hi guys, just to check if you looked at the logs?

Original comment by ksafo...@rutarget.ru on 12 Nov 2013 at 11:19

GoogleCodeExporter commented 9 years ago

Hi, I think I got more information on the issue. While running our test suite 
locally we encountered StackOverflowException at 
net.spy.memcached.ops.MultiOperationCallback.complete(). I think I found what 
the problem is.

If client operation (e.g. "get") fails, it is retried, i.e. added into 
MemcachedConnection.retryOps list and later redistributed 
(MemcachedConnection.redistributeOperations()). Next, when operation is 
redistributed, it is cloned first (OperationFactory.clone(KeyedOperation)), so 
that new instance of MultiGetOperationCallback class is created and original 
operation's callback is passed there as a delegate. If clone is called many 
times for the same operation (due to continuous server failure) a long chain of 
MultiGetOperationCallback-s will be built. When operation finally succeeds 
(MultiOperationCallback.complete) this chain is executed in an recursive 
fashion ("originalCallback.complete()"). If chain length exceeds maximum stack 
size a SOE will be thrown that terminates MemcachedConnection.handleIO() loop 
and shuts down a whole client application.

The solution I would take is to always extract the ultimate callback to use as 
original one (e.g. in MultiOperationCallback constructor).

Original comment by ksafo...@rutarget.ru on 19 Nov 2013 at 12:45

GoogleCodeExporter commented 9 years ago

Jeah I've seen this coming up some time before and I think we need to address 
that some way. 

Are you sure that this is the root cause of the issue reported originally in 
the ticket? Normally, these funky callback chains should only come up on RETRY 
operations (when they get cloned), which should not happen on a timeout but 
when the vbucket moves (so during failure cases and such).

Original comment by michael....@gmail.com on 20 Nov 2013 at 6:12

GoogleCodeExporter commented 9 years ago

Well I can't be sure as I can hardly reproduce the problem. Last time we've 
seen it in production was when testing the availability of the cluster by hard 
shutting down one of the Couchbase nodes.

Original comment by ksafo...@rutarget.ru on 20 Nov 2013 at 8:11

GoogleCodeExporter commented 9 years ago

Okay that correlates then. I'll try to come up with a proper fix for this in 
the next release hopefully.

Original comment by michael....@gmail.com on 20 Nov 2013 at 8:17

GoogleCodeExporter commented 9 years ago

I've just sent a pull request with the fix I would make.

Original comment by ksafo...@rutarget.ru on 20 Nov 2013 at 8:30

GoogleCodeExporter commented 9 years ago

Thanks for thinking about it, I'd thought about something different.

Did this reportedly fix your issue?

Original comment by michael....@gmail.com on 20 Nov 2013 at 8:36

GoogleCodeExporter commented 9 years ago

First I need to put it to production and run for a couple of days under real 
load. I will get back here then.

Original comment by ksafo...@rutarget.ru on 20 Nov 2013 at 8:44

GoogleCodeExporter commented 9 years ago

please see http://www.couchbase.com/issues/browse/SPY-136

and corresponding

http://review.couchbase.com/#/c/30415/

what do you think?

I'll try to get it into 2.10.3

Original comment by michael....@gmail.com on 20 Nov 2013 at 9:31

nidgupta / spymemcached

Fatal error: OperationTimeoutException-s followed by dreadful IO Thread is not running #286