Update: yesterday we may have had some connectivity problems between the
nodes. But even in that case, should the client IO thread be affected this much?
Original comment by ksafo...@rutarget.ru
on 18 Oct 2013 at 10:15
Hi, just checking whether there's any progress on this. This is critical for us,
as it makes the whole application inoperable until restart.
Original comment by ksafo...@rutarget.ru
on 7 Nov 2013 at 11:07
Can you supply a full log of messages from spymemcached? There must be a
missing one since it appears none of these is from IO thread processing. This
is all from application threads. The one that shows what happened to the IO
thread is the interesting one.
Also, you should probably be using the CouchbaseClient. It's by the same
authors (Michael, Mike, myself) and is how we test against Couchbase.
Original comment by ingen...@gmail.com
on 7 Nov 2013 at 3:46
I've sent the logs in a separate e-mail.
Original comment by ksafo...@rutarget.ru
on 7 Nov 2013 at 4:18
Hi guys, just checking whether you have looked at the logs.
Original comment by ksafo...@rutarget.ru
on 12 Nov 2013 at 11:19
Hi, I think I've got more information on the issue. While running our test suite
locally we encountered a StackOverflowError at
net.spy.memcached.ops.MultiOperationCallback.complete(). I think I found what
the problem is.
If a client operation (e.g. "get") fails, it is retried, i.e. added to the
MemcachedConnection.retryOps list and later redistributed
(MemcachedConnection.redistributeOperations()). When an operation is
redistributed, it is first cloned (OperationFactory.clone(KeyedOperation)), so
a new instance of the MultiGetOperationCallback class is created and the
original operation's callback is passed to it as a delegate. If clone is called
many times for the same operation (due to continuous server failure), a long
chain of MultiGetOperationCallbacks is built. When the operation finally
succeeds (MultiOperationCallback.complete), this chain is executed recursively
("originalCallback.complete()"). If the chain length exceeds the maximum stack
size, an SOE is thrown, which terminates the MemcachedConnection.handleIO() loop
and shuts down the whole client application.
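To make the mechanism concrete, here is a minimal sketch of how such a delegate chain grows. It does not use the real spymemcached classes: `Callback`, `WrappingCallback`, `cloneRepeatedly` and `chainLength` are hypothetical stand-ins, and the only assumption carried over from the description above is that each clone wraps the previous callback and forwards `complete()` to it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT the spymemcached types.
final class CallbackChainSketch {

  /** Minimal callback contract: only the completion hook matters here. */
  interface Callback {
    void complete();
  }

  /** Wraps another callback and forwards complete() to it, like a clone-time wrapper. */
  static final class WrappingCallback implements Callback {
    final Callback originalCallback;

    WrappingCallback(Callback originalCallback) {
      this.originalCallback = originalCallback;
    }

    @Override
    public void complete() {
      // Each wrapper in the chain costs one extra stack frame.
      originalCallback.complete();
    }
  }

  /** Simulates cloning the same operation `retries` times; every clone wraps the previous callback. */
  static Callback cloneRepeatedly(Callback userCallback, int retries) {
    Callback current = userCallback;
    for (int i = 0; i < retries; i++) {
      current = new WrappingCallback(current);
    }
    return current;
  }

  /** Counts how many wrappers sit in front of the innermost callback. */
  static int chainLength(Callback cb) {
    int length = 0;
    while (cb instanceof WrappingCallback) {
      cb = ((WrappingCallback) cb).originalCallback;
      length++;
    }
    return length;
  }

  public static void main(String[] args) {
    AtomicInteger completions = new AtomicInteger();
    // 1,000 retries is small enough to stay well inside a default thread stack.
    Callback chained = cloneRepeatedly(completions::incrementAndGet, 1_000);
    chained.complete(); // walks through every wrapper before reaching the user callback
    System.out.println("wrappers=" + chainLength(chained) + " completions=" + completions.get());
  }
}
```

The recursion depth of `complete()` equals the number of retries, so a long enough outage would presumably push it past the thread's stack limit.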
The fix I would make is to always extract the ultimate (innermost) callback and
use it as the original one (e.g. in the MultiOperationCallback constructor).
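A rough sketch of that idea, under stated assumptions: `Callback` and `UnwrappingCallback` are hypothetical names, not the real spymemcached classes, and this is not the actual patch. The constructor checks whether the callback it receives is already a wrapper and, if so, adopts that wrapper's delegate instead of nesting.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT the spymemcached types.
final class UnwrappingCallbackSketch {

  /** Minimal callback contract: only the completion hook matters here. */
  interface Callback {
    void complete();
  }

  /** A wrapper that never nests: its delegate is always the innermost callback. */
  static final class UnwrappingCallback implements Callback {
    final Callback originalCallback;

    UnwrappingCallback(Callback candidate) {
      // Every existing wrapper already holds an unwrapped delegate,
      // so a single unwrapping step is enough to reach the innermost callback.
      this.originalCallback = (candidate instanceof UnwrappingCallback)
          ? ((UnwrappingCallback) candidate).originalCallback
          : candidate;
    }

    @Override
    public void complete() {
      // Always exactly one hop, regardless of how many clones came before.
      originalCallback.complete();
    }
  }

  /** Simulates cloning the same operation `retries` times with the unwrapping constructor. */
  static UnwrappingCallback cloneRepeatedly(Callback userCallback, int retries) {
    UnwrappingCallback current = new UnwrappingCallback(userCallback);
    for (int i = 0; i < retries; i++) {
      current = new UnwrappingCallback(current);
    }
    return current;
  }

  public static void main(String[] args) {
    AtomicInteger completions = new AtomicInteger();
    Callback user = completions::incrementAndGet;
    UnwrappingCallback cloned = cloneRepeatedly(user, 1_000_000);
    cloned.complete(); // constant stack depth even after a million clones
    System.out.println("delegate is user callback: " + (cloned.originalCallback == user)
        + ", completions=" + completions.get());
  }
}
```

With this shape the stack depth of `complete()` no longer depends on the retry count. Any per-wrapper state the real callback keeps would need to be handled separately; the sketch ignores it.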
Original comment by ksafo...@rutarget.ru
on 19 Nov 2013 at 12:45
Yeah, I've seen this come up before and I think we need to address it in
some way.
Are you sure that this is the root cause of the issue reported originally in
the ticket? Normally, these funky callback chains should only come up on RETRY
operations (when they get cloned), which should not happen on a timeout but
when the vbucket moves (so during failure cases and such).
Original comment by michael....@gmail.com
on 20 Nov 2013 at 6:12
Well, I can't be sure, as I can hardly reproduce the problem. The last time we
saw it in production was when we were testing the availability of the cluster
by hard shutting down one of the Couchbase nodes.
Original comment by ksafo...@rutarget.ru
on 20 Nov 2013 at 8:11
Okay, that correlates then. I'll try to come up with a proper fix for this,
hopefully in the next release.
Original comment by michael....@gmail.com
on 20 Nov 2013 at 8:17
I've just sent a pull request with the fix I would make.
Original comment by ksafo...@rutarget.ru
on 20 Nov 2013 at 8:30
Thanks for thinking about it. I'd thought about something different.
Did this fix your issue?
Original comment by michael....@gmail.com
on 20 Nov 2013 at 8:36
First I need to put it into production and run it for a couple of days under
real load. I will get back here then.
Original comment by ksafo...@rutarget.ru
on 20 Nov 2013 at 8:44
please see http://www.couchbase.com/issues/browse/SPY-136
and corresponding
http://review.couchbase.com/#/c/30415/
what do you think?
I'll try to get it into 2.10.3
Original comment by michael....@gmail.com
on 20 Nov 2013 at 9:31
Original issue reported on code.google.com by
ksafo...@rutarget.ru
on 18 Oct 2013 at 10:14