nidgupta / spymemcached

Automatically exported from code.google.com/p/spymemcached

unrecoverable after CheckedOperationTimeoutException #136

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm seeing the following CheckedOperationTimeoutException under medium load (~100 concurrent reads):

net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: /127.0.0.1:11211
        at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:59)
        at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)

client: 2.5
server: 1.4.4 (Linux)
Timeout is set to 3 sec (async call)

public <T> T getAfterDisableCheck(String key, Class<T> aclazz) {
    T cacheObj = null;
    try {
        // Asynchronous get, bounded by the configured timeout on the Future.
        Future<Object> result = this.cacheClient.asyncGet(key);
        cacheObj = (T) result.get(getConfig().getTimeOutInSec(), TimeUnit.SECONDS);
        if (cacheObj != null && !aclazz.isAssignableFrom(cacheObj.getClass())) {
            Cache.MEMCACHED_CLIENT_LOG.error("Invalid expected type set for key: " + key
                    + " " + aclazz + " is NOT an instance of " + cacheObj.getClass());
            cacheObj = null;
        }
    } catch (Exception e) {
        Cache.MEMCACHED_CLIENT_LOG.error("Error interacting with memcache for key: " + key, e);
    }
    return cacheObj;
}

The client runs on Linux and memcached is accessed via localhost.

Thanks

Original issue reported on code.google.com by S.Ille...@gmail.com on 27 Apr 2010 at 4:30

GoogleCodeExporter commented 9 years ago
Well, it may not be a resurgence of issue 35, as I found one other area of 
testing that needed some fixing.  After fixing that, I still get occasional 
failures, which I am looking into, but I've posted a build that does fix the 
problem I've talked about above.

It's available here:
http://code.google.com/p/spymemcached/downloads/detail?name=memcached-2.5-23-g7b7ae44.jar&can=2&q=

The javadoc/source are also in downloads:
http://code.google.com/p/spymemcached/downloads/list

The work in progress code is also on github:
https://github.com/ingenthr/java-memcached-client/commits/spy136WIP

It'll need some reformatting/documentation/squashing before it goes into a real release, but I think this is going in the right direction.

Note that there is a new state for an operation, TIMEDOUT. This and the change to the continuous timeout handling will likely trigger us to update to version 2.6, but that's without any deprecation of the existing API.

Original comment by ingen...@gmail.com on 28 Dec 2010 at 3:43

GoogleCodeExporter commented 9 years ago
I fixed the occasional test issue referred to above; the code I posted yesterday had a problem with async get operations that caused all of them to run for the full duration of any timeout on the Future's get().

The code is available here:
http://code.google.com/p/spymemcached/downloads/detail?name=memcached-2.5-26-g3cad4cb.jar&can=2&q=

And just like my last posting, the Javadoc/sources and branch on github have 
been updated.

Original comment by ingen...@gmail.com on 29 Dec 2010 at 5:20

GoogleCodeExporter commented 9 years ago
It appears a few people have downloaded this and are giving it a shot.  Please 
post any feedback.

Also, if you're interested, you can review the (newly cleaned up) code commits 
over here: http://review.membase.org/#q,status:open+project:spymemcached,n,z

Original comment by ingen...@gmail.com on 3 Jan 2011 at 1:14

GoogleCodeExporter commented 9 years ago

Original comment by ingen...@gmail.com on 3 Jan 2011 at 1:20

GoogleCodeExporter commented 9 years ago
We tried the jar given in Comment 52 and still see the problem using the spymemcached-stress application. Has this been fixed somewhere?

Original comment by radost...@gmail.com on 2 Feb 2011 at 5:54

GoogleCodeExporter commented 9 years ago
We took the jar dated 02 Jan and we still encounter timeouts under heavy load. By the way, it seems that the client no longer reconnects after a restart of memcached (and the CPU rises to 100%).

Original comment by cyril.d...@gmail.com on 2 Feb 2011 at 6:59

GoogleCodeExporter commented 9 years ago
Yes, using this client has taken down our server twice.  I'm in the middle of 
swapping out our client with xmemcached, since having our server go down is 
unacceptable.

Original comment by radost...@gmail.com on 2 Feb 2011 at 7:24

GoogleCodeExporter commented 9 years ago
@radost... others have said it has helped and I was going to put together a 
release candidate.  Can you give me some more detail on how it crashed your 
server?

Original comment by ingen...@gmail.com on 4 Feb 2011 at 1:46

GoogleCodeExporter commented 9 years ago
ingen... our server is on the stock 2.5 and threw the net.spy.memcached.internal.CheckedOperationTimeoutException error after running for about a week. All the threads on the app server backed up even though we had a 2 second timeout and there is a global 1 second timeout. The memcached server was unaffected; after we restarted the app server the application continued on fine, so it is the same thing others are seeing where the client just can't reconnect. I would have thought that with the timeouts the server would have continued on, just slower, but the threads were all backed up/not responding. We tested your supplied jar from comment 52 using the http://github.com/stevenschlansker/spymemcached-stress application and again saw the timeout exceptions. It took less than a day to see the exceptions. If I have time this weekend I will try it again. Incidentally, have you tried bisecting to find the offending commit using the spymemcached-stress application?

Original comment by radost...@gmail.com on 5 Feb 2011 at 1:35

GoogleCodeExporter commented 9 years ago
@radost...

First a couple of answers to your questions.  I did look at the 
spymemcached-stress app, and found it was much like a built in test that Dustin 
had added to ensure the client recovered after timeouts.  I spent a lot of time 
with that test and my modifications to ensure that after a timeout, the client 
does recover. As I'd said above, realistically it will be impossible to completely eliminate ever receiving a timeout.

It sounds like your server is in an environment where you regularly saturate your resources and you just want it to continue to handle things, just more slowly. If that's the case, you can increase the timeout.
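
For example, a minimal sketch of raising the operation timeout at client construction, assuming a client version that provides ConnectionFactoryBuilder (it appears later in this thread); the 5000 ms value and the address are illustrative, not recommendations:

import java.io.IOException;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public final class TimeoutTuningSketch {
    public static MemcachedClient buildClient() throws IOException {
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder();
        // Raise the per-operation timeout so brief saturation shows up as slow
        // responses rather than CheckedOperationTimeoutException.
        builder.setOpTimeout(5000);
        return new MemcachedClient(builder.build(),
                AddrUtil.getAddresses("127.0.0.1:11211"));
    }
}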

Regarding the "server continuing on, just slower", that's actually not the case. You can hit a timeout for a number of reasons (network issues, JVM work). I would expect that when you see a timeout, you'll see a big series of them, and then it will stabilize and come back to normal. There is no built-in backoff to slow down the app to reduce the number of timeouts seen before recovery. That could be done at the application level.

Finally, on the question of bisecting, there isn't an offending commit per se. The latch and timeout were added when the client appeared to hang forever in certain failure cases. It's correct for them to be there, but what I believe to be incorrect was the very low default timeout value and the fact that we did not count the timeout from operation creation and skip sending an operation over the network if it had already timed out. That behavior would more likely create a scenario with an extended run of timed-out operations, rather than a burst of them.

I expect to cover what I've put in this commit in the release notes:
https://github.com/dustin/java-memcached-client/commit/0e1ebdb844b11f141e389ef584288a39219512a8

Some of the JVM tuning in that commit message may help you.

Please understand, I really want to fix this and have put quite a bit of time 
into looking into it.  If there's a scenario where the client is breaking 
servers, we'll need to find a better way to handle that.  It sounds like, from 
what you say above, you'll need to turn down the timeout or restrict the number 
of client threads to ensure it doesn't get so overloaded that it won't recover. 
 If you don't handle it specifically in your code, my guess is that a timeout 
should turn into an HTTP 503.
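
For illustration, a minimal sketch of turning a client timeout into a 503 at the web tier; the servlet, the injected client field, and the key parameter are assumptions made for the example, not part of anyone's actual application:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.OperationTimeoutException;

public class CacheFrontServlet extends HttpServlet {
    private transient MemcachedClient cacheClient; // assumed to be wired up elsewhere

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            // The synchronous get() wraps CheckedOperationTimeoutException in the
            // unchecked OperationTimeoutException.
            Object value = cacheClient.get(req.getParameter("key"));
            resp.getWriter().print(value != null ? value : "");
        } catch (OperationTimeoutException e) {
            // Fail fast with 503 instead of letting request threads pile up behind the cache.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "cache timeout");
        }
    }
}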

Original comment by ingen...@gmail.com on 7 Feb 2011 at 9:03

GoogleCodeExporter commented 9 years ago
Thanks for your response Ingen... I was misinterpreting the timeouts I was seeing from the spymemcached-stress application as evidence of a continuing problem, but I see that it just causes timeouts from high GC. I have tried to replicate the issue with a non-recoverable memcached connection, but in my tests the connection does recover.

Original comment by radost...@gmail.com on 9 Feb 2011 at 1:37

GoogleCodeExporter commented 9 years ago
No problem, I'm glad things look better for you.

It is a good point though; I should try testing this under a webapp in Tomcat to provide the right kind of pattern for folks.

Original comment by ingen...@gmail.com on 9 Feb 2011 at 5:44

GoogleCodeExporter commented 9 years ago
I can verify that this fix makes things better. We are running spymemcached 2.4.2 in our production environment and see this issue from time to time. We recreated it in a simulated performance test, and upgrading to the JAR referenced above (with the fix for this bug) cut the number of errors in half for us, and the system recovered without needing a restart. I feel comfortable attributing the remaining timeouts to GC, CPU spikes, etc.

Thanks for fixing this.  We will be testing this extensively and rolling it out 
in 2 weeks.

Original comment by brandon....@gmail.com on 15 Feb 2011 at 10:23

GoogleCodeExporter commented 9 years ago
Excellent, thanks for the validation Brandon.  Ironically, in another window, I 
was just drafting an email about the 2.6rc1 being posted.  If you can test with 
it and provide any feedback, I would appreciate it.

Original comment by ingen...@gmail.com on 15 Feb 2011 at 10:32

GoogleCodeExporter commented 9 years ago
Sure thing.  I've upgraded to 2.6rc1.  If all goes well, we will be taking this 
to production on March 1st.  I'll let you know how it works out there once it 
gets some real stress testing :)

Original comment by brandon....@gmail.com on 16 Feb 2011 at 2:34

GoogleCodeExporter commented 9 years ago
Is this bug fix included in 2.6rc1?

Original comment by ysob...@gmail.com on 5 Apr 2011 at 6:29

GoogleCodeExporter commented 9 years ago
Yes.  We had unrelated issues on March 1st, so had to postpone this upgrade 
until March 24th.  We've been running with it since.  Under extremely high CPU 
load, I still see timeout errors, but far fewer of them than before.  I think 
the remaining errors are my fault.

What I saw before was that the errors themselves perpetuated the high CPU utilization.

Original comment by brandon....@gmail.com on 6 Apr 2011 at 7:02

GoogleCodeExporter commented 9 years ago
I am seeing timeouts using the 2.6 version. We have a loader process that exclusively does bulk gets / bulk writes (via CacheLoader), and we see the issue more there. It doesn't seem to be a real timeout that causes the problem; we can put an arbitrarily large value for the global operation timeout and still see the issue.

We see the issue sporadically in our other processes that do a more mixed bag of memcached operations, and there it tends to correct itself faster than the previous version did.

Original comment by jonat...@gmail.com on 2 Jun 2011 at 7:58

GoogleCodeExporter commented 9 years ago
@jonat...

The changes in 2.6 won't eliminate timeouts. I am guessing, since you're doing bulk loading, that you may be overwhelming the queues and the VM, which occasionally pushes operations beyond the default 2500 ms timeout value.

Since you've already tuned the operation timeout, I might recommend some VM 
tuning.  From my commit notes:
First, by default, garbage collection times may easily go over 1sec.
Testing with simple toy tests shows this quite clearly, even on
systems with lots of CPUs and a decent amount of memory.  Of course,
much of this can be controlled with GC tuning on the JVM.  With the
hotspot JVM, look to this whitepaper:
http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf

Testing showed the following to be particularly useful:
-XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=850

There is a CPU time tradeoff for this.

Even with these, testing showed some 1 second timeouts when GCs approached half a second. To use this software, though, we shouldn't expect people to have to tune the GC, so raising the default seems like the right thing to do.

Second, many systems use spymemcached on virtualized or cloud environments.
The processes running there do not have any guarantee of execution
time.  It'd be really unlikely for a thread to be starved for more than
a second, but it is possible and shouldn't make things stop.  Raising this
a bit will help.

Third, and perhaps most importantly, most people run applications on networks that do not offer any guarantee around response time. If the network is oversubscribed, even minor blips on the network can cause TCP retransmissions. While many TCP implementations ignore it, RFC 2988 specifies rounding up to 1 sec when calculating TCP retransmit timeouts. Blips will occur, and rather than force this seemingly synchronous get to time out, it may be better to just wait a bit longer by default.

I really need to extract this bug thread into a FAQ on timeouts.

Original comment by ingen...@gmail.com on 4 Jun 2011 at 4:33

GoogleCodeExporter commented 9 years ago
Sorry for waking up the thread after 3 years :) 

We are facing the timeout issue quite often, almost throughout the day.

What are the sizes of the op queue, read queue, and write queue?

Does initialising the queues help? Like,

// queue factory
builder.setOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));
builder.setReadOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));
builder.setWriteOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));

We are using spymemcached 2.8.2 and Couchbase client 1.2.1.
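
For context, a fuller sketch of how those factories plug into the client build; the 4096 capacity, the address, and the class/method names around the snippet above are illustrative assumptions, not recommendations:

import java.io.IOException;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.ops.ArrayOperationQueueFactory;

public final class BoundedQueueClientSketch {
    public static MemcachedClient build() throws IOException {
        int maxQueueSize = 4096; // illustrative bound, stands in for config.maxQueueSize
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder();
        // Bounded queues cap how many operations can be waiting in the client,
        // instead of letting the input queues grow without limit under load.
        builder.setOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize));
        builder.setReadOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize));
        builder.setWriteOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize));
        return new MemcachedClient(builder.build(),
                AddrUtil.getAddresses("127.0.0.1:11211"));
    }
}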

Original comment by sanaulla...@gmail.com on 19 Apr 2014 at 3:50

GoogleCodeExporter commented 9 years ago
@sana...

It depends on the cause of the timeouts, and we can't really identify that from here. What I can say is that under issue 136 we no longer send operations that have already timed out, which acts as a sort of pressure relief valve.

The important thing with timeouts is to think about what you'd do differently after the timeout fires. If you don't have anything to do differently and you just really want an answer, then perhaps you should use a larger timeout value.

I suspect it could have to do with JVM GC pauses or with a lack of memory for 
processing.  Can you correlate your timeouts to JVM GC pause logging?  Can you 
correlate it to high paging activity from vmstat or the like?  Also, check your 
logs to see if you have any connections being dropped and re-established.
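
As a starting point for that correlation, a minimal sketch of HotSpot GC logging flags from that era (the log path is illustrative):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/myapp/gc.log

Matching the GC pause timestamps in that log against the timestamps of the timeout exceptions usually makes it clear whether GC pauses are the culprit.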

Original comment by ingen...@gmail.com on 21 Apr 2014 at 10:46

GoogleCodeExporter commented 9 years ago
Hello Ingen. We are using a very old version of spymemcached and are seeing 
many of the issues mentioned above. Do you know if any of them have been 
addressed in the recent versions?

Original comment by kart...@traveltripper.com on 12 Aug 2014 at 9:16