Well, it may not be a resurgence of issue 35, as I found one other area of
testing that needed some fixing. After fixing that, I still get occasional
failures, which I am looking into, but I've posted a build that does fix the
problem I've talked about above.
It's available here:
http://code.google.com/p/spymemcached/downloads/detail?name=memcached-2.5-23-g7b7ae44.jar&can=2&q=
The javadoc/source are also in downloads:
http://code.google.com/p/spymemcached/downloads/list
The work in progress code is also on github:
https://github.com/ingenthr/java-memcached-client/commits/spy136WIP
It'll need some reformatting/documentation/squashing before it goes into a real
release, but I think this is going in the right direction.
Note that there is a new state for an operation, TIMEDOUT. This and the change
to the continuous timeout handling will likely trigger us to update to version
2.6, but without deprecating any existing API.
Original comment by ingen...@gmail.com
on 28 Dec 2010 at 3:43
I fixed the occasional test issue referred to above. The code I posted
yesterday also had a problem with asynchronous get operations, causing all of
them to run the full duration of any timeout passed to the get() of the Future.
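For context, this is the affected pattern: an application issuing an asynchronous get and bounding the wait on the returned Future. A minimal sketch (the client, key, and 2-second bound are placeholders, not from this thread):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import net.spy.memcached.MemcachedClient;

public class AsyncGetSketch {
    // Returns the cached value, or null if the lookup did not complete in time.
    static Object boundedGet(MemcachedClient client, String key) {
        try {
            // With the broken build described above, this call ran for the
            // full two seconds even when the response had already arrived.
            return client.asyncGet(key).get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return null; // treat as a cache miss and fall back to the source of truth
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null;
        }
    }
}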
The code is available here:
http://code.google.com/p/spymemcached/downloads/detail?name=memcached-2.5-26-g3cad4cb.jar&can=2&q=
And just like my last posting, the Javadoc/sources and branch on github have
been updated.
Original comment by ingen...@gmail.com
on 29 Dec 2010 at 5:20
It appears a few people have downloaded this and are giving it a shot. Please
post any feedback.
Also, if you're interested, you can review the (newly cleaned up) code commits
over here: http://review.membase.org/#q,status:open+project:spymemcached,n,z
Original comment by ingen...@gmail.com
on 3 Jan 2011 at 1:14
Original comment by ingen...@gmail.com
on 3 Jan 2011 at 1:20
We tried the jar given in Comment 52 and we still see the problem when using
the spymemcached-stress application. Has this been fixed somewhere?
Original comment by radost...@gmail.com
on 2 Feb 2011 at 5:54
We took the jar dated 02 Jan and we still encounter timeouts under heavy load.
By the way, it seems that the client doesn't reconnect itself anymore after a
restart of memcached (and the CPU rises to 100%).
Original comment by cyril.d...@gmail.com
on 2 Feb 2011 at 6:59
Yes, using this client has taken down our server twice. I'm in the middle of
swapping our client out for xmemcached, since having our server go down is
unacceptable.
Original comment by radost...@gmail.com
on 2 Feb 2011 at 7:24
@radost... others have said it has helped and I was going to put together a
release candidate. Can you give me some more detail on how it crashed your
server?
Original comment by ingen...@gmail.com
on 4 Feb 2011 at 1:46
ingen... our server is on the stock 2.5 and threw the
net.spy.memcached.internal.CheckedOperationTimeoutException error after running
for about a week. All the threads on the app server backed up even though we
had a 2 second timeout and there is a global 1 second timeout. The memcached
server was unaffected; after we restarted the app server, the application
continued on fine, so it is the same thing others are seeing where the client
just can't reconnect. I would have thought that with the timeouts the server
would have continued on, just slower, but the threads were all backed
up/not responding. We tested your supplied jar from comment 52 using
the http://github.com/stevenschlansker/spymemcached-stress application and
again saw the timeout exceptions. It took less than a day to see the
exceptions. If I have time this weekend I will try it again. Incidentally,
have you tried bisecting to the offending commit using the
spymemcached-stress application?
Original comment by radost...@gmail.com
on 5 Feb 2011 at 1:35
@radost...
First, a couple of answers to your questions. I did look at the
spymemcached-stress app, and found it was much like a built-in test that Dustin
had added to ensure the client recovered after timeouts. I spent a lot of time
with that test and my modifications to ensure that after a timeout, the client
does recover. As I said above, realistically it will be impossible to
completely eliminate ever receiving a timeout.
It sounds like your server is in an environment where you regularly saturate
your resources and you just want it to continue to handle things, only slower.
If that's the case, you can increase the timeout value.
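For example, a minimal sketch of raising the per-operation timeout through ConnectionFactoryBuilder (the address and the 5000 ms value are placeholders; adjust to your environment):

import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public class LongerTimeoutSketch {
    public static MemcachedClient build() throws Exception {
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
            .setOpTimeout(5000); // per-operation timeout in milliseconds
        return new MemcachedClient(builder.build(),
            AddrUtil.getAddresses("memcached-host:11211"));
    }
}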
Regarding the "server continuing on, just slower", that's actually not the
case. You can hit a timeout for a number of reasons (network issues, JVM
work). I would expect that when you see a timeout, you'll see a big series of
them, and then it will stabilize and come back to normal. There is no built-in
backoff to slow down the app to reduce the number of timeouts seen before
recovery. That could be done at the application level.
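One possible shape for that application-level handling, sketched here around a synchronous get; the retry count and sleep values are arbitrary and not part of the library:

import net.spy.memcached.MemcachedClient;
import net.spy.memcached.OperationTimeoutException;

public class BackoffSketch {
    // On a timeout, back off briefly before retrying so a burst of timeouts
    // does not keep hammering an already-saturated client.
    static Object getWithBackoff(MemcachedClient client, String key)
            throws InterruptedException {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                return client.get(key);
            } catch (OperationTimeoutException e) {
                Thread.sleep(200L << attempt);
            }
        }
        return null; // give up and treat as a cache miss
    }
}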
Finally, on the question of bisecting, there isn't an offending commit per se.
The latch and timeout were added when the client appeared to hang forever in
certain failure cases. It's correct for them to be there, but what I believe
to be incorrect was the very low default timeout value and the fact that we
did not count the timeout from operation creation and skip sending an operation
over the network if it had already timed out. That behavior would more likely
create a scenario with an extended run of timed-out operations rather than a
burst of them.
I expect to cover what I've put in this commit in the release notes:
https://github.com/dustin/java-memcached-client/commit/0e1ebdb844b11f141e389ef584288a39219512a8
Some of the JVM tuning in that commit message may help you.
Please understand, I really want to fix this and have put quite a bit of time
into looking into it. If there's a scenario where the client is breaking
servers, we'll need to find a better way to handle that. It sounds like, from
what you say above, you'll need to turn down the timeout or restrict the number
of client threads to ensure it doesn't get so overloaded that it won't recover.
If you don't handle it specifically in your code, my guess is that a timeout
should turn into an HTTP 503.
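As an illustration of that last point, a servlet could map the client's timeout exception to a 503 roughly like this (a hedged sketch; the servlet and request parameter are hypothetical):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.OperationTimeoutException;

public class CacheBackedServlet extends HttpServlet {
    private MemcachedClient client; // assume this is set up in init()

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            Object value = client.get(req.getParameter("key"));
            resp.getWriter().print(value);
        } catch (OperationTimeoutException e) {
            // Shed load rather than letting request threads pile up.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "cache timeout");
        }
    }
}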
Original comment by ingen...@gmail.com
on 7 Feb 2011 at 9:03
Thanks for your response Ingen... I was misinterpreting the timeouts I was
seeing from the spymemcached-stress application as evidence of a continuing
problem, but I see that it just causes timeouts from high GC. I have tried to
replicate the issue with a non-recoverable memcached connection, but in my
tests the connection does recover.
Original comment by radost...@gmail.com
on 9 Feb 2011 at 1:37
No problem, I'm glad things look better for you.
It is a good point though; I should try testing this under a webapp in Tomcat
to provide the right kind of pattern for folks.
Original comment by ingen...@gmail.com
on 9 Feb 2011 at 5:44
I can verify that this fix makes things better. We are running spymemcached
2.4.2 in our production environment and see this issue from time to time. We
have recreated it in a simulated performance test, and upgrading to the JAR
referenced above (with the fix for this bug) cut the number of errors in half
for us and the system recovered without needing a restart. I feel comfortable
attributing the remaining timeouts to GC, CPU spikes, etc.
Thanks for fixing this. We will be testing this extensively and rolling it out
in 2 weeks.
Original comment by brandon....@gmail.com
on 15 Feb 2011 at 10:23
Excellent, thanks for the validation Brandon. Ironically, in another window, I
was just drafting an email about the 2.6rc1 being posted. If you can test with
it and provide any feedback, I would appreciate it.
Original comment by ingen...@gmail.com
on 15 Feb 2011 at 10:32
Sure thing. I've upgraded to 2.6rc1. If all goes well, we will be taking this
to production on March 1st. I'll let you know how it works out there once it
gets some real stress testing :)
Original comment by brandon....@gmail.com
on 16 Feb 2011 at 2:34
Is this bug fix included in 2.6rc1?
Original comment by ysob...@gmail.com
on 5 Apr 2011 at 6:29
Yes. We had unrelated issues on March 1st, so had to postpone this upgrade
until March 24th. We've been running with it since. Under extremely high CPU
load, I still see timeout errors, but far fewer of them than before. I think
the remaining errors are my fault.
What I saw before was that the errors themselves perpetuated the high CPU
utilization.
Original comment by brandon....@gmail.com
on 6 Apr 2011 at 7:02
I am seeing timeouts using the 2.6 version. We have a loader process that
exclusively does bulk gets / bulk writes (via CacheLoader), and we see the
issue more there. It doesn't seem to be a real timeout that is the cause of the
problem; we can set an arbitrarily large value for the global operation timeout
and still see the issue.
We see the issue sporadically in our other processes that do a more mixed bag
of memcached operations, and there it tends to correct itself faster than the
previous version did.
Original comment by jonat...@gmail.com
on 2 Jun 2011 at 7:58
@jonat...
The changes in 2.6 won't eliminate timeouts. I am guessing, since you're doing
bulk loading, that you may be overwhelming the queues and the VM, which leads
to occasionally going beyond the default 2500ms timeout value.
Since you've already tuned the operation timeout, I might recommend some VM
tuning. From my commit notes:
First, by default, garbage collection times may easily go over 1sec.
Testing with simple toy tests shows this quite clearly, even on
systems with lots of CPUs and a decent amount of memory. Of course,
much of this can be controlled with GC tuning on the JVM. With the
hotspot JVM, look to this whitepaper:
http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf
Testing showed the following to be particularly useful:
-XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=850
There is a CPU time tradeoff for this.
Even with these, testing showed some 1 second timeouts when GCs neared
half a second. To use this software though, we shouldn't expect people
to have to tune the GC, so raising the default seems like the
right thing to do.
Second, many systems use spymemcached on virtualized or cloud environments.
The processes running there do not have any guarantee of execution
time. It'd be really unlikely for a thread to be starved for more than
a second, but it is possible and shouldn't make things stop. Raising this
a bit will help.
Third, and perhaps most importantly, most people run applications on
networks that do not offer any guarantee around response time. If
the network is oversubscribed, even minor blips on the network
can cause TCP retransmissions. While many TCP implementations ignore
it, RFC 2988 specifies rounding up to 1sec when calculating
TCP retransmit timeouts. Blips will occur, and rather than force
this seemingly synchronous get to timeout, it may be better to
just wait a bit longer by default.
I really need to extract this bug thread into a FAQ on timeouts.
Original comment by ingen...@gmail.com
on 4 Jun 2011 at 4:33
Sorry for waking up the thread after 3 years :)
We are facing the timeout issue quite often, almost throughout the day.
What are the sizes of the op queue, read queue, and write queue?
Does initialising the queues help? Like:
// queue factory
builder.setOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));
builder.setReadOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));
builder.setWriteOpQueueFactory(new ArrayOperationQueueFactory(config.maxQueueSize));
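For reference, a self-contained version of the snippet above might look roughly like this (assuming ConnectionFactoryBuilder and ArrayOperationQueueFactory from spymemcached; the address and queue size are placeholders):

import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.ops.ArrayOperationQueueFactory;

public class BoundedQueueSketch {
    public static MemcachedClient build(int maxQueueSize) throws Exception {
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
            // Bound the input, read and write queues rather than using the defaults.
            .setOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize))
            .setReadOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize))
            .setWriteOpQueueFactory(new ArrayOperationQueueFactory(maxQueueSize));
        return new MemcachedClient(builder.build(),
            AddrUtil.getAddresses("memcached-host:11211"));
    }
}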
We are using spymemcached 2.8.2 and Couchbase client 1.2.1.
Original comment by sanaulla...@gmail.com
on 19 Apr 2014 at 3:50
@sana...
It depends on the cause of the timeouts and we can't really identify that from
here. What I can say is that under issue 136 we no longer send operations that
are already timed out, which acts as a sort of pressure relief valve.
The important thing with timeouts is to think about what you'd do differently
after the timeout fires. If you don't have anything to do differently and you
just really want an answer, then perhaps you should use a larger timeout value.
I suspect it could have to do with JVM GC pauses or with a lack of memory for
processing. Can you correlate your timeouts to JVM GC pause logging? Can you
correlate it to high paging activity from vmstat or the like? Also, check your
logs to see if you have any connections being dropped and re-established.
Original comment by ingen...@gmail.com
on 21 Apr 2014 at 10:46
Hello Ingen. We are using a very old version of spymemcached and are seeing
many of the issues mentioned above. Do you know if any of them have been
addressed in the recent versions?
Original comment by kart...@traveltripper.com
on 12 Aug 2014 at 9:16
Original issue reported on code.google.com by
S.Ille...@gmail.com
on 27 Apr 2010 at 4:30