I am also facing the same kind of issue. Does anyone have any idea about it?
Original comment by suni...@gmail.com
on 29 Apr 2010 at 10:09
Similar issue here too.
Memcached server: 1.4.5
Spymemcached client: 2.5
Error message: net.spy.memcached.OperationTimeoutException: Timeout waiting for value
net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
Using get(key), not asyncGet. The operation timeout is the default value. I increased the value and it still gave the same error.
Thanks
Original comment by patil.sh...@gmail.com
on 1 Jun 2010 at 3:19
I get it, but only with the newer version of the Java JDK; when I roll back to JDK 1.6.0_16 the issue goes away?
Original comment by mikeb...@gmail.com
on 9 Aug 2010 at 10:07
I can confirm this issue on server 1.4.2 and client 2.5, with this backtrace:
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /127.0.0.1:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
Original comment by stevensc...@gmail.com
on 13 Oct 2010 at 8:06
Same issue here, same configuration: memcached 1.4.5 and spymemcached 2.5 (though this was present in 2.4.2 as well). Consistent timeouts on synchronous gets. Java 1.6.0_20 64-bit on RHEL 5.
We use a 2 second timeout -- which is way too long for pulling data from a cache; it can be retrieved from the database in less time. Seems somewhat correlated to load.
Original comment by exort...@gmail.com
on 18 Oct 2010 at 4:17
I've also observed CheckedOperationTimeoutException in a production environment when the throughput is high.
Configuration: spymemcached 2.5, memcached server 1.4.2, java 1.6.0_20 64-bit.
This bug is critical for me, and currently when the problem occurs the only solution is to restart the application server (JBoss 5.1) and try to warm it up slowly.
Is there any plan to fix this bug? Otherwise I'll have to migrate to another Java memcache library :(.
Original comment by ragno...@gmail.com
on 18 Oct 2010 at 9:25
exortech, ragnor84: can you see if 1.6.0_16 makes the issue go away?
I'm not saying it's a solution, but it'd be good to know if some update in the
JDK caused the difference. Then we can research what's changed in the JDK.
Original comment by ingen...@gmail.com
on 18 Oct 2010 at 9:23
we were having this issue under 1.6.0_14 as well, over 6 months ago now. i
don't think that it's tied to the version of the JDK. i think that it is a
timing/race condition bug in the library. requests just get orphaned
periodically - i'm not sure if the call is ever even made to memcached when
this happens.
Original comment by exort...@gmail.com
on 18 Oct 2010 at 10:47
I can't downgrade Java in the production environment, so I can't check whether it helps, but I don't think it's a problem with the JDK. I agree with exortech that this could be some problem with orphaned requests.
I've observed one strange thing; maybe it can somehow help in resolving this issue. When the problem occurs, CPU load increases from 100% to 200% (on an 8-core server), but even if I stop sending requests to the server, the CPU load stays around 100%; normally it should be around 5%.
In the production environment I have several WARs, each using spymemcached independently (a 1-to-1 relation), so each server runs several spymemcached instances. Is that correct, or should there be only one spymemcached instance per server?
Original comment by ragno...@gmail.com
on 19 Oct 2010 at 7:24
I can reproduce the problem even with only a single spymemcached instance.
Original comment by stevensc...@gmail.com
on 19 Oct 2010 at 6:34
I've written up a short test case here:
http://github.com/stevenschlansker/spymemcached-stress
Can someone confirm it works for them? It usually takes an hour or two to
fail, so you might need to run it overnight or something...
Original comment by stevensc...@gmail.com
on 20 Oct 2010 at 10:28
@ragnor84 having multiple instances should be no problem
@stevenschlansker thanks for the additional info, will look into this
Original comment by ingen...@gmail.com
on 21 Oct 2010 at 1:52
If it helps, we only have this problem when issuing GETs to memcached. We
output the status of the nodes in the cluster and they consistently have 0 or
maybe 1 item in the queues.
[{QA sa=xxxxx, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0,
interested=0}, {QA sa=xxxx, #Rops=0, #Wops=0, #iq=2, topRop=null, topWop=null,
toWrite=0, interested=0}, ]
My original assumption was that the queues were getting backed up which was
causing the timeouts, but that doesn't seem to be the case.
Original comment by exort...@gmail.com
on 21 Oct 2010 at 7:06
that's odd. perhaps it's a simple off-by-one kind of a bug. Thanks for the
info @exortech. will look into it further. was thrown off course today, but
should be able to put some time into this tomorrow
Original comment by ingen...@gmail.com
on 22 Oct 2010 at 5:16
Yeah, I'm seeing a similar error too. I set up 2 caches and kill one in the middle of use. The client attempts to redistribute to the other but continually gets immediate timeouts. There aren't even any cache operation requests hitting the cache process.
Original comment by ibbum...@gmail.com
on 28 Oct 2010 at 9:51
The behavior I'm also seeing along with this is that when the redistribution of operations occurs, the operations get cancelled, which causes a "Cancelled" RuntimeException to be thrown by OperationFuture (see OperationFuture line 75).
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException:
Cancelled
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:75)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 66 more
Caused by: java.lang.RuntimeException: Cancelled
... 69 more
After the initial cancelled exception I start to get many timeout errors. If I catch this error outside of the MemcachedClient and retry the get, then the cache request seems to go to the next memcached node just fine for that request.
Here's my connection factory config:
ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
    .setHashAlg(HashAlgorithm.NATIVE_HASH)
    .setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
    .setProtocol(ConnectionFactoryBuilder.Protocol.TEXT)
    .setFailureMode(FailureMode.Redistribute)
    .setDaemon(true);
I am connecting to 2 cache nodes and killing one explicitly to test connection failure scenarios, and that seems to trigger this issue.
If we clone the operations in this redistribution, doesn't the waiting future then lose its reference? It would seem that this whole redistribution process would be moot for gets if that were the case.
Original comment by ibbum...@gmail.com
on 29 Oct 2010 at 1:10
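A minimal sketch of how a builder configured like the one above might be wired into a two-node client; the host names here are placeholder assumptions, not values from the report:
import java.io.IOException;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.HashAlgorithm;
import net.spy.memcached.MemcachedClient;

public class TwoNodeClient {
    public static MemcachedClient build() throws IOException {
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
            .setHashAlg(HashAlgorithm.NATIVE_HASH)
            .setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
            .setProtocol(ConnectionFactoryBuilder.Protocol.TEXT)
            .setFailureMode(FailureMode.Redistribute)
            .setDaemon(true);

        // Two nodes; killing one exercises the Redistribute failure mode.
        return new MemcachedClient(builder.build(),
            AddrUtil.getAddresses("cache1.example.com:11211 cache2.example.com:11211"));
    }
}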
I can confirm seeing the RuntimeException: Cancelled on a single node with no failover (albeit rarely), and it seems to be correlated (in a happens-at-the-same-time way) with the other error (the timeout).
Original comment by stevensc...@gmail.com
on 29 Oct 2010 at 1:15
Actually I think the specific issue I'm seeing relates more to issue 108 - sorry for the spam in this thread.
Original comment by ibbum...@gmail.com
on 2 Nov 2010 at 7:19
@ingenthr: Any news? Eagerly awaiting a fix :)
Original comment by stevensc...@gmail.com
on 9 Nov 2010 at 11:29
As you may have seen from my activity today, it's a Spy day. Been looking into
it. Nothing conclusive yet though.
Original comment by ingen...@gmail.com
on 10 Nov 2010 at 3:11
Some further observations, if it helps:
- The problem doesn't seem to happen with bulk gets, though it may just be that they are called many fewer times than single gets. So, if there are opportunities to switch from synchronous gets to a bulk get, it may help lessen the impact of this problem (see the sketch after this comment).
- I added additional logging to view the state of the timed-out operation. It would be really nice if the various classes implementing the Operation interface implemented toString(). Anyway, it seems like the Operation has transitioned to the Complete state; however, the latch was not released. The buffer is null and there was no exception.
Original comment by exort...@gmail.com
on 10 Nov 2010 at 6:24
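A minimal sketch of the bulk-get suggestion above, assuming a single local memcached node; the key names are illustrative only:
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class BulkGetSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Instead of three separate synchronous client.get(...) round trips...
        List<String> keys = Arrays.asList("user:1", "user:2", "user:3");

        // ...issue a single bulk get; keys that miss are simply absent from the map.
        Map<String, Object> values = client.getBulk(keys);

        for (String key : keys) {
            System.out.println(key + " -> " + values.get(key));
        }
        client.shutdown();
    }
}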
@ingenthr: any update after your investigation, please?
Original comment by suni...@gmail.com
on 10 Nov 2010 at 4:22
Mike just ran into this in testing. We're waiting on a latch that is timing
out. In Mike's case, it appears that calling isDone() before waiting on the
latch makes the problem go away (or at least become significantly less
frequent).
The latch logic looks pretty clean, though it's possible the timeout isn't
quite working as we expect, or is otherwise causing some confusion. Either
way, more clues.
Full stack follows:
net.spy.memcached.OperationTimeoutException: Timeout waiting for value
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
at com.yahoo.ycsb.db.SpymemcachedClient.read(SpymemcachedClient.java:78)
at com.yahoo.ycsb.DBWrapper.read(DBWrapper.java:86)
at com.yahoo.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:438)
at com.yahoo.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:402)
at com.yahoo.ycsb.ClientThread.run(Client.java:225)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /10.1.5.102:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 6 more
Original comment by dsalli...@gmail.com
on 10 Nov 2010 at 10:29
Here are some workarounds for this bug.
1. Poll the future with isDone() before reading its value:
while (!future.isDone())
    Thread.sleep(1);
Note that this will cause problems if, for example, the server is down, because the while statement basically becomes an infinite loop.
2. Increase the timeout in the DefaultConnectionFactory to more than 1 second. Note that this workaround doesn't technically get rid of the error; it just lowers the chances that you will see it again. (I personally have never seen an operation take more than 5 seconds, so if your timeout is greater than that you shouldn't see this error much.)
A combined sketch of both workarounds follows below.
Original comment by mikewie...@gmail.com
on 11 Nov 2010 at 12:41
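For illustration, a minimal, self-contained sketch of both workarounds against a single local node; the 5-second values, the key name, and the bounded polling loop are assumptions added for the example, not recommendations from this thread:
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public class TimeoutWorkarounds {
    public static void main(String[] args) throws Exception {
        // Workaround 2: raise the default operation timeout (milliseconds).
        MemcachedClient client = new MemcachedClient(
            new ConnectionFactoryBuilder().setOpTimeout(5000).build(),
            Arrays.asList(new InetSocketAddress("localhost", 11211)));

        Future<Object> future = client.asyncGet("some-key");

        // Workaround 1, bounded so a dead server cannot turn it into an infinite loop.
        long deadline = System.currentTimeMillis() + 5000;
        while (!future.isDone() && System.currentTimeMillis() < deadline) {
            Thread.sleep(1);
        }

        Object value = future.isDone() ? future.get(1, TimeUnit.SECONDS) : null;
        System.out.println("value = " + value);
        client.shutdown();
    }
}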
Increasing the timeout makes more sense than just avoiding it, but we do need
to understand exactly why this is happening occasionally in the first place.
Original comment by dsalli...@gmail.com
on 11 Nov 2010 at 1:27
Could this be related to excessive garbage collection and the fact that GC pauses the JVM during certain points of the garbage collection process?
Let's say your timeout on the future objects is less than the time spent in GC. When GC kicks in and pauses the JVM for more time than the timeout, that will cause all the active threads to time out.
What GC are you guys using? Has anyone tried to compare how spymemcached behaves with different GC collectors? Anyone tried the concurrent low-pause collector?
Original comment by S.Ille...@gmail.com
on 11 Nov 2010 at 10:11
Good call! I turned on -verbose:gc and found:
[GC 852538K->835568K(936640K), 0.0018248 secs]
[GC 852592K->835550K(936640K), 0.0018149 secs]
[GC 852574K->835611K(936640K), 0.0018358 secs]
[GC 852635K->835660K(936640K), 0.0020573 secs]
[GC 852684K->835639K(936640K), 0.0022069 secs]
[GC 852663K->835709K(936640K), 0.0020112 secs]
[GC 852733K->835754K(936640K), 0.0020629 secs]
[GC 842670K->835773K(936640K), 0.0019726 secs]
[Full GC 835773K->835580K(936640K), 1.8937889 secs]
Exception in thread "Thread-5" net.spy.memcached.OperationTimeoutException:
Timeout waiting for value
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
at org.sugis.memcache.MemcacheStress.runUnsafe(MemcacheStress.java:66)
at org.sugis.memcache.MemcacheStress.run(MemcacheStress.java:46)
at java.lang.Thread.run(Thread.java:680)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /127.0.0.1:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 4 more
Perhaps a good solution would be: on latch timeout, first check whether the operation has actually completed in the background before throwing the timeout exception?
Original comment by stevensc...@gmail.com
on 11 Nov 2010 at 7:29
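A rough caller-side sketch of that idea, assuming you wrap the future yourself; this is not how OperationFuture is actually implemented, just an illustration of the check:
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class CheckedGet {
    // Wait up to timeoutMillis; if the wait times out but the operation has
    // in fact completed in the background, retry the read briefly instead of
    // propagating the timeout straight away.
    public static Object get(Future<Object> future, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException timedOut) {
            if (future.isDone()) {
                return future.get(1, TimeUnit.MILLISECONDS);
            }
            throw timedOut;
        }
    }
}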
Note that this also still happens with -XX:+UseConcMarkSweepGC.
Original comment by stevensc...@gmail.com
on 11 Nov 2010 at 8:10
I think I have found the reason for the exception. When looking through the packet trace from a program that generates CheckedOperationTimeoutExceptions, I was able to see that each operation that threw the exception timed out due to TCP retransmission. On OS X the TCP retransmission timeout was 1 second, meaning any time there was packet loss or packet delay in the network it took at least a second for the packet to be retransmitted. Since the current default timeout in spymemcached is 1 second, we see exceptions.
A note on the post above also: I ran tests and tried to reproduce the error by checking whether the garbage collector was running, and did not see any excessive garbage collection or errors. I'm not saying that garbage collection couldn't cause the problem, but I think that there are other reasons for failure as well.
Original comment by mikewie...@gmail.com
on 13 Nov 2010 at 12:51
I suspect that this issue has nothing to do with the GC, other than that the GC
is slowing down processing on the machine so as to make this timing issue more
likely.
I suspect that it is related to the complete() method in MultiOperationCallback:
if (--remaining == 0) {
    originalCallback.receivedStatus(mostRecentStatus);
    originalCallback.complete();
}
I haven't verified it yet, but I suspect that remaining is less than zero. As I
mentioned above, the state of the Operation is always (as far as I've seen) set
to OperationState.COMPLETE. It is just that the latch is not released. I
believe that this is because the wrapping MultiOperationCallback does not pass
the complete call through to the decorated Callback.
Original comment by exort...@gmail.com
on 15 Nov 2010 at 6:55
I think that the issue is that --remaining is not thread-safe. If two threads
invoke complete() concurrently then the remaining variable will not be
decremented correctly. AtomicInteger should be used instead.
Original comment by exort...@gmail.com
on 15 Nov 2010 at 4:58
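A minimal sketch of that suggestion; the names mirror the snippet above, but this is a hypothetical illustration, not the library's actual MultiOperationCallback:
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper showing the thread-safe countdown idea.
class SafeMultiCallback {
    private final AtomicInteger remaining;
    private final Runnable originalComplete;

    SafeMultiCallback(int expectedCompletions, Runnable originalComplete) {
        this.remaining = new AtomicInteger(expectedCompletions);
        this.originalComplete = originalComplete;
    }

    // Each per-operation completion calls this; exactly one caller sees the
    // counter hit zero, even when completions race across threads.
    void complete() {
        if (remaining.decrementAndGet() == 0) {
            originalComplete.run();
        }
    }
}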
[deleted comment]
This seems to happen more often when accessing large objects.
I have a test app that uses 100 threads and gets 5 different objects randomly. If the objects are relatively small (10-30 KB) then I have no exceptions at all, but when I replaced one object with a big 1 MB string the exceptions started to occur.
This issue seems to be related to thread starvation, but I'm not sure whether it's caused by excessive GC, by spymemcached using too much CPU to deserialize objects, or by something else.
Original comment by S.Ille...@gmail.com
on 15 Nov 2010 at 5:33
The issues above seem valid and I am going to begin looking into them. exortech, S.Illes, could you guys verify that you are not seeing packet retransmission during these operations? I have a way to check the value of remaining throughout the execution of the program and I will report back on this. Also, I have been experiencing this problem on all operations and you guys just commented on multiget. Are you only seeing issues with multi-get?
Original comment by mikewie...@gmail.com
on 16 Nov 2010 at 5:25
This is an issue both for get and multiget.
Original comment by exort...@gmail.com
on 16 Nov 2010 at 5:57
I have an idea where the problem is and wrote a possible fix. I'll get with Mike and test it.
Is it possible Mike's speculation about a TCP retransmit is occurring?
Original comment by ingen...@gmail.com
on 16 Nov 2010 at 6:29
Mike, how do I check for TCP retransmission? I tried using Wireshark and I see lots of "MEMCACHE [TCP ACKed lost segment] Get Request" entries. I'm no TCP expert at all.
Also please note that in my case memcached runs on localhost.
Original comment by S.Ille...@gmail.com
on 16 Nov 2010 at 11:15
If you are running on localhost you will not see any retransmission errors. Retransmissions only take place over a lossy network, and since you are running on localhost this proves that there is a code problem within spymemcached, because communication between a client and server running on the same machine should be near instantaneous and have no packet loss.
I will take a look at exortech's fix and rerun my tests. It sounds like these retransmissions, while possibly problematic, are not the main issue.
Original comment by mikewie...@gmail.com
on 16 Nov 2010 at 6:49
just a quick update...
I have a good way to generate this, but I think we may need to focus a bit on
what the bug here is.
The specific bug I'm looking at right now is intermittent timeout when running
without a physical network. This is one source of timeouts which shouldn't
really exist. I think this can be addressed.
However, as Mike points out, there will be other sources for timeouts, so those
will need to be handled by the callers.
Also, the concern is that some people are reporting crashing/hanging app
servers. Are occasional timeouts causing this kind of issue? Or is it more
that there is a stream of them in a row and there is no recovery?
@ragnor84 why is the restart of the app server necessary?
I just want to be realistic: we can reduce these, but I don't think they can be eliminated.
Original comment by ingen...@gmail.com
on 22 Nov 2010 at 7:35
When the problem occurred, CPU load increased from 100% to 200% (8 cores) and never dropped below 100%, even after I stopped sending requests to the server. So something other than my business logic was generating 100% CPU load on the server.
Sometimes it looks like when the throughput is high (about 10,000-15,000 requests/s from one spymemcached client) and one timeout occurs, then soon a second, a third, and more timeouts occur - a domino effect. After a few seconds most requests get timed out, the application can't be used by users, and the only way to stop the timeouts (I've waited up to 30 minutes) is to stop all traffic, restart the server, and warm it up slowly.
Currently I can't reproduce it because, due to this issue, I've migrated to xmemcached.
I hope someone fixes this so I can use spymemcached again.
Original comment by ragno...@gmail.com
on 22 Nov 2010 at 10:24
[deleted comment]
[deleted comment]
final String key = "key";
final int SET_OPERATIONS = 50000;
for (int i = 0; i < SET_OPERATIONS; i++)
    memcachedClient.set(key + i, 10 * 60, Math.random());
assertNotNull(memcachedClient.get(key + (SET_OPERATIONS - 1)));
When memcachedClient.get(key) is called, a timeout exception is thrown.
[spymemcached 2.5]
Original comment by 6marti...@gmail.com
on 30 Nov 2010 at 9:09
@6martina6
Your test there is basically a garbage generator, since an object is created
for every request. Unless you've done some GC tuning, chances are high that
the timeout you see correlates to garbage collection.
Original comment by ingen...@gmail.com
on 7 Dec 2010 at 6:56
I'm sorry this issue has lived on so long. I will post some more tomorrow, but
the good news is that I've been able to make good progress and can document
what I believe to be the answers. I'll probably also ask some of you for help
in testing.
Just to summarize quickly, the way things had been implemented, if a timeout
occurred for any reason (JIT, GC, network retransmit) it was possible for more
operations to stack up which would also timeout, but they would still be
transmitted over the network. This meant when things got bad for a little bit,
they'd have to get really good for a while to have a chance at catching up.
I have a fix that will ensure this doesn't happen. I expect that in cases like
ragnor84's, we'll still see a big "bubble" of timeouts, but things will recover
very quickly.
The other thing I want to do is document a bit better how to set the default
timeout with a ConnectionFactoryBuilder. I also want to test and document a
few JVM params that could be tried. I may need help with this too.
None of this will insulate you from timeouts though. Running in a JVM on a
system (especially if it's virtualized) on a TCP network you can control a lot
of things, but most of them don't offer guarantees out of the box. In some
cases, you can really never get a complete guarantee.
That said, I believe it can be better so we'll make it better.
Original comment by ingen...@gmail.com
on 7 Dec 2010 at 7:10
Subsequent to my posting here the other day, we've been running my code changes
in a long term test. Looks good so far. Also, this exposed some issues with
existing tests that I'm in the process of cleaning up.
If anyone else wants to help test, please let me know.
Original comment by ingen...@gmail.com
on 9 Dec 2010 at 5:38
Good to see progress on this issue. I would like to help with testing.
Original comment by ilkinulas
on 12 Dec 2010 at 6:07
We are using spymemcached for a reasonably high volume website and are
experiencing this issue. In particular, comment 40 is eerily familiar.
We have 5 memcached servers and 9 clients. At seemingly random times, but usually under decent load (12-15 page requests per second per client, or roughly 200 memcached requests per second per client), CPU on all of the application servers increases quite noticeably, until 1 or more servers goes to 100% CPU and all application threads are hung waiting for spymemcached.
Strangely, with 1 or 2 servers completely crashing, the other servers often
seem to benefit, with decreased CPU usage and faster execution times.
After restarting the application on a failed server, we have to warm it up very
slowly, a few requests at a time, before reintroducing the server into the load
balancer. Otherwise we'll get hung requests almost immediately. Sometimes even
after "warming up" we'll get hung requests within 2-10 minutes. If the server
is stable for an hour or so, it will probably be fine for a day or two at
least, until the issue recurs on the same or another server.
If possible, my team would love to participate in testing the fix for this, as
our other options are fairly unattractive.
Original comment by brianchr...@gmail.com
on 14 Dec 2010 at 9:36
great there are some folks willing to check it out :)
I was trying to get something out today, but it'll take until tomorrow.
Mike has been running the one above long term, but when I wrote some tests I
found some others were broken, so it's been a steady process of getting good
tests and correcting existing ones. It turns out there were a number of tests
that probably should have been timing out, but weren't because there was no
real check.
@brian: the fix, I believe, could help with your situation. I've only had toy tests to try it with, though I did try it with a large number of threads on a system with many CPUs.
The only concern I have is that what you describe could also be associated with VM memory pressure. Have you tried logging GCs to see if you can correlate them?
More specifically, if under load you have higher latencies, you'd start seeing
more CPU usage in garbage collection time on the tenured spaces. Dedicating
more to the young/new spaces may help keep that down.
That said, I do think my fixes help with relieving the pressure if you start
getting into a bad situation, so you should recover more quickly.
Original comment by ingen...@gmail.com
on 17 Dec 2010 at 6:54
My changes have had one test in particular failing for some time. At first, I
thought my changes were causing it, but after some investigation and discussion
with Dustin, it appears that it's a real resurgence of issue 35.
I can, and will, still post a test version of the client, but I'll need to fix
this issue too since it may be another cause of the same symptom.
Original comment by ingen...@gmail.com
on 27 Dec 2010 at 4:05
Original issue reported on code.google.com by
S.Ille...@gmail.com
on 27 Apr 2010 at 4:30