I am also facing the same kind of issue. Does anyone have any idea about it?
Original comment by suni...@gmail.com
on 29 Apr 2010 at 10:09
Similar issue here too.
Memcached server: 1.4.5
Spymemcached client: 2.5
Error message: net.spy.memcached.OperationTimeoutException: Timeout waiting for value
net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
Using get(key), not asyncGet. The operation timeout is the default value. I increased the value and it still gave the same error.
Thanks
Original comment by patil.sh...@gmail.com
on 1 Jun 2010 at 3:19
I get it, but only with the newer version of the Java JDK; when I roll back to JDK 1.6.0_16 the issue goes away?
Original comment by mikeb...@gmail.com
on 9 Aug 2010 at 10:07
I can confirm this issue on server 1.4.2 and client 2.5, with this backtrace:
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /127.0.0.1:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
Original comment by stevensc...@gmail.com
on 13 Oct 2010 at 8:06
Same issue here, same configuration: memcached 1.4.5 and spymemcached 2.5 (though this was present in 2.4.2 as well). Consistent timeouts on synchronous gets. Java 1.6.0_20 64-bit on RHEL 5.
We use a 2 second timeout -- which is way too long for pulling data from a cache; it can be retrieved from the database in less time. Seems somewhat correlated to load.
Original comment by exort...@gmail.com
on 18 Oct 2010 at 4:17
I've also observed CheckedOperationTimeoutException in a production environment when the throughput is high.
Configuration: spymemcached 2.5, memcached server 1.4.2, java 1.6.0_20 64-bit.
This bug is critical for me, and currently when the problem occurs the only solution is to restart the application server (JBoss 5.1) and try to warm it up slowly.
Is there any plan to fix this bug? Otherwise I'll have to migrate to another Java memcache library :(.
Original comment by ragno...@gmail.com
on 18 Oct 2010 at 9:25
exortech, ragnor84: can you see if 1.6.0_16 makes the issue go away?
I'm not saying it's a solution, but it'd be good to know if some update in the
JDK caused the difference. Then we can research what's changed in the JDK.
Original comment by ingen...@gmail.com
on 18 Oct 2010 at 9:23
we were having this issue under 1.6.0_14 as well, over 6 months ago now. i
don't think that it's tied to the version of the JDK. i think that it is a
timing/race condition bug in the library. requests just get orphaned
periodically - i'm not sure if the call is ever even made to memcached when
this happens.
Original comment by exort...@gmail.com
on 18 Oct 2010 at 10:47
I can't downgrade Java in the production environment, so I can't check whether it helps, but I don't think it's a problem with the JDK. I agree with exortech that this could be some problem with orphaned requests.
I've observed one strange thing; maybe it can somehow help in resolving this issue. When the problem occurs, CPU load increases from 100% to 200% (on an 8-core server), but even if I stop sending requests to the server, the CPU load stays around 100%; normally it should be around 5%.
In the production environment I have several WARs, each using spymemcached independently (a 1-to-1 relation), so each server runs several spymemcached instances. Is that correct, or should there be only one spymemcached instance per server?
Original comment by ragno...@gmail.com
on 19 Oct 2010 at 7:24
I can reproduce the problem even with only a single spymemcached instance.
Original comment by stevensc...@gmail.com
on 19 Oct 2010 at 6:34
I've written up a short test case here:
http://github.com/stevenschlansker/spymemcached-stress
Can someone confirm it works for them? It usually takes an hour or two to
fail, so you might need to run it overnight or something...
Original comment by stevensc...@gmail.com
on 20 Oct 2010 at 10:28
@ragnor84 having multiple instances should be no problem
@stevenschlansker thanks for the additional info, will look into this
Original comment by ingen...@gmail.com
on 21 Oct 2010 at 1:52
If it helps, we only have this problem when issuing GETs to memcached. We
output the status of the nodes in the cluster and they consistently have 0 or
maybe 1 item in the queues.
[{QA sa=xxxxx, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0,
interested=0}, {QA sa=xxxx, #Rops=0, #Wops=0, #iq=2, topRop=null, topWop=null,
toWrite=0, interested=0}, ]
My original assumption was that the queues were getting backed up which was
causing the timeouts, but that doesn't seem to be the case.
Original comment by exort...@gmail.com
on 21 Oct 2010 at 7:06
that's odd. perhaps it's a simple off-by-one kind of a bug. Thanks for the
info @exortech. will look into it further. was thrown off course today, but
should be able to put some time into this tomorrow
Original comment by ingen...@gmail.com
on 22 Oct 2010 at 5:16
Yeah, I'm seeing a similar error too. I set up 2 caches and kill one in the middle of use. The client attempts to redistribute to the other but continually gets immediate timeouts. There aren't even any cache operation requests hitting the cache process.
Original comment by ibbum...@gmail.com
on 28 Oct 2010 at 9:51
The behavior I'm also seeing along with this is that when the redistribution of operations occurs, the operations get cancelled, which causes a "Cancelled" RuntimeException to be thrown by OperationFuture (see OperationFuture line 75).
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException:
Cancelled
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:75)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 66 more
Caused by: java.lang.RuntimeException: Cancelled
... 69 more
After the initial cancelled exception I start to get many timeout errors. If I catch this error outside of the MemcachedClient and retry the get, then the cache request seems to go to the next memcached node just fine for that request.
Here's my connection factory config:
ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
    .setHashAlg(HashAlgorithm.NATIVE_HASH)
    .setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
    .setProtocol(ConnectionFactoryBuilder.Protocol.TEXT)
    .setFailureMode(FailureMode.Redistribute)
    .setDaemon(true);
I am connecting to 2 cache nodes and killing one explicitly to test connection failure scenarios, and that seems to trigger this issue.
If we clone the operations in this redistribution, doesn't the waiting future then lose its reference? It would seem that this whole redistribution process would be moot for gets if that were the case.
Original comment by ibbum...@gmail.com
on 29 Oct 2010 at 1:10
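A minimal sketch of how a builder configured like the one above might be wired into a two-node client; the host names here are placeholder assumptions, not values from the report:
import java.io.IOException;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.HashAlgorithm;
import net.spy.memcached.MemcachedClient;

public class TwoNodeClient {
    public static MemcachedClient build() throws IOException {
        ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder()
            .setHashAlg(HashAlgorithm.NATIVE_HASH)
            .setLocatorType(ConnectionFactoryBuilder.Locator.ARRAY_MOD)
            .setProtocol(ConnectionFactoryBuilder.Protocol.TEXT)
            .setFailureMode(FailureMode.Redistribute)
            .setDaemon(true);

        // Two nodes; killing one exercises the Redistribute failure mode.
        return new MemcachedClient(builder.build(),
            AddrUtil.getAddresses("cache1.example.com:11211 cache2.example.com:11211"));
    }
}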
I can confirm seeing the RuntimeException: Cancelled on a single node with no failover (albeit rarely), and it seems to be correlated (in a happens-at-the-same-time way) with the other error (the timeout).
Original comment by stevensc...@gmail.com
on 29 Oct 2010 at 1:15
Actually I think the specific issue I'm seeing relates more to issue 108 - sorry for the spam in this thread.
Original comment by ibbum...@gmail.com
on 2 Nov 2010 at 7:19
@ingenthr: Any news? Eagerly awaiting a fix :)
Original comment by stevensc...@gmail.com
on 9 Nov 2010 at 11:29
As you may have seen from my activity today, it's a Spy day. Been looking into
it. Nothing conclusive yet though.
Original comment by ingen...@gmail.com
on 10 Nov 2010 at 3:11
Some further observations, if it helps:
- The problem doesn't seem to happen with bulk gets, though it may just be that they are called many fewer times than single gets. So, if there are opportunities to switch from synchronous gets to a bulk get, it may help lessen the impact of this problem (see the sketch after this comment).
- I added additional logging to view the state of the timed-out operation. It would be really nice if the various classes implementing the Operation interface implemented toString(). Anyway, it seems like the Operation has transitioned to the Complete state; however, the latch was not released. The buffer is null and there was no exception.
Original comment by exort...@gmail.com
on 10 Nov 2010 at 6:24
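A minimal sketch of the bulk-get suggestion above, assuming a single local memcached node; the key names are illustrative only:
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class BulkGetSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Instead of three separate synchronous client.get(...) round trips...
        List<String> keys = Arrays.asList("user:1", "user:2", "user:3");

        // ...issue a single bulk get; keys that miss are simply absent from the map.
        Map<String, Object> values = client.getBulk(keys);

        for (String key : keys) {
            System.out.println(key + " -> " + values.get(key));
        }
        client.shutdown();
    }
}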
@ingenthr: any update after your investigation, please?
Original comment by suni...@gmail.com
on 10 Nov 2010 at 4:22
Mike just ran into this in testing. We're waiting on a latch that is timing
out. In Mike's case, it appears that calling isDone() before waiting on the
latch makes the problem go away (or at least become significantly less
frequent).
The latch logic looks pretty clean, though it's possible the timeout isn't
quite working as we expect, or is otherwise causing some confusion. Either
way, more clues.
Full stack follows:
net.spy.memcached.OperationTimeoutException: Timeout waiting for value
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
at com.yahoo.ycsb.db.SpymemcachedClient.read(SpymemcachedClient.java:78)
at com.yahoo.ycsb.DBWrapper.read(DBWrapper.java:86)
at com.yahoo.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:438)
at com.yahoo.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:402)
at com.yahoo.ycsb.ClientThread.run(Client.java:225)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /10.1.5.102:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 6 more
Original comment by dsalli...@gmail.com
on 10 Nov 2010 at 10:29
Here are some workarounds for this bug.
1. Poll the future with isDone() before reading its value:
while (!future.isDone())
    Thread.sleep(1);
Note that this will cause problems if, for example, the server is down, because the while statement basically becomes an infinite loop.
2. Increase the timeout in the DefaultConnectionFactory to more than 1 second. Note that this workaround doesn't technically get rid of the error; it just lowers the chances that you will see it again. (I personally have never seen an operation take more than 5 seconds, so if your timeout is greater than that you shouldn't see this error much.)
A combined sketch of both workarounds follows below.
Original comment by mikewie...@gmail.com
on 11 Nov 2010 at 12:41
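For illustration, a minimal, self-contained sketch of both workarounds against a single local node; the 5-second values, the key name, and the bounded polling loop are assumptions added for the example, not recommendations from this thread:
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public class TimeoutWorkarounds {
    public static void main(String[] args) throws Exception {
        // Workaround 2: raise the default operation timeout (milliseconds).
        MemcachedClient client = new MemcachedClient(
            new ConnectionFactoryBuilder().setOpTimeout(5000).build(),
            Arrays.asList(new InetSocketAddress("localhost", 11211)));

        Future<Object> future = client.asyncGet("some-key");

        // Workaround 1, bounded so a dead server cannot turn it into an infinite loop.
        long deadline = System.currentTimeMillis() + 5000;
        while (!future.isDone() && System.currentTimeMillis() < deadline) {
            Thread.sleep(1);
        }

        Object value = future.isDone() ? future.get(1, TimeUnit.SECONDS) : null;
        System.out.println("value = " + value);
        client.shutdown();
    }
}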
Increasing the timeout makes more sense than just avoiding it, but we do need
to understand exactly why this is happening occasionally in the first place.
Original comment by dsalli...@gmail.com
on 11 Nov 2010 at 1:27
Could this be related to excessive garbage collection and the fact that GC pauses the JVM during certain points of the garbage collection process?
Let's say your timeout on the future objects is less than the time spent in GC. When GC kicks in and pauses the JVM for more time than the timeout, that will cause all the active threads to time out.
What GC are you guys using? Has anyone tried to compare how spymemcached behaves with different GC collectors? Anyone tried the concurrent low-pause collector?
Original comment by S.Ille...@gmail.com
on 11 Nov 2010 at 10:11
Good call! I turned on -verbose:gc and found:
[GC 852538K->835568K(936640K), 0.0018248 secs]
[GC 852592K->835550K(936640K), 0.0018149 secs]
[GC 852574K->835611K(936640K), 0.0018358 secs]
[GC 852635K->835660K(936640K), 0.0020573 secs]
[GC 852684K->835639K(936640K), 0.0022069 secs]
[GC 852663K->835709K(936640K), 0.0020112 secs]
[GC 852733K->835754K(936640K), 0.0020629 secs]
[GC 842670K->835773K(936640K), 0.0019726 secs]
[Full GC 835773K->835580K(936640K), 1.8937889 secs]
Exception in thread "Thread-5" net.spy.memcached.OperationTimeoutException:
Timeout waiting for value
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
at org.sugis.memcache.MemcacheStress.runUnsafe(MemcacheStress.java:66)
at org.sugis.memcache.MemcacheStress.run(MemcacheStress.java:46)
at java.lang.Thread.run(Thread.java:680)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed
out waiting for operation - failing node: /127.0.0.1:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
... 4 more
Perhaps a good solution would be: on latch timeout, first check whether the operation has actually completed in the background before throwing the timeout exception?
Original comment by stevensc...@gmail.com
on 11 Nov 2010 at 7:29
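A rough caller-side sketch of that idea, assuming you wrap the future yourself; this is not how OperationFuture is actually implemented, just an illustration of the check:
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class CheckedGet {
    // Wait up to timeoutMillis; if the wait times out but the operation has
    // in fact completed in the background, retry the read briefly instead of
    // propagating the timeout straight away.
    public static Object get(Future<Object> future, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException timedOut) {
            if (future.isDone()) {
                return future.get(1, TimeUnit.MILLISECONDS);
            }
            throw timedOut;
        }
    }
}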
Note that this also still happens with -XX:+UseConcMarkSweepGC.
Original comment by stevensc...@gmail.com
on 11 Nov 2010 at 8:10
I think I have found the reason for the exception. When looking through the packet trace from a program that generates CheckedOperationTimeoutExceptions, I was able to see that each operation that threw the exception timed out due to TCP retransmission. On OS X the TCP retransmission timeout was 1 second, meaning any time there was packet loss or packet delay in the network it took at least a second for the packet to be retransmitted. Since the current default timeout in spymemcached is 1 second, we see exceptions.
A note on the post above also: I ran tests and tried to reproduce the error by checking whether the garbage collector was running, and did not see any excessive garbage collection or errors. I'm not saying that garbage collection couldn't cause the problem, but I think that there are other reasons for failure as well.
Original comment by mikewie...@gmail.com
on 13 Nov 2010 at 12:51
I suspect that this issue has nothing to do with the GC, other than that the GC
is slowing down processing on the machine so as to make this timing issue more
likely.
I suspect that it is related to the complete() method in MultiOperationCallback:
if (--remaining == 0) {
    originalCallback.receivedStatus(mostRecentStatus);
    originalCallback.complete();
}
I haven't verified it yet, but I suspect that remaining is less than zero. As I
mentioned above, the state of the Operation is always (as far as I've seen) set
to OperationState.COMPLETE. It is just that the latch is not released. I
believe that this is because the wrapping MultiOperationCallback does not pass
the complete call through to the decorated Callback.
Original comment by exort...@gmail.com
on 15 Nov 2010 at 6:55
I think that the issue is that --remaining is not thread-safe. If two threads
invoke complete() concurrently then the remaining variable will not be
decremented correctly. AtomicInteger should be used instead.
Original comment by exort...@gmail.com
on 15 Nov 2010 at 4:58
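A minimal sketch of that suggestion; the names mirror the snippet above, but this is a hypothetical illustration, not the library's actual MultiOperationCallback:
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper showing the thread-safe countdown idea.
class SafeMultiCallback {
    private final AtomicInteger remaining;
    private final Runnable originalComplete;

    SafeMultiCallback(int expectedCompletions, Runnable originalComplete) {
        this.remaining = new AtomicInteger(expectedCompletions);
        this.originalComplete = originalComplete;
    }

    // Each per-operation completion calls this; exactly one caller sees the
    // counter hit zero, even when completions race across threads.
    void complete() {
        if (remaining.decrementAndGet() == 0) {
            originalComplete.run();
        }
    }
}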
[deleted comment]
This seems to happen more often when accessing large objects.
I have a test app that uses 100 threads and gets 5 different objects randomly. If the objects are relatively small (10-30 KB) then I have no exceptions at all, but when I replaced one object with a big 1 MB string the exceptions started to occur.
This issue seems to be related to thread starvation, but I'm not sure whether it's caused by excessive GC, by spymemcached using too much CPU to deserialize objects, or by something else.
Original comment by S.Ille...@gmail.com
on 15 Nov 2010 at 5:33
The issues above seem valid and I am going to begin looking into them. exortech, S.Illes, could you guys verify that you are not seeing packet retransmission during these operations? I have a way to check the value of remaining throughout the execution of the program and I will report back on this. Also, I have been experiencing this problem on all operations and you guys just commented on multiget. Are you only seeing issues with multi-get?
Original comment by mikewie...@gmail.com
on 16 Nov 2010 at 5:25
This is an issue both for get and multiget.
Original comment by exort...@gmail.com
on 16 Nov 2010 at 5:57
I have an idea where the problem is and wrote a possible fix. I'll get with Mike and test it.
Is it possible Mike's speculation about a TCP retransmit is occurring?
Original comment by ingen...@gmail.com
on 16 Nov 2010 at 6:29
Mike, how do I check for TCP retransmission? I tried using Wireshark and I see lots of "MEMCACHE [TCP ACKed lost segment] Get Request" entries. I'm no TCP expert at all.
Also please note that in my case memcached runs on localhost.
Original comment by S.Ille...@gmail.com
on 16 Nov 2010 at 11:15
If you are running on localhost you will not see any retransmission errors. Retransmissions only take place over a lossy network, and since you are running on localhost this proves that there is a code problem within spymemcached, because communication between a client and server running on the same machine should be near instantaneous and have no packet loss.
I will take a look at exortech's fix and rerun my tests. It sounds like these retransmissions, while possibly problematic, are not the main issue.
Original comment by mikewie...@gmail.com
on 16 Nov 2010 at 6:49
just a quick update...
I have a good way to generate this, but I think we may need to focus a bit on
what the bug here is.
The specific bug I'm looking at right now is intermittent timeout when running
without a physical network. This is one source of timeouts which shouldn't
really exist. I think this can be addressed.
However, as Mike points out, there will be other sources for timeouts, so those
will need to be handled by the callers.
Also, the concern is that some people are reporting crashing/hanging app
servers. Are occasional timeouts causing this kind of issue? Or is it more
that there is a stream of them in a row and there is no recovery?
@ragnor84 why is the restart of the app server necessary?
I just want to be realistic: we can reduce these, but I don't think they can be eliminated.
Original comment by ingen...@gmail.com
on 22 Nov 2010 at 7:35
When the problem occurred, CPU load increased from 100% to 200% (8 cores) and never dropped below 100%, even after I stopped sending requests to the server. So something other than my business logic was generating 100% CPU load on the server.
Sometimes it looks like when the throughput is high (about 10,000-15,000 requests/s from one spymemcached client) and one timeout occurs, then soon a second, a third, and more timeouts occur - a domino effect. After a few seconds most requests get timed out, the application can't be used by users, and the only way to stop the timeouts (I've waited up to 30 minutes) is to stop all traffic, restart the server, and warm it up slowly.
Currently I can't reproduce it because, due to this issue, I've migrated to xmemcached.
I hope someone fixes this so I can use spymemcached again.
Original comment by ragno...@gmail.com
on 22 Nov 2010 at 10:24
[deleted comment]
[deleted comment]
final String key = "key";
final int SET_OPERATIONS = 50000;
for (int i = 0; i < SET_OPERATIONS; i++)
    memcachedClient.set(key + i, 10 * 60, Math.random());
assertNotNull(memcachedClient.get(key + (SET_OPERATIONS - 1)));
When memcachedClient.get(key) is called, a timeout exception is thrown.
[spymemcached 2.5]
Original comment by 6marti...@gmail.com
on 30 Nov 2010 at 9:09
@6martina6
Your test there is basically a garbage generator, since an object is created
for every request. Unless you've done some GC tuning, chances are high that
the timeout you see correlates to garbage collection.
Original comment by ingen...@gmail.com
on 7 Dec 2010 at 6:56
I'm sorry this issue has lived on so long. I will post some more tomorrow, but
the good news is that I've been able to make good progress and can document
what I believe to be the answers. I'll probably also ask some of you for help
in testing.
Just to summarize quickly, the way things had been implemented, if a timeout
occurred for any reason (JIT, GC, network retransmit) it was possible for more
operations to stack up which would also timeout, but they would still be
transmitted over the network. This meant when things got bad for a little bit,
they'd have to get really good for a while to have a chance at catching up.
I have a fix that will ensure this doesn't happen. I expect that in cases like
ragnor84's, we'll still see a big "bubble" of timeouts, but things will recover
very quickly.
The other thing I want to do is document a bit better how to set the default
timeout with a ConnectionFactoryBuilder. I also want to test and document a
few JVM params that could be tried. I may need help with this too.
None of this will insulate you from timeouts though. Running in a JVM on a
system (especially if it's virtualized) on a TCP network you can control a lot
of things, but most of them don't offer guarantees out of the box. In some
cases, you can really never get a complete guarantee.
That said, I believe it can be better so we'll make it better.
Original comment by ingen...@gmail.com
on 7 Dec 2010 at 7:10
Subsequent to my posting here the other day, we've been running my code changes
in a long term test. Looks good so far. Also, this exposed some issues with
existing tests that I'm in the process of cleaning up.
If anyone else wants to help test, please let me know.
Original comment by ingen...@gmail.com
on 9 Dec 2010 at 5:38
Good to see progress on this issue. I would like to help with testing.
Original comment by ilkinulas
on 12 Dec 2010 at 6:07
We are using spymemcached for a reasonably high volume website and are
experiencing this issue. In particular, comment 40 is eerily familiar.
We have 5 memcached servers and 9 clients. At seemingly random times, but usually under decent load (12-15 page requests per second per client, or roughly 200 memcached requests per second per client), CPU on all of the application servers increases quite noticeably, until 1 or more servers goes to 100% CPU and all application threads are hung waiting for spymemcached.
Strangely, with 1 or 2 servers completely crashing, the other servers often
seem to benefit, with decreased CPU usage and faster execution times.
After restarting the application on a failed server, we have to warm it up very
slowly, a few requests at a time, before reintroducing the server into the load
balancer. Otherwise we'll get hung requests almost immediately. Sometimes even
after "warming up" we'll get hung requests within 2-10 minutes. If the server
is stable for an hour or so, it will probably be fine for a day or two at
least, until the issue recurs on the same or another server.
If possible, my team would love to participate in testing the fix for this, as
our other options are fairly unattractive.
Original comment by brianchr...@gmail.com
on 14 Dec 2010 at 9:36
great there are some folks willing to check it out :)
I was trying to get something out today, but it'll take until tomorrow.
Mike has been running the one above long term, but when I wrote some tests I
found some others were broken, so it's been a steady process of getting good
tests and correcting existing ones. It turns out there were a number of tests
that probably should have been timing out, but weren't because there was no
real check.
@brian: the fix, I believe, could help with your situation. I've only had toy tests to try it with, though I did try it with a large number of threads on a system with many CPUs.
The only concern I have is that what you describe could also be associated with VM memory pressure. Have you tried logging GCs to see if you can correlate them?
More specifically, if under load you have higher latencies, you'd start seeing
more CPU usage in garbage collection time on the tenured spaces. Dedicating
more to the young/new spaces may help keep that down.
That said, I do think my fixes help with relieving the pressure if you start
getting into a bad situation, so you should recover more quickly.
Original comment by ingen...@gmail.com
on 17 Dec 2010 at 6:54
My changes have had one test in particular failing for some time. At first, I
thought my changes were causing it, but after some investigation and discussion
with Dustin, it appears that it's a real resurgence of issue 35.
I can, and will, still post a test version of the client, but I'll need to fix
this issue too since it may be another cause of the same symptom.
Original comment by ingen...@gmail.com
on 27 Dec 2010 at 4:05
Original issue reported on code.google.com by
S.Ille...@gmail.com
on 27 Apr 2010 at 4:30