Closed: GoogleCodeExporter closed this issue 9 years ago.
Added: node 192.168.1.10 is still running but is not receiving any load.
When I call http://192.168.1.10:8091/pools, I get back:
{"pools":[{"name":"default","uri":"/pools/default","streamingUri":"/poolsStreaming/default"}],"isAdminCreds":false,"uuid":"581cefce-bea5-4001-15e8-ad67000000ea","implementationVersion":"1.7.0","componentsVersion":{"os_mon":"2.2.5","mnesia":"4.4.17","inets":"5.5.2","kernel":"2.14.3","sasl":"2.1.9.3","ns_server":"1.7.0","stdlib":"1.17.3"}}
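For reference, a top-level string field can be pulled out of that response without a JSON library; a minimal, dependency-free sketch (the `extractField` helper is illustrative only, not part of any client API, and the constant below is a trimmed copy of the response above):

```java
// Sketch: extract a top-level string field from the /pools response shown
// above, using plain string searching. A real client would use a JSON parser.
public class PoolsResponse {
    static final String POOLS_JSON =
        "{\"pools\":[{\"name\":\"default\",\"uri\":\"/pools/default\","
      + "\"streamingUri\":\"/poolsStreaming/default\"}],"
      + "\"isAdminCreds\":false,"
      + "\"implementationVersion\":\"1.7.0\"}";

    // Returns the value of a top-level string field, or null if absent.
    static String extractField(String json, String field) {
        String needle = "\"" + field + "\":\"";
        int start = json.indexOf(needle);
        if (start < 0) return null;
        start += needle.length();
        int end = json.indexOf('"', start);
        return end < 0 ? null : json.substring(start, end);
    }

    public static void main(String[] args) {
        // Prints 1.7.0, matching the implementationVersion in the response.
        System.out.println(extractField(POOLS_JSON, "implementationVersion"));
    }
}
```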
Original comment by bouri...@gmail.com
on 16 Jun 2011 at 6:06
My issue is similar to Issue #108 and Issue #180.
Original comment by bouri...@gmail.com
on 16 Jun 2011 at 6:16
The way Membase works, the failure of a node will cause errors at the client
level. Are you saying you cannot get any operations through to any other node
of the cluster?
Do things recover when you hit the "failover" button in Membase's web UI?
If you believe there is an issue here, please post a test that demonstrates
what you think is wrong.
Original comment by ingen...@gmail.com
on 23 Jun 2011 at 5:45
"Failover" in the Membase UI works as expected, and I handle it fine on the
client side.
Shutting a node down manually from the terminal (e.g. $ kill or $ sudo
/etc/init.d/membase-server stop) crashes the client.
My cluster setup is described in the issue body above. To reproduce the
failure you can run code similar to this:
import net.spy.memcached.AddrUtil;
import net.spy.memcached.BinaryConnectionFactory;
import net.spy.memcached.MemcachedClient;

import java.io.IOException;

public class memcache_test2 {
    public static void main(String[] args) throws IOException {
        MemcachedClient c = new MemcachedClient(
                new BinaryConnectionFactory(),
                AddrUtil.getAddresses("192.168.1.9:11211 192.168.1.10:11211"));
        String result;
        for (int j = 0; j < 100; j++) {
            for (int i = 0; i < 100000; i++) {
                c.set("hello" + i, 0, "world" + i);
                result = (String) c.get("hello" + i);
            }
        }
    }
}
While the code is running, shut down any node from the terminal ($ sudo
/etc/init.d/membase-server stop) and you will get an exception that ruins
everything. If you instead do a "failover" via the Membase UI, spymemcached
detects the node failure properly and acts as expected (it keeps trying the
failed node for a while and then switches to the live node).
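One client-side mitigation for the "exception that ruins everything" is to wrap each operation in a small retry guard instead of letting a single node-failure exception abort the whole loop; a minimal sketch (the `withRetries` helper and retry count are illustrative, not part of spymemcached's API):

```java
import java.util.function.Supplier;

// Sketch: retry an operation a few times before giving up, so one
// node-failure exception does not abort the whole test run.
public class RetryGuard {
    // Runs op up to maxAttempts times (maxAttempts >= 1); rethrows the
    // last failure if every attempt fails.
    static <T> T withRetries(Supplier<T> op, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // node may be down; real code would back off here
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulated flaky operation: fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("node down");
            return "world42";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the test loop above, the set/get pair would go inside such a guard so a dying node costs retries rather than the whole run.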
p.s. Discussion about the same issue:
http://www.couchbase.org/forums/thread/any-good-example-java-code-handles-node-fault#comment-1003508
Original comment by bouri...@gmail.com
on 23 Jun 2011 at 5:57
>The way Membase works, the failure of a node will cause errors at the client
level. Are you saying you cannot get any operations through to any other node
of the cluster?
Yes. After manually shutting down a random node, the client cannot access any
other nodes.
>Do things recover when you hit the "failover" button in Membase's web UI?
Via the UI everything is OK; failover via the UI works fine.
>If you believe there is an issue here, please post a test that demonstrates
what you think is wrong.
Already posted.
Original comment by bouri...@gmail.com
on 23 Jun 2011 at 7:12
Any suggestions for this issue?
Original comment by bouri...@gmail.com
on 27 Jun 2011 at 4:31
I'm reopening this for further investigation so it doesn't get lost.
Original comment by dsalli...@gmail.com
on 27 Jun 2011 at 8:14
This issue has been addressed in 2.7.2. The problem was that the client
wouldn't get an updated configuration.
Original comment by ingen...@gmail.com
on 14 Oct 2011 at 7:19
And where is the example?
Original comment by bouri...@gmail.com
on 14 Oct 2011 at 7:42
Well, this is an issue tracking system, not a FAQ system. :)
In 2.7.2, I've added a test and verified that if the list of URIs has
down/dead nodes in it, the client will still find a live node and configure
itself to do the right thing. If the cluster topology changes, it then adjusts
to the new topology and does the right thing.
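The bootstrap behaviour described above can be sketched as walking the URI list until one node answers; a minimal illustration (the liveness check is injected to keep the example self-contained; a real client would attempt an HTTP GET of /pools on each URI, and the addresses are just the ones from this thread):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch: given a bootstrap URI list that may contain dead nodes, pick the
// first one that responds, as the 2.7.2 fix is described to do.
public class BootstrapPicker {
    // Returns the first URI the liveness check accepts, or null if none do.
    static String firstLiveNode(List<String> uris, Predicate<String> isAlive) {
        for (String uri : uris) {
            if (isAlive.test(uri)) {
                return uri; // configure the client from this node
            }
        }
        return null; // no node reachable
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
            "http://192.168.1.9:8091/pools",   // pretend this node is down
            "http://192.168.1.10:8091/pools"); // pretend this node is up
        String chosen = firstLiveNode(uris, uri -> uri.contains("1.10"));
        System.out.println("bootstrapping from " + chosen);
    }
}
```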
That said, I just found that that commit was forgotten. I'll need to fix that.
Here comes 2.7.3.
You can see the change here:
http://review.couchbase.org/#change,10026
Original comment by ingen...@gmail.com
on 14 Oct 2011 at 7:53
That change has now been committed and is in the 2.7.3 release.
Original comment by ingen...@gmail.com
on 15 Oct 2011 at 3:07
Original issue reported on code.google.com by
bouri...@gmail.com
on 16 Jun 2011 at 6:04