mgalushka / spymemcached

Automatically exported from code.google.com/p/spymemcached

Failover did not work when 1 of 4 memcached was down #251

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What version of the product are you using? 2.7, but the same problem also 
occurs after updating to 2.8.1.
On what operating system? Client: Linux. Memcached server: version 1.2.2 on 
Linux.

One of our four memcached servers stopped responding, but the automatic 
failover to the other servers did not work.

The memcached server still accepted socket connections, but never returned a 
result over those connections.

Failover works correctly if the memcached server is down, but it does not work 
if the server accepts connections and then never returns a response.

This issue can be reproduced by running a simple "AcceptingServer" which 
simulates the failed memcached:

-- Start snippet ---------------------------------------------
import java.net.ServerSocket;

// Simulates a hung memcached: accepts TCP connections but never reads or replies.
public class AcceptingServer {

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(11212);

        while (true) {
            // The accepted socket is deliberately ignored, so clients get no response.
            server.accept();
            System.out.println("Accepted one connection");
        }
    }
}
-- End snippet ---------------------------------------------

The following class uses a list of memcached servers. One server in this list 
is the "AcceptingServer":

-- Start snippet ---------------------------------------------
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

import net.spy.memcached.MemcachedClient;

public class MemcachedStandaloneTest {

    public static void main(String[] args) throws Exception {
        List<InetSocketAddress> addressList = new ArrayList<InetSocketAddress>();
        // A real memcached server.
        addressList.add(new InetSocketAddress("memcached-hostname", 11211));
        // The AcceptingServer: accepts connections but never answers.
        addressList.add(new InetSocketAddress("localhost", 11212));
        MemcachedClient client = new MemcachedClient(addressList);

        while (true) {
            try {
                System.out.println("Value 1: " + client.get("key1"));
            } catch (Exception e) {
                System.out.println("Exception when getting value 1: " + e.getMessage());
                e.printStackTrace();
            }
            try {
                System.out.println("Value 2: " + client.get("key2"));
            } catch (Exception e) {
                System.out.println("Exception when getting value 2: " + e.getMessage());
                e.printStackTrace();
            }
            Thread.sleep(5000);
            // Expected to list localhost:11212 eventually; it stays empty.
            System.out.println("Unavailable servers: " + client.getUnavailableServers());
        }
    }
}
-- End snippet ---------------------------------------------

The program "MemcachedStandaloneTest" never notices that one of the two 
memcached servers is not working, even if it runs for several minutes. It 
constantly produces the following log output:

Value 1: null
Exception when getting value 2: Timeout waiting for value
net.spy.memcached.OperationTimeoutException: Timeout waiting for value
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1003)
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1018)
    at com.unitedinternet.cloudintegration.cache.memcached.MemcachedStandaloneTest.main(MemcachedStandaloneTest.java:24)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed 
out waiting for operation - failing node: localhost/127.0.0.1:11212
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:93)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62)
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:997)
    ... 2 more
Unavailable servers: []

Original issue reported on code.google.com by uwestah...@gmail.com on 22 May 2012 at 2:49

GoogleCodeExporter commented 9 years ago
There is a continuous timeout threshold that should cause the node to be seen 
as down, after which the client tries to reestablish the connection. In this 
case, reestablishing the connection succeeds and the timeouts start all over 
again. The normal case for data going nowhere is a server that crashes or 
unexpectedly loses power.

If you're using redistribute and change your AcceptingServer to accept the 
connection only once, do you see the correct behavior? The default is classic 
modulus hashing, not Ketama hashing with redistribute. Look at the 
ConnectionFactoryBuilder as a way to set different parameters, including the 
timeout threshold.
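
For reference, a minimal sketch of such a configuration. It assumes the 2.8.x 
API (in older releases the Ketama constant may live on HashAlgorithm rather 
than DefaultHashAlgorithm). The threshold and timeout values are purely 
illustrative, and the server list is the one from the test program above:

```java
import java.io.IOException;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactory;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.DefaultHashAlgorithm;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedClient;

public class RedistributingClientExample {

    public static void main(String[] args) throws IOException {
        ConnectionFactory factory = new ConnectionFactoryBuilder()
            // Ketama consistent hashing instead of the default modulus locator.
            .setLocatorType(ConnectionFactoryBuilder.Locator.CONSISTENT)
            .setHashAlg(DefaultHashAlgorithm.KETAMA_HASH)
            // Send operations for a failed node to the remaining nodes.
            .setFailureMode(FailureMode.Redistribute)
            // Illustrative values only: continuous timeouts tolerated before
            // the connection is treated as failed, and the per-operation
            // timeout in milliseconds.
            .setTimeoutExceptionThreshold(5)
            .setOpTimeout(1000)
            .build();

        MemcachedClient client = new MemcachedClient(factory,
            AddrUtil.getAddresses("memcached-hostname:11211 localhost:11212"));
        try {
            System.out.println("Value 2: " + client.get("key2"));
        } finally {
            client.shutdown();
        }
    }
}
```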

I'm pretty confident in this functionality, as I re-tested it recently.

Original comment by ingen...@gmail.com on 22 May 2012 at 3:57

GoogleCodeExporter commented 9 years ago
Let me try to explain our issue in another way:

Last week, our production system went down.
We are using four memcached servers. The production system did not work for 
several hours, until one of the memcached servers (idm-sessmw04) was restarted.
We constantly got the following exception:

net.spy.memcached.OperationTimeoutException: Timeout waiting for value
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:924)
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:939)
      ...
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed 
out waiting for operation - failing node: 
idm-sessmw04.mydomain.de/172.123.123.123:11211
      at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:65)
      at net.spy.memcached.internal.GetFuture.get(GetFuture.java:37)
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:917)
      ...

It was possible to establish a telnet connection to port 11211 of idm-sessmw04 
while the problem was occurring.
I therefore assume that the memcached on idm-sessmw04 accepted connections, 
but never returned any response.
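
A successful telnet connection only shows that the port accepts connections. 
A check that also waits for an answer would tell the two states apart. The 
following is a rough sketch using plain sockets and the memcached text 
protocol "version" command. The class name MemcachedProbe, the method name and 
the timeout values are made up for illustration and are not part of 
spymemcached:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class MemcachedProbe {

    /**
     * Returns true only if the server answers the text-protocol "version"
     * command within timeoutMillis. A server that merely accepts the
     * connection and then stays silent yields false.
     */
    public static boolean responds(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // bound the blocking read below

            OutputStream out = socket.getOutputStream();
            out.write("version\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line = in.readLine();
            return line != null && line.startsWith("VERSION");
        } catch (IOException e) {
            // Connection refused, read timeout (SocketTimeoutException) or reset.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 11211;
        System.out.println(host + ":" + port + " responds: " + responds(host, port, 2000));
    }
}
```

Pointed at the AcceptingServer on port 11212, this probe should report false 
after the read timeout, even though the connection itself is established.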

The problem was that the failover of spymemcached did not work. The fourth 
memcached did not respond for hours, and spymemcached never marked that 
memcached as unavailable.

I was not able to find out the reason for this strange behaviour of memcached, 
so I wrote the class AcceptingServer, which simulates the behaviour of the 
memcached as I observed it.
The class MemcachedStandaloneTest then demonstrates that the failover does not 
work if the memcached accepts connections but never returns a response, so 
that every operation times out.

Currently, spymemcached only adds those hosts to the list of unavailable 
servers which do not accept connections. 
I would suggest that spymemcached also add those hosts to the list of 
unavailable servers that accept connections, but never return a response.
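
To make the suggestion concrete, here is a small standalone sketch of the 
bookkeeping I have in mind. It is not spymemcached code: the class 
TimeoutTracker and all of its methods are hypothetical. The idea is that a 
node becomes unavailable after a number of consecutive timeouts, and that a 
successful reconnect alone does not clear this state. Only a real response 
does.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of spymemcached.
public class TimeoutTracker {

    private final int threshold;
    private final Map<String, AtomicInteger> consecutiveTimeouts = new ConcurrentHashMap<>();
    private final Set<String> unavailable = ConcurrentHashMap.newKeySet();

    public TimeoutTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Records one operation timeout; marks the node once the threshold is reached. */
    public void onTimeout(String node) {
        int count = consecutiveTimeouts
            .computeIfAbsent(node, n -> new AtomicInteger())
            .incrementAndGet();
        if (count >= threshold) {
            unavailable.add(node);
        }
    }

    /** Deliberately a no-op: an accepted TCP connection proves nothing. */
    public void onReconnect(String node) {
    }

    /** Only an actual response resets the counter and the unavailable flag. */
    public void onResponse(String node) {
        consecutiveTimeouts.remove(node);
        unavailable.remove(node);
    }

    public boolean isAvailable(String node) {
        return !unavailable.contains(node);
    }

    /** Sorted snapshot of the nodes currently considered unavailable. */
    public Set<String> getUnavailable() {
        return new TreeSet<>(unavailable);
    }

    public static void main(String[] args) {
        TimeoutTracker tracker = new TimeoutTracker(3);
        String silentNode = "localhost:11212";
        for (int i = 0; i < 3; i++) {
            tracker.onTimeout(silentNode);
        }
        tracker.onReconnect(silentNode); // the AcceptingServer accepts again
        System.out.println("Unavailable servers: " + tracker.getUnavailable());
        tracker.onResponse(silentNode);  // a real answer finally arrives
        System.out.println("Unavailable servers: " + tracker.getUnavailable());
    }
}
```

With a threshold of 3, the first line printed lists localhost:11212 and the 
second one is empty again.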

Original comment by uwestah...@gmail.com on 23 May 2012 at 5:54