Tungsten slaves react poorly when they cannot obtain needed sequence number from master

The master/slave connection protocol does not handle certain corner cases that 
arise when the master log does not contain values needed by the slave.  

Case 1:  Master log starts at higher value than that needed by slave. 

1. Start up a master and a slave with service name "foo" and confirm they are 
connected. 
2. Stop the slave. 
3. Perform one or more transactions on the master MySQL instance. 
4. Stop the master, clear the THL log files, but *do not* clear the value in 
tungsten_foo. 
5. Restart the master.  The master will start numbering its log higher than the 
slave's next required sequence number.  
6. Restart the slave.  

At this point, the slave will print the following: 

$ trepctl status
...
pendingError           : Event extraction failed: Client handshake failure: Client response validation failed: Client log has higher sequence number than master: client source ID=logos2 seqno=0 client epoch number=0
...

This message is misleading: the slave's log is not ahead of the master's.  It 
should say that the master could not find the requested sequence number.  The 
following stack trace shows where the error arises on the master. 

INFO   | jvm 1    | 2011/04/23 09:08:35 | com.continuent.tungsten.replicator.thl.THLException: Client log has higher sequence number than master: client source ID=logos2 seqno=0 client epoch number=0
INFO   | jvm 1    | 2011/04/23 09:08:35 |   at com.continuent.tungsten.replicator.thl.ConnectorHandler$LogValidator.validateResponse(ConnectorHandler.java:94)
INFO   | jvm 1    | 2011/04/23 09:08:35 |   at com.continuent.tungsten.replicator.thl.Protocol.serverHandshake(Protocol.java:216)
INFO   | jvm 1    | 2011/04/23 09:08:35 |   at com.continuent.tungsten.replicator.thl.ConnectorHandler.run(ConnectorHandler.java:179)
INFO   | jvm 1    | 2011/04/23 09:08:35 |   at java.lang.Thread.run(Thread.java:636)
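
For illustration only, a handshake check that separates "slave is ahead of the master" from "master log no longer contains the requested sequence number" might look like the sketch below. This is not the actual ConnectorHandler code. The class, enum, and parameter names are hypothetical, and the sketch assumes the master knows the lowest and highest sequence numbers in its log and that the slave reports the last sequence number it stored.

```java
// Hypothetical sketch; none of these names come from the Tungsten code base.
final class HandshakeCheck {
    enum Result { OK, CLIENT_AHEAD_OF_MASTER, SEQNO_NOT_IN_MASTER_LOG }

    /**
     * Classifies a slave handshake against the range of the master log.
     *
     * @param clientLastSeqno last sequence number the slave reports having stored
     * @param masterMinSeqno  lowest sequence number still present in the master log
     * @param masterMaxSeqno  highest sequence number present in the master log
     */
    static Result classify(long clientLastSeqno, long masterMinSeqno, long masterMaxSeqno) {
        long nextNeeded = clientLastSeqno + 1;
        if (clientLastSeqno > masterMaxSeqno) {
            // The slave really does hold events the master has never logged.
            return Result.CLIENT_AHEAD_OF_MASTER;
        }
        if (nextNeeded < masterMinSeqno) {
            // The master log starts after the event the slave needs next.
            return Result.SEQNO_NOT_IN_MASTER_LOG;
        }
        return Result.OK;
    }

    /** Builds an error message that matches the classified condition. */
    static String describe(Result result, long clientLastSeqno, long masterMinSeqno) {
        switch (result) {
            case CLIENT_AHEAD_OF_MASTER:
                return "Client log has higher sequence number than master: seqno=" + clientLastSeqno;
            case SEQNO_NOT_IN_MASTER_LOG:
                return "Master log does not contain requested sequence number: requested="
                        + (clientLastSeqno + 1) + " master log starts at=" + masterMinSeqno;
            default:
                return "OK";
        }
    }
}
```

With the values from Case 1 (slave at seqno 0, master log starting at some higher value such as 5), `classify(0, 5, 9)` would yield `SEQNO_NOT_IN_MASTER_LOG` rather than the "client log has higher sequence number" condition.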

Case 2: Master starts at higher value than uninitialized slave. 

1. Create a new master and slave on service foo but do not start them. 
2. Start the master only.  
3. Perform one or more transactions on the master MySQL instance. 
4. Stop the master, clear the THL log files, but *do not* clear the value in 
tungsten_foo. 
5. Restart the master.  The master will start numbering its log higher than the 
slave's next required sequence number.  
6. Start the slave.  

In this case the slave hangs in the GOING-ONLINE:SYNCHRONIZING state.  It keeps 
trying to reconnect to the master without ever signaling an error.  You must 
kill the slave process because it does not respond to 'trepctl offline'.  On some 
systems the JVM will run out of file descriptors and print a message like the 
following: 

INFO   | jvm 1    | 2011/04/19 13:10:12 | WARNING: RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=52405] throws
INFO   | jvm 1    | 2011/04/19 13:10:12 | java.net.SocketException: Too many open files
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at java.net.PlainSocketImpl.socketAccept(Native Method)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at java.net.ServerSocket.implAccept(ServerSocket.java:462)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at java.net.ServerSocket.accept(ServerSocket.java:430)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
INFO   | jvm 1    | 2011/04/19 13:10:12 |       at java.lang.Thread.run(Thread.java:662)
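
One possible shape for a fix, shown only as a sketch, is a reconnect loop that gives up after a bounded number of attempts, raises an error the slave can report, and honors thread interruption so that an offline request can cancel it. The class and method names below are hypothetical and do not come from the Tungsten code base. The `attempt` callback stands in for whatever the slave does to open and validate a connection, and it is assumed to close its own socket on failure so that retries do not leak file descriptors.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a bounded, cancellable reconnect loop.
final class BoundedReconnect {
    /**
     * Calls attempt until it returns true or maxAttempts is reached.
     *
     * @return the number of attempts used when a connection succeeds
     * @throws IllegalStateException if every attempt fails
     * @throws InterruptedException  if the calling thread is interrupted
     */
    static int connect(BooleanSupplier attempt, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int i = 1; i <= maxAttempts; i++) {
            // Allow an external request (for example going offline) to stop the loop.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("Reconnect cancelled");
            }
            if (attempt.getAsBoolean()) {
                return i;
            }
            if (i < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        // Surface the failure instead of retrying silently forever.
        throw new IllegalStateException(
                "Unable to connect to master after " + maxAttempts + " attempts");
    }
}
```

A caller could catch the `IllegalStateException` and record it as the pending error, instead of staying in GOING-ONLINE:SYNCHRONIZING indefinitely.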

Original issue reported on code.google.com by berkeley...@gmail.com on 23 Apr 2011 at 4:31