The master/slave connection protocol does not handle certain corner cases that
arise when the master log does not contain the sequence numbers the slave needs.
Case 1: Master log starts at a higher sequence number than the slave needs.
1. Start up a master and a slave with service name "foo" and confirm they are
connected.
2. Stop the slave.
3. Perform one or more transactions on the master MySQL instance.
4. Stop the master, clear the THL log files, but *do not* clear the value in
tungsten_foo.
5. Restart the master. The master will begin its new log at a sequence number
higher than the slave's next required sequence number.
6. Restart the slave.
At this point, the slave will print the following:
$ trepctl status
...
pendingError : Event extraction failed: Client handshake failure: Client response validation failed: Client log has higher sequence number than master: client source ID=logos2 seqno=0 client epoch number=0
...
This message is incorrect. It should say that the master could not find the
sequence number requested by the slave. The following stack trace from the
master shows where the error arises:
INFO | jvm 1 | 2011/04/23 09:08:35 | com.continuent.tungsten.replicator.thl.THLException: Client log has higher sequence number than master: client source ID=logos2 seqno=0 client epoch number=0
INFO | jvm 1 | 2011/04/23 09:08:35 | at com.continuent.tungsten.replicator.thl.ConnectorHandler$LogValidator.validateResponse(ConnectorHandler.java:94)
INFO | jvm 1 | 2011/04/23 09:08:35 | at com.continuent.tungsten.replicator.thl.Protocol.serverHandshake(Protocol.java:216)
INFO | jvm 1 | 2011/04/23 09:08:35 | at com.continuent.tungsten.replicator.thl.ConnectorHandler.run(ConnectorHandler.java:179)
INFO | jvm 1 | 2011/04/23 09:08:35 | at java.lang.Thread.run(Thread.java:636)
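To make the distinction concrete, here is a minimal Java sketch of how a master-side check could tell "slave is ahead of the master" apart from "master log does not contain the slave's sequence number". It assumes the master knows the lowest and highest sequence numbers in its THL and must find the slave's last sequence number in its own log in order to compare epoch numbers. The class SeqnoCheck, its methods and the message text are hypothetical; this is not the actual ConnectorHandler$LogValidator code.

```java
// Hypothetical sketch: SeqnoCheck and its messages are illustrative only,
// not the Tungsten ConnectorHandler$LogValidator implementation.
final class SeqnoCheck {

    /** Outcomes of comparing the slave's last seqno against the master log. */
    enum Result { OK, CLIENT_AHEAD_OF_MASTER, SEQNO_NOT_IN_MASTER_LOG }

    /**
     * @param clientSeqno    last sequence number the slave reports having applied
     * @param masterMinSeqno lowest sequence number present in the master's THL
     * @param masterMaxSeqno highest sequence number present in the master's THL
     */
    static Result check(long clientSeqno, long masterMinSeqno, long masterMaxSeqno) {
        if (clientSeqno > masterMaxSeqno) {
            return Result.CLIENT_AHEAD_OF_MASTER;   // the only case the current message describes
        }
        if (clientSeqno < masterMinSeqno) {
            return Result.SEQNO_NOT_IN_MASTER_LOG;  // Case 1: master log starts too high
        }
        return Result.OK;                           // master can look up the seqno and check its epoch
    }

    /** Builds an error message that names the actual problem. */
    static String message(Result result, long clientSeqno, long masterMinSeqno, long masterMaxSeqno) {
        switch (result) {
            case CLIENT_AHEAD_OF_MASTER:
                return "Client log has higher sequence number than master: client seqno="
                        + clientSeqno + " master max seqno=" + masterMaxSeqno;
            case SEQNO_NOT_IN_MASTER_LOG:
                return "Master log does not contain requested sequence number: client seqno="
                        + clientSeqno + " master min seqno=" + masterMinSeqno;
            default:
                return "OK";
        }
    }
}
```

In the Case 1 scenario, a slave reporting seqno=0 against a master log that begins above 0 would be classified as SEQNO_NOT_IN_MASTER_LOG rather than CLIENT_AHEAD_OF_MASTER.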
Case 2: Master log starts at a higher sequence number than an uninitialized slave needs.
1. Create a new master and slave on service foo but do not start them.
2. Start the master only.
3. Perform one or more transactions on the master MySQL instance.
4. Stop the master, clear the THL log files, but *do not* clear the value in
tungsten_foo.
5. Restart the master. The master will begin its new log at a sequence number
higher than the slave's next required sequence number.
6. Start the slave.
In this case the slave hangs in the GOING-ONLINE:SYNCHRONIZING state. It keeps
trying to reconnect to the master without signaling an error. You must kill the
slave process, as it does not respond to 'trepctl offline'. On some systems the
JVM eventually runs out of file descriptors and prints a message like the
following:
INFO | jvm 1 | 2011/04/19 13:10:12 | WARNING: RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=52405] throws
INFO | jvm 1 | 2011/04/19 13:10:12 | java.net.SocketException: Too many open files
INFO | jvm 1 | 2011/04/19 13:10:12 | at java.net.PlainSocketImpl.socketAccept(Native Method)
INFO | jvm 1 | 2011/04/19 13:10:12 | at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
INFO | jvm 1 | 2011/04/19 13:10:12 | at java.net.ServerSocket.implAccept(ServerSocket.java:462)
INFO | jvm 1 | 2011/04/19 13:10:12 | at java.net.ServerSocket.accept(ServerSocket.java:430)
INFO | jvm 1 | 2011/04/19 13:10:12 | at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
INFO | jvm 1 | 2011/04/19 13:10:12 | at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
INFO | jvm 1 | 2011/04/19 13:10:12 | at java.lang.Thread.run(Thread.java:662)
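The descriptor exhaustion suggests that rejected connection attempts may not be releasing their sockets. As one illustration of the alternative behavior, the following Java sketch bounds the number of reconnect attempts, closes every connection whose handshake fails, and surfaces an error instead of retrying forever. BoundedReconnect, Connector and the retry parameters are hypothetical names chosen for this sketch; they are not the Tungsten replicator's connection code.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch: BoundedReconnect and Connector are illustrative names,
// not part of the Tungsten replicator API.
final class BoundedReconnect {

    /** Opens a connection to the master and validates it with a handshake. */
    interface Connector<C extends Closeable> {
        C open() throws IOException;               // e.g. create and connect a socket
        void handshake(C conn) throws IOException; // throws if the master rejects the slave
    }

    /**
     * Tries up to maxAttempts times, closing every connection whose handshake
     * fails, and throws instead of retrying forever.
     */
    static <C extends Closeable> C connect(Connector<C> connector, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            C conn = null;
            boolean connected = false;
            try {
                conn = connector.open();
                connector.handshake(conn);
                connected = true;
                return conn;                       // caller now owns the open connection
            } catch (IOException e) {
                lastFailure = e;                   // remember why this attempt failed
            } finally {
                if (!connected && conn != null) {
                    closeQuietly(conn);            // release the descriptor on any failure
                }
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);         // interruptible, so a shutdown can stop the loop
            }
        }
        throw new IOException("Gave up connecting to master after " + maxAttempts + " attempts",
                lastFailure);
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // best effort: the handshake failure is the one worth reporting
        }
    }
}
```

A caller could report the final IOException (for example as the slave's pending error) rather than hanging in GOING-ONLINE:SYNCHRONIZING. Because Thread.sleep is interruptible, a request that interrupts the connecting thread would also end the loop between attempts.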
Original issue reported on code.google.com by berkeley...@gmail.com on 23 Apr 2011 at 4:31