We found a bug that causes a stack overflow on the leader when a lagging follower tries to recover. The overflow seems to occur within the recursive function “restore_from_log” (Shim.ml), while a very large packet is being constructed and before the leader actually tries to send it.
This problem can be reproduced through the following process:
a) start 3 servers;
b) execute one client request;
c) stop a follower server;
d) execute many client requests (in our tests, at least 521,932 requests);
e) restart the server that was stopped.
Here’s a sample output produced by the leader when it crashes:
[Term 1] Sending 50 entries to 2 (currently have 521932 entries), commitIndex=521882
[Term 1] Sending 521881 entries to 3 (currently have 521932 entries), commitIndex=521882
[Term 1] Received AppendEntriesReply 50 entries true, commitIndex 521883
Fatal error: exception Stack overflow
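For illustration, here is a minimal OCaml sketch of the kind of recursion that can overflow the stack when it walks hundreds of thousands of entries, next to a tail-recursive rewrite. It is not the actual code of “restore_from_log”: the `entry` type and the `collect_naive`, `collect`, and `source` functions are hypothetical names made up for this example, and we have not confirmed that the real function has this shape.

```ocaml
(* Hypothetical stand-in for a log entry; not the type used in Shim.ml. *)
type entry = { index : int; payload : string }

(* Non-tail-recursive: the cons happens after the recursive call returns,
   so every entry costs one stack frame. With a large enough log this
   can end in "Stack overflow". *)
let rec collect_naive (next : unit -> entry option) : entry list =
  match next () with
  | None -> []
  | Some e -> e :: collect_naive next

(* Tail-recursive: entries accumulate in reverse order on the heap and
   are reversed once at the end, so the stack depth stays constant. *)
let collect (next : unit -> entry option) : entry list =
  let rec go acc =
    match next () with
    | None -> List.rev acc
    | Some e -> go (e :: acc)
  in
  go []

(* Hypothetical entry source: yields n entries, then None. *)
let source (n : int) : unit -> entry option =
  let i = ref 0 in
  fun () ->
    if !i >= n then None
    else begin
      incr i;
      Some { index = !i; payload = "" }
    end
```

Under these assumptions, `collect (source 600_000)` should complete with constant stack usage, whereas `collect_naive` on the same input may or may not overflow depending on the OCaml version and the stack limit of the process.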
From @pfons on April 13, 2016 0:11
Copied from original issue: uwplse/verdi#37