mfp / oraft

Raft consensus algorithm implementation

Bug with the test application #8

Closed · vbmithr closed this 5 years ago

vbmithr commented 5 years ago

Hey @mfp !

I just ran some tests with the dict application, launching 3 instances (2 connected to the same one), and observed that when I kill one, the network sometimes does not recover from this event: it fails to elect a new leader, and so on.

Did you notice such an issue?

mfp commented 5 years ago

On Thu, Nov 08, 2018 at 10:03:31AM +0000, Vincent Bernardoff wrote:

Hey @mfp !

I just ran some tests with the dict application, launching 3 instances (2 connected to the same one), and observed that when I kill one, the network sometimes does not recover from this event: it fails to elect a new leader, and so on.

Did you notice such an issue?

I haven't used the dict example in anger, but I did perform extensive discrete-event simulation of the core Raft state machine, simulating all kinds of network failures across millions of operations, so hopefully the problem lies in the "superficial" code in dict that performs operation (de)serialization, or in the outer layers of the networking code. I will take a look this weekend.

-- Mauricio Fernández
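
As an illustration of the testing approach described above, here is a minimal sketch of what a discrete-event simulation harness can look like; the `event` type, `run`, and `handle` are hypothetical names for illustration, not oraft's actual test code:

```ocaml
(* Hypothetical discrete-event simulation skeleton: pending events are
   kept in a map keyed by virtual time, so message deliveries, timeouts
   and crashes can be injected and replayed deterministically, with no
   real sockets or clocks involved. *)
type event =
  | Deliver of int * string  (* destination node, serialized message *)
  | Timeout of int           (* node whose timer fired *)
  | Crash of int             (* node to kill *)

module Q = Map.Make (Float)

let run ~steps ~handle =
  let queue = ref Q.empty in
  let schedule t ev =
    queue :=
      Q.update t
        (function None -> Some [ ev ] | Some evs -> Some (ev :: evs))
        !queue
  in
  schedule 0.0 (Timeout 0);
  for _ = 1 to steps do
    match Q.min_binding_opt !queue with
    | None -> ()
    | Some (t, evs) ->
        queue := Q.remove t !queue;
        List.iter (fun ev -> handle ~schedule ~now:t ev) evs
  done
```

Because the `handle` callback controls exactly when and whether each event is delivered, a harness like this can drop, delay, or reorder messages at will and check the state machine's invariants after every step.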

vbmithr commented 5 years ago

I'm also having a look. One example of a bug:

  • Launch one server
  • Launch a second instance connecting to the first
  • Kill the second instance

Result: the first server loops connecting to the dead server and never gets elected.

mfp commented 5 years ago

On Thu, Nov 08, 2018 at 05:35:21AM -0800, Vincent Bernardoff wrote:

I'm also having a look. One example of a bug:

  • Launch one server
  • Launch a second instance connecting to the first
  • Kill the second instance

Result: the first server loops connecting to the dead server and never gets elected.

Well, you cannot have quorum (for election or to persist operations) if you have only two instances and one is dead :)

However, in your first message you described a 3-node cluster being unable to elect a new leader with 2 live nodes, which a priori could be a legitimate bug.

-- Mauricio Fernández
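
For illustration, the majority-quorum arithmetic behind this point, as a minimal OCaml sketch (not oraft's actual API):

```ocaml
(* Majority quorum for an n-node cluster: strictly more than half.
   With n = 2 and one node dead, the 1 live node is below the quorum
   of 2, so no leader can ever be elected, matching the behaviour
   observed above. *)
let quorum n = n / 2 + 1

let () =
  List.iter
    (fun (n, live) ->
       Printf.printf "cluster=%d live=%d quorum=%d can_elect=%b\n"
         n live (quorum n) (live >= quorum n))
    [ (2, 1); (3, 2); (5, 3) ]
```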

vbmithr commented 5 years ago

On 11/8/18 4:03 PM, mfp wrote:

Well, you cannot have quorum (for election or to persist operations) if you have only two instances and one is dead :)

Fair enough. What about this bug?

2 servers:

  • 1 connected to nothing
  • 1 connected to the first

When I kill #1, #2 does not try to reconnect to #1, whereas the symmetric case does not show this problem.

mfp commented 5 years ago

On Thu, Nov 08, 2018 at 03:11:26PM +0000, Vincent Bernardoff wrote:

On 11/8/18 4:03 PM, mfp wrote:

Well, you cannot have quorum (for election or to persist operations) if you have only two instances and one is dead :)

Fair enough. What about this bug?

2 servers:

  • 1 connected to nothing
  • 1 connected to the first

When I kill #1, #2 does not try to reconnect to #1, whereas the symmetric case does not show this problem.

I don't have the full Raft consensus algorithm in mind at the moment, but IIRC there is an asymmetry right there in the algorithm: only the leader periodically sends Ping (heartbeat) messages, while followers try to trigger a new election after a different (election) timeout.

These timeout events are triggered in Oraft_lwt, which is not tested with DES, so there could be a bug there; that would actually be good news compared to a bug in the core algorithm, which would be way trickier. Even though the Raft paper reads very easily, it is easy to introduce bugs in the core, which is why I tested it as comprehensively as possible via DES. I did catch several bugs that way, and I remember referring to other Raft implementations and finding they failed to address the cases I discovered.

-- Mauricio Fernández
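
To make the asymmetry concrete, here is a rough sketch of the two timer loops; `send_heartbeat`, `start_election`, `heartbeat_seen`, and the timeout values are illustrative assumptions (oraft drives these timers from Oraft_lwt rather than blocking Unix calls):

```ocaml
(* Sketch of the leader/follower timer asymmetry in Raft: only the
   leader actively pings; a follower just waits, and starts an election
   if its (randomized) election timeout elapses with no heartbeat. *)
type role = Leader | Follower

let heartbeat_period = 0.05                            (* leader pings every 50 ms *)
let election_timeout () = 0.150 +. Random.float 0.150  (* randomized 150-300 ms *)

let rec timer_loop role ~send_heartbeat ~start_election ~heartbeat_seen =
  match role with
  | Leader ->
      send_heartbeat ();
      Unix.sleepf heartbeat_period;
      timer_loop Leader ~send_heartbeat ~start_election ~heartbeat_seen
  | Follower ->
      Unix.sleepf (election_timeout ());
      if heartbeat_seen () then
        timer_loop Follower ~send_heartbeat ~start_election ~heartbeat_seen
      else start_election ()
```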

vbmithr commented 5 years ago

OK, I managed to do what I wanted with the code. The issue was that dict tries to connect to the peer given in --join, so if that peer is not online at that time, the program stops immediately. I replaced --join with --peers, where one can specify the network config; see my PR (#7).
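
For context, a minimal sketch of the retry behaviour such a change enables; `connect` is a hypothetical stand-in for dict's actual connection code:

```ocaml
(* Hypothetical sketch: retry a refused connection with a fixed delay
   instead of aborting at startup, so a peer listed in the network
   config may come online later. *)
let rec connect_with_retry ~connect ~delay addr =
  match connect addr with
  | conn -> conn
  | exception Unix.Unix_error (Unix.ECONNREFUSED, _, _) ->
      Unix.sleepf delay;
      connect_with_retry ~connect ~delay addr
```

With the old --join semantics, the equivalent code would simply let the ECONNREFUSED exception escape and terminate the program at startup.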