tikv / raft-rs

Raft distributed consensus algorithm implemented in Rust.
Apache License 2.0
2.86k stars 391 forks source link

Initialization with term=0 shouldn't be allowed #511

Open GuyLewin opened 1 year ago

GuyLewin commented 1 year ago

Describe the bug Nodes often crashing (panic) during first few moments of Raft, when using distinct priority values per node together with pre_vote = true. The least-prioritized node never crashes.

When I looked into it, I found that it's because during the pre-voting, we reject pre-voting requests from lower-priority nodes: https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L1467C65-L1467C98

We then try to send the rejection back to the proposing node, including our term as the message's term: https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L1494C21-L1498C58

But during that send, term is checked to be != 0, resulting in a panic instead of sending the proposal rejection: https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L617C9-L640C14

Expected behavior PreVoting rejection and not panic().

BusyJay commented 1 year ago

Then term should be assigned to vote response.

GuyLewin commented 1 year ago

Then term should be assigned to vote response.

It is assigned, but the term is still 0 when the node initializes, unless I misunderstood what you were referring to. I can attach the full backtrace if that helps, but the fatal is being called in this scenario pretty consistently.

BusyJay commented 1 year ago

I see. When the raft group is first initialized, it should not have term 0 (INVALID_TERM).

GuyLewin commented 1 year ago

@BusyJay makes sense, thank you! I will initialize all our nodes with term=1 in their Raft state.

tisonkun commented 1 year ago

it should not have term 0

Perhaps we check it on initial state loaded? I encountered this issue days before also.

BusyJay commented 1 year ago

There are two cases: 1. constructing the raft node with existing data, so the node is already initialized and should have a valid term; 2. constructing the raft node without any data, so the node is not initialized and can have term 0.

Though adding a check that term must not be 0 if configuration is not empty should be OK.

GuyLewin commented 1 year ago

Though adding a check that term must not be 0 if configuration is not empty should be OK.

I can work on a PR for that. We were constructing a raft node without any data, but using pre-vote and priority so we ended up rejecting even before any data was sent, therefore getting this error during rejection message send. But anyway, this check will help ensure this never happens.

BusyJay commented 1 year ago

Please send it when you are free, thanks!

tisonkun commented 1 year ago

Reopened

GuyLewin commented 1 year ago

Created a PR - #513