tikv / raft-rs

Raft distributed consensus algorithm implemented in Rust.
Apache License 2.0
2.93k stars 394 forks source link

Snapshot gets rejected unless the node is already a part of the ConfState #472

Closed rosik closed 2 years ago

rosik commented 2 years ago

Describe the bug A node rejects a snapshot unless it's in the ConfState.

I've modified five_mem_node example to reproduce the problem. See https://github.com/rosik/raft-rs/commit/97bb292a8851ed9becbc29d4287c4a10fa93d143.

Instead of 5 voters now there's 1 voter and 1 learner. Upon the learner initialization it rejects MsgAppend (due to the log gap) and the voter decides to send a snapshot. Now, instead of generating a new snapshot for every confchange the voter sends the initial snapshot, not the latest one. Since its index is still > request_index, I expect it to be applied, but raft complains:

DEBG Sending from 1 to 2, msg: msg_type: MsgSnapshot to: 2 snapshot {metadata {conf_state {voters: 1} index: 1 term: 1}}, to: 2, from: 1, raft_id: 1, tag: peer_1
WARN attempted to restore snapshot but it is not in the ConfState, conf_state: voters: 1, raft_id: 2, tag: peer_2

Expected behavior Snapshot is applied. And all subsequent log entries too.

To Reproduce

git clone git@github.com:rosik/raft-rs.git
cd raft-rs/
git checkout rejected-snapshot-repro
cargo run --example five_mem_node

Additional context

Logs ```text Apr 01 16:30:57.683 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }, raft_id: 1, tag: peer_1 Apr 01 16:30:57.683 DEBG reset election timeout 0 -> 16 at 0, election_elapsed: 0, timeout: 16, prev_timeout: 0, raft_id: 1, tag: peer_1 Apr 01 16:30:57.683 INFO became follower at term 1, term: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:57.684 INFO newRaft, peers: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, last term: 1, last index: 1, applied: 1, commit: 1, term: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:57.684 INFO RawNode created with id 1., id: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:59.304 INFO starting a new election, term: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:59.304 DEBG reset election timeout 16 -> 12 at 0, election_elapsed: 0, timeout: 12, prev_timeout: 16, raft_id: 1, tag: peer_1 Apr 01 16:30:59.305 INFO became candidate at term 2, term: 2, raft_id: 1, tag: peer_1 Apr 01 16:30:59.305 DEBG reset election timeout 12 -> 11 at 0, election_elapsed: 0, timeout: 11, prev_timeout: 12, raft_id: 1, tag: peer_1 Apr 01 16:30:59.306 INFO became leader at term 2, term: 2, raft_id: 1, tag: peer_1 Apr 01 16:30:59.306 DEBG persisted index 3, raft_id: 1, tag: peer_1 Apr 01 16:30:59.306 DEBG committing index 3, index: 3, raft_id: 1, tag: peer_1 Apr 01 16:30:59.307 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {2}, learners_next: {}, auto_leave: false }, raft_id: 1, tag: peer_1 Apr 01 16:30:59.308 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 2 index: 2 entries {entry_type: EntryConfChange term: 2 index: 3 data: "\020\002\030\002"} commit: 3, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:59.308 INFO We get a 2 nodes Raft cluster now, now propose 3 proposals Apr 01 16:30:59.319 DEBG persisted index 4, raft_id: 1, tag: peer_1 Apr 01 16:30:59.319 DEBG committing index 4, index: 4, raft_id: 1, tag: peer_1 Apr 01 16:30:59.323 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }, raft_id: 2, tag: peer_2 Apr 01 16:30:59.323 DEBG reset election timeout 0 -> 19 at 0, election_elapsed: 0, timeout: 19, prev_timeout: 0, raft_id: 2, tag: peer_2 Apr 01 16:30:59.323 INFO became follower at term 0, term: 0, raft_id: 2, tag: peer_2 Apr 01 16:30:59.323 INFO newRaft, peers: Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }, last term: 0, last index: 0, applied: 0, commit: 0, term: 0, raft_id: 2, tag: peer_2 Apr 01 16:30:59.323 INFO RawNode created with id 2., id: 2, raft_id: 2, tag: peer_2 Apr 01 16:30:59.323 INFO received a message with higher term from 1, msg type: MsgAppend, message_term: 2, term: 0, from: 1, raft_id: 2, tag: peer_2 Apr 01 16:30:59.324 DEBG reset election timeout 19 -> 19 at 0, election_elapsed: 0, timeout: 19, prev_timeout: 19, raft_id: 2, tag: peer_2 Apr 01 16:30:59.324 INFO became follower at term 2, term: 2, raft_id: 2, tag: peer_2 Apr 01 16:30:59.324 DEBG rejected msgApp [logterm: 2, index: 2] from 1, logterm: Ok(0), index: 2, from: 1, msg_index: 2, msg_log_term: 2, raft_id: 2, tag: peer_2 Apr 01 16:30:59.324 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 2 reject: true, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 01 16:30:59.330 DEBG received msgAppend rejection, index: 2, from: 2, reject_hint_term: 0, reject_hint_index: 0, raft_id: 1, tag: peer_1 Apr 01 16:30:59.330 DEBG decreased progress of 2, progress: Progress { matched: 0, next_idx: 1, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: true, ins: Inflights { start: 0, count: 0, buffer: [], cap: 256, incoming_cap: None }, commit_group_id: 0, committed_index: 0 }, raft_id: 1, tag: peer_1 Apr 01 16:30:59.331 DEBG [firstindex: 2, commit: 4] sent snapshot[index: 1, term: 1] to 2, progress: Progress { matched: 0, next_idx: 1, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: true, ins: Inflights { start: 0, count: 0, buffer: [], cap: 256, incoming_cap: None }, commit_group_id: 0, committed_index: 0 }, to: 2, snapshot_term: 1, snapshot_index: 1, committed: 4, first_index: 2, raft_id: 1, tag: peer_1 Apr 01 16:30:59.331 DEBG paused sending replication messages to 2, progress: Progress { matched: 0, next_idx: 1, state: Snapshot, paused: false, pending_snapshot: 1, pending_request_snapshot: 0, recent_active: true, ins: Inflights { start: 0, count: 0, buffer: [], cap: 256, incoming_cap: None }, commit_group_id: 0, committed_index: 0 }, raft_id: 1, tag: peer_1 Apr 01 16:30:59.331 DEBG Sending from 1 to 2, msg: msg_type: MsgSnapshot to: 2 snapshot {metadata {conf_state {voters: 1} index: 1 term: 1}}, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 01 16:30:59.331 DEBG persisted index 5, raft_id: 1, tag: peer_1 Apr 01 16:30:59.331 DEBG committing index 5, index: 5, raft_id: 1, tag: peer_1 Apr 01 16:30:59.333 WARN attempted to restore snapshot but it is not in the ConfState, conf_state: voters: 1, raft_id: 2, tag: peer_2 Apr 01 16:30:59.333 INFO [commit: 0] ignored snapshot [index: 1, term: 1], snapshot_term: 1, snapshot_index: 1, commit: 0, raft_id: 2, tag: peer_2 Apr 01 16:30:59.333 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 01 16:30:59.342 DEBG persisted index 6, raft_id: 1, tag: peer_1 Apr 01 16:30:59.342 DEBG committing index 6, index: 6, raft_id: 1, tag: peer_1 Apr 01 16:30:59.343 INFO Propose 3 proposals success! ```
BusyJay commented 2 years ago

As explained in comments

https://github.com/tikv/raft-rs/blob/2357cb22760719bcd107a90d1e64ef505bdb1e15/src/raft.rs#L2559-L2561

This is the expected behavior. I suggest to regenerate the snapshot if there is a conf change. Sending a snapshot is usually a heavier actions than sending entries. It may not be optimal to still catching up progress by logs as the price is paid.

rosik commented 2 years ago

Ok, thanks for the link. If I understand you correctly, the restriction is only caused by the snapshot implementation details, not the algorithm itself. As Ongaro PhD states, that:

A server accepts AppendEntries requests from a leader that is not part of the server’s latest configuration. Otherwise, a new server could never be added to the cluster (it would never accept any log entries preceding the configuration entry that adds the server).

And raft-rs handles this case properly, as far as I can judge by my tiny experiments with five_mem_node.

It may not be optimal to still catching up progress by logs as the price is paid.

I was thinking about the case, when several confchanges come one after another. While the snapshot is generated (SnapshotTemporarilyUnavailable) here comes the second request. And when the one snap finishes we have to throw it away and go generate one more for another peer.

lots of code here and there assumes that r.id is in the progress tracker.

Am I to understand that removing it isn't that easy to remove as it seems to? If yes, feel free closing the issue.

Here's the experiment 1. Use the same configuration to initialize every node ```rust MemStorage::new_with_conf_state(ConfState::from((vec![1], vec![]))) ``` 2. Propose some changes 3. Propose the configuration change, add the learner `peer_2`. 4. Propose some more changes 5. Make sure (by debug logs) the learner applies it all. ```text Apr 02 10:28:44.307 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }, raft_id: 1, tag: peer_1 Apr 02 10:28:44.308 DEBG reset election timeout 0 -> 17 at 0, election_elapsed: 0, timeout: 17, prev_timeout: 0, raft_id: 1, tag: peer_1 Apr 02 10:28:44.308 INFO became follower at term 0, term: 0, raft_id: 1, tag: peer_1 Apr 02 10:28:44.308 INFO newRaft, peers: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, last term: 0, last index: 0, applied: 0, commit: 0, term: 0, raft_id: 1, tag: peer_1 Apr 02 10:28:44.308 INFO RawNode created with id 1., id: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:44.308 INFO We get a single-node Raft cluster now, now propose 3 proposals Apr 02 10:28:46.036 INFO starting a new election, term: 0, raft_id: 1, tag: peer_1 Apr 02 10:28:46.037 DEBG reset election timeout 17 -> 19 at 0, election_elapsed: 0, timeout: 19, prev_timeout: 17, raft_id: 1, tag: peer_1 Apr 02 10:28:46.038 INFO became candidate at term 1, term: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.038 DEBG reset election timeout 19 -> 12 at 0, election_elapsed: 0, timeout: 12, prev_timeout: 19, raft_id: 1, tag: peer_1 Apr 02 10:28:46.039 INFO became leader at term 1, term: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.039 DEBG persisted index 2, raft_id: 1, tag: peer_1 Apr 02 10:28:46.040 DEBG committing index 2, index: 2, raft_id: 1, tag: peer_1 Apr 02 10:28:46.051 DEBG persisted index 3, raft_id: 1, tag: peer_1 Apr 02 10:28:46.051 DEBG committing index 3, index: 3, raft_id: 1, tag: peer_1 Apr 02 10:28:46.063 DEBG persisted index 4, raft_id: 1, tag: peer_1 Apr 02 10:28:46.063 DEBG committing index 4, index: 4, raft_id: 1, tag: peer_1 Apr 02 10:28:46.066 INFO Propose 3 proposals success! Apr 02 10:28:46.066 INFO Now add the learner peer_2 Apr 02 10:28:46.076 DEBG persisted index 5, raft_id: 1, tag: peer_1 Apr 02 10:28:46.076 DEBG committing index 5, index: 5, raft_id: 1, tag: peer_1 Apr 02 10:28:46.076 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {2}, learners_next: {}, auto_leave: false }, raft_id: 1, tag: peer_1 Apr 02 10:28:46.078 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 4 entries {entry_type: EntryConfChange term: 1 index: 5 data: "\020\002\030\002"} commit: 5, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.078 INFO We get a 2 nodes Raft cluster now, now propose 3 proposals Apr 02 10:28:46.088 DEBG persisted index 6, raft_id: 1, tag: peer_1 Apr 02 10:28:46.088 DEBG committing index 6, index: 6, raft_id: 1, tag: peer_1 Apr 02 10:28:46.094 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }, raft_id: 2, tag: peer_2 Apr 02 10:28:46.094 DEBG reset election timeout 0 -> 13 at 0, election_elapsed: 0, timeout: 13, prev_timeout: 0, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 INFO became follower at term 0, term: 0, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 INFO newRaft, peers: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, last term: 0, last index: 0, applied: 0, commit: 0, term: 0, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 INFO RawNode created with id 2., id: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 INFO received a message with higher term from 1, msg type: MsgAppend, message_term: 1, term: 0, from: 1, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 DEBG reset election timeout 13 -> 15 at 0, election_elapsed: 0, timeout: 15, prev_timeout: 13, raft_id: 2, tag: peer_2 Apr 02 10:28:46.095 INFO became follower at term 1, term: 1, raft_id: 2, tag: peer_2 Apr 02 10:28:46.096 DEBG rejected msgApp [logterm: 1, index: 4] from 1, logterm: Ok(0), index: 4, from: 1, msg_index: 4, msg_log_term: 1, raft_id: 2, tag: peer_2 Apr 02 10:28:46.096 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 4 reject: true, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.100 DEBG received msgAppend rejection, index: 4, from: 2, reject_hint_term: 0, reject_hint_index: 0, raft_id: 1, tag: peer_1 Apr 02 10:28:46.100 DEBG decreased progress of 2, progress: Progress { matched: 0, next_idx: 1, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: true, ins: Inflights { start: 0, count: 0, buffer: [], cap: 256, incoming_cap: None }, commit_group_id: 0, committed_index: 0 }, raft_id: 1, tag: peer_1 Apr 02 10:28:46.100 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 entries {term: 1 index: 1} commit: 6, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.100 DEBG persisted index 7, raft_id: 1, tag: peer_1 Apr 02 10:28:46.100 DEBG committing index 7, index: 7, raft_id: 1, tag: peer_1 Apr 02 10:28:46.105 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 1 commit: 1, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.105 DEBG persisted index 1, raft_id: 2, tag: peer_2 Apr 02 10:28:46.111 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 1 entries {term: 1 index: 2 data: "put 0 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 2 entries {term: 1 index: 3 data: "put 1 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 3 entries {term: 1 index: 4 data: "put 2 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 4 entries {entry_type: EntryConfChange term: 1 index: 5 data: "\020\002\030\002"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 5 entries {term: 1 index: 6 data: "put 0 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 6 entries {term: 1 index: 7 data: "put 1 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.112 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 7 entries {term: 1 index: 8 data: "put 2 hello, world"} commit: 7, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.113 DEBG persisted index 8, raft_id: 1, tag: peer_1 Apr 02 10:28:46.113 DEBG committing index 8, index: 8, raft_id: 1, tag: peer_1 Apr 02 10:28:46.113 DEBG Sending from 1 to 2, msg: msg_type: MsgAppend to: 2 log_term: 1 index: 8 commit: 8, to: 2, from: 1, raft_id: 1, tag: peer_1 Apr 02 10:28:46.113 INFO Propose 3 proposals success! Apr 02 10:28:46.115 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 2 commit: 2, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.115 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 3 commit: 3, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.115 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 4 commit: 4, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 5 commit: 5, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 6 commit: 6, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 7 commit: 7, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 8 commit: 7, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG Sending from 2 to 1, msg: msg_type: MsgAppendResponse to: 1 index: 8 commit: 8, to: 1, from: 2, raft_id: 2, tag: peer_2 Apr 02 10:28:46.116 DEBG persisted index 8, raft_id: 2, tag: peer_2 Apr 02 10:28:46.119 INFO switched to configuration, config: Configuration { voters: Configuration { incoming: Configuration { voters: {1} }, outgoing: Configuration { voters: {} } }, learners: {2}, learners_next: {}, auto_leave: false }, raft_id: 2, tag: peer_2 ```
BusyJay commented 2 years ago

If I understand you correctly, the restriction is only caused by the snapshot implementation details

Yes.

when several confchanges come one after another

In that case, I would suggest to use joint consensus to add several learners at once and wait till them catch up logs, promote them at once.