In this lab you'll implement Raft as a Go object type with associated methods, meant to be used as a module in a larger service. A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed. At that point, your Raft should send the log entry to the larger service for it to execute.

Follow the design in https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf, pay particular attention to Figure 2. Will not implement "cluster membership changes" (chapter 6).

Resources:

Guide to raft, part 2D has changed since this blog post was written
Raft locking advice
Raft structure advice
Paxos, Chubby, Paxos Made Live, Spanner, Zookeeper, Harp, Viewstamped Replication
http://static.usenix.org/event/nsdi11/tech/full_papers/Bolosky.pdf
Diagram of Raft interaction
https://blog.josejg.com/debugging-pretty/

Part 2A

Implement Raft leader election and heartbeats (AppendEntries RPCs with no log entries). The goal for Part 2A is for a single leader to be elected, for the leader to remain the leader if there are no failures, and for a new leader to take over if the old leader fails or if packets to/from the old leader are lost

Leader election

Based on the rules for candidates in fig 2, a candidate should send vote requests and reset the election time out in parallel

After getting elected, leader should send heartbeat messages immediately

Scenarios

Suppose that there are 5 servers in the cluster, all of them start out as follower

4 followers, 1 candidate

The candidate obtains the majority vote and becomes the leader. The new leader then sends heartbeat messages to the followers.

3 followers, 2 candidates

1 candidate beats the other to become the leader

Happy path: The leader sends heartbeat messages to all servers, and they all receive the message. The other candidate will become a follower after returning the heartbeat message.

Heartbeat from leader fails to reach the other candidate: When the candidate's election timeout period elapses, it will start the election process again. Let's say that the term that each server has looks like:

Leader: 1
Candidate: 1
Followers: 1

When the candidate starts the election process again, its term will be come 2 and sends out vote requests. Before all servers reply with the vote response, the candidate receives a heartbeat message from the current leader. The candidate's term is now 2, while the leader's term is 1. What should happen in this case?

Should the leader become a follower after receiving the candidate's response? <—- paper says that this should happen (end of section 5.1)
Should the candidate convert to a follower?

Also, how should other servers (not candidate) process the vote requests?

The paper says that each server can vote for at most 1 candidate in a term. So these servers already voted for the current leader in the current term. But their current term is 1, while the candidate's term is 2. If the followers update their term, they are technically in a new term. Can they then vote for the candidate? Also, should the followers update their terms before or after they decide whether to vote for a candidate?

3 followers, 1 candidate, 1 leader

AppendEntries

How do we define the idle period where heartbeat messages should be sent?

Part 2B

Read sections 5.3, 5.4 and 5.5 of the paper.

[x] Implement log indices in a way that is not dependent on array indices
[x] Double-check logic that check candidate is at least up-to-date during RequestVotes
[x] Make heartbeat messages replicate logs when needed
[x] Implement replicating logs through AppendEntries
[x] Implement committing log messages
[x] Send message through ApplyChan when log is committed
[x] Implement optimization while resolving log inconsistencies
[x] Add more logging and basic request tracing
[x] Try to make some test cases complete more quickly
[ ] Audit for edge cases and race conditions. Think through what happens in each case. Also think about, performance, reliability and consistency in these cases

When leader receives request from client

appends the command to its log as a new entry
issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry
When the entry has been safely replicated, the leader applies the entry to its state machine and returns the result of that execution to the client
If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries
A log entry is committed once the leader that created the entry has replicated it on a majority of the servers
- This also commits all preceding entries in the leader’s log, including entries created by previous leaders
- Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order)

Question

Say we have the following scenario:

The leader is retrying an AppendEntry request to a follower
Client sends new command, so leader sends new AppendEntry request to followers
This new AppendEntry request is successful on all followers. This means that the log entry that the leader was trying to send in step 1 has been committed on all followers

What should happen to the request in step 1 now? Should it stop? Should the follower no-op while handling it?

Changes to AppendEntries

The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out

When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries.

When handling AppendEntries

prevLogIndex: index of log entry immediately before the new entries prevLogTerm: term of the log at prevLogIndex on the leader/candidate

log indices start at 1

Return false if the server's log at prevLogIndex does not have the same term as prevLogTerm.

What are the possible scenarios?

When the very first leader gets elected

prevLogIndex: 0 - the very first log would have index 1, so prevLogIndex should be 0 since there's no logs yet prevLogTerm: 0 - the term starts off at 0 on boot, and no log should have been added yet

If the server doesn't have any logs yet, then continue processing.

If the server has any logs, return false.

When a leader is elected halfway through

prevLogIndex: n prevLogTerm: t

If prevLogIndex < len(server's logs), the current server must have a log at that index. Compare prevLogTerm with server's logs[prevLogIndex-1] (prevLogIndex-1 because indices start from 1).

If prevLogIndex == len(server's logs), same as above.

If prevLogIndex > len(server's logs), server cannot contain any log at prevLogIndex, so return false.

Handling inconsistencies

However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log

To bring a follower’s log into consistency with its own, the leader must:

find the latest log entry where the two logs agree
delete any entries in the follower’s log after that point
send the follower all of the leader’s entries after that point All of these actions happen in response to the consistency check performed by AppendEntriesRPCs.
The leader maintains a nextIndex for each follower
When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC
After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match.
When this happens,AppendEntrieswill succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).

Leader completeness property

The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Committing entries from previous terms

If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry.

A leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers.

Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.

Part 2D

Your Raft must provide the following function that the service can call with a serialized snapshot of its state:

Snapshot(index int, snapshot []byte)

In Lab 2D, the tester calls Snapshot() periodically. In Lab 3, you will write a key/value server that calls Snapshot(); the snapshot will contain the complete table of key/value pairs. The service layer calls Snapshot() on every peer (not just on the leader).

The index argument indicates the highest log entry that's reflected in the snapshot. Raft should discard its log entries before that point. You'll need to revise your Raft code to operate while storing only the tail of the log.

You'll need to implement the InstallSnapshot RPC discussed in the paper that allows a Raft leader to tell a lagging Raft peer to replace its state with a snapshot. You will likely need to think through how InstallSnapshot should interact with the state and rules in Figure 2.

When a follower's Raft code receives an InstallSnapshot RPC, it can use the applyCh to send the snapshot to the service in an ApplyMsg. The ApplyMsg struct definition already contains the fields you will need (and which the tester expects). Take care that these snapshots only advance the service's state, and don't cause it to move backwards.

If a server crashes, it must restart from persisted data. Your Raft should persist both Raft state and the corresponding snapshot. Use the second argument to persister.Save() to save the snapshot. If there's no snapshot, pass nil as the second argument.

When a server restarts, the application layer reads the persisted snapshot and restores its saved state.

A "snapshot" will be considered one of the persisted states of the server -> Need to include it when writing and reading states from disk.

For Snapshot(index int, snapshot []byte)

[x] Create a snapshot that contains
- The snapshot slice
- Last included index (the given index)
- Last included term
[x] Deletes all log entries up to and including index
[x] Persists the server's state to disk
[ ] To avoid blocking other operations that write to the logs while a snapshot is being taken. Use copy-on-write techniques. The paper's implementation creates an in-memory snapshot of an entire state machine.

Think about snapshots changes Figure 2

What if the entries compacted to a snapshot have not been committed yet? How do we commit them?

For InstallSnapshot RPC

When a follower's Raft code receives an InstallSnapshot RPC, it can use the applyCh to send the snapshot to the service in an ApplyMsg.
Take care that these snapshots only advance the service's state, and don't cause it to move backwards.

jamestran201 / mit-distributed-systems-labs

Lab 2: Raft #2