
Tarantool documentation
https://www.tarantool.io/en/doc/

[64pt] RAFT implementation in Tarantool (doc draft) #1541

Closed veod32 closed 3 years ago

veod32 commented 4 years ago

The RAFT implementation is to be released in the Tarantool 2.6.1 release. Create a draft version of the documentation.

Gerold103 commented 3 years ago

Leader election and synchronous replication

In Tarantool both are implemented as a modification of Raft, an algorithm for synchronous replication and automatic leader election. Its complete description can be found here: https://raft.github.io/raft.pdf. In Tarantool, synchronous replication and leader election are supported as two separate subsystems. So it is possible to use synchronous replication but rely on something other than Raft for leader election, and vice versa: elect a leader in the cluster but not use synchronous spaces at all. Synchronous replication has a separate documentation section; leader election is described here.

Leader election

Automated leader election in Tarantool helps guarantee that a cluster has at most one leader at any given moment. The leader is a writable node; all other nodes are non-writable and accept exclusively read-only requests. This is useful when an application does not want to support master-master replication and needs to ensure that only one node accepts new transactions and commits them successfully.

When election is enabled, the life cycle of the cluster is divided into so-called 'terms'. Each term is described by a monotonically growing number; after its first boot, each node starts at term 1. When a node sees that it is not a leader and no leader has been available for some time, it increases its term and starts a new leader election round.

Leader election happens via votes. A node that starts an election votes for itself and sends vote requests to the other nodes. A node that receives vote requests votes for the first of them and then can do nothing in the same term but wait for a leader to be elected. If a node collects a quorum of votes, it becomes the leader and notifies the other nodes about that. A split vote can also happen, when no node gets a quorum. In that case all the nodes bump the term again after a random timeout and start a new election round; eventually a leader is elected.

All non-leader nodes are called 'followers'. Nodes that start a new election round are called 'candidates'. The elected leader sends heartbeats to the non-leader nodes to let them know it is alive, so if there are no heartbeats for too long, a new election is started. Terms and votes are persisted by each instance in order to preserve certain Raft guarantees.

During election the nodes prefer to vote for those with the newest data. So if an old leader managed to send something to a quorum of replicas before its death, that data won't be lost.

When election is enabled, connections are required between each pair of nodes, so that they form a full mesh. This is needed because election messages for voting and other internal things require a direct connection between the nodes. Also, when election is enabled on a node, it won't replicate from any node except the newest leader. This avoids the issue of a new leader being elected while the old leader has somehow survived and tries to send more changes to the other nodes. Term numbers also work as a kind of filter. You can be sure that if election is enabled on two nodes, and node1 has a smaller term number than node2, then node2 won't accept any transactions from node1.
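For illustration, a hedged sketch of the full-mesh requirement for a three-instance cluster; the URIs, user, and password here are placeholders, not values from this draft:

```lua
-- Run a similar box.cfg on each of the three instances.
-- The replication list includes every node in the cluster,
-- so each pair of nodes has a direct connection (full mesh).
box.cfg({
    listen = 3301,
    replication = {
        'replicator:password@host1:3301',  -- hypothetical URIs
        'replicator:password@host2:3301',
        'replicator:password@host3:3301',
    },
    election_mode = 'candidate',
})
```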

Configuration

box.cfg({
    election_mode = <string>,
    election_timeout = <seconds>,
    replication_timeout = <seconds>,
    replication_synchro_quorum = <count>,
})

Leader election can be turned on by the option election_mode. The default is off: not active. All nodes having this option != off run a Raft state machine internally, talking to other nodes according to the Raft leader election protocol. When the option is off, the node accepts Raft messages from other nodes, but it does not participate in the election activities, and they do not affect the node's state. So, for example, if a node is not a leader but has election_mode = 'off', it is writable anyway.

You can control which nodes can become a leader, in case you want them to participate in the election process but don't want some of them to become leaders. For that, use election_mode = 'voter'. When the mode is set to 'voter', the election works as usual, but this particular node won't become a leader (it will still vote for other nodes). If the node should be able to become a leader, use election_mode = 'candidate'.
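A minimal sketch of the modes described above (a hedged example, not taken verbatim from the draft):

```lua
-- On a node that should vote but never become a leader:
box.cfg({election_mode = 'voter'})

-- On a node that may be elected leader:
box.cfg({election_mode = 'candidate'})

-- Election off: the node ignores the election outcome
-- and stays writable unless read_only is set.
box.cfg({election_mode = 'off'})
```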

As mentioned, the election has a timeout for the case of a split vote. The timeout can be configured using the election_timeout option. The default is 5 seconds. That is quite big, and for most cases it can be freely lowered to 300-400 ms. It can be a floating point value (300 ms would be box.cfg{election_timeout = 0.3}). To avoid repeating the split vote, the timeout is randomized on each node on every new election, from 100% to 110% of the original timeout value. For example, if the timeout is 300 ms and three nodes started the election simultaneously in the same term, they can set their election timeouts to 300, 310, and 320 respectively, or to 305, 302, and 324, and so on. That way the votes won't be split forever, because the elections on different nodes won't restart simultaneously.
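The timeout from the paragraph above can be lowered like this (a hedged sketch; 0.3 is just an illustrative value):

```lua
-- Lower the split-vote retry timeout from the default 5 s to 300 ms.
-- Each new election round then waits a randomized time in the
-- 100%-110% window, i.e. between 0.3 and 0.33 seconds.
box.cfg({election_timeout = 0.3})
```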

There are other options which affect leader election indirectly.

Heartbeats sent by an active leader have a timeout, after which a new election is started. Heartbeats are sent once per replication_timeout seconds; the default is 1. The leader is considered dead if it didn't send any heartbeats for replication_timeout * 4 seconds.
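A hedged illustration of the death-detection window implied by this option:

```lua
-- With replication_timeout = 1 the leader sends a heartbeat every
-- second, and the followers consider it dead after 1 * 4 = 4 seconds
-- of silence, which triggers a new election round.
box.cfg({replication_timeout = 1})
```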

You can also configure the election quorum. For that, the election reuses the synchronous replication quorum: replication_synchro_quorum. The default is 1, meaning that each node becomes a leader immediately after voting for itself. It is best to set this option to (cluster size / 2) + 1. Otherwise there is no guarantee that there is only one leader at a time.
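The recommended majority can be computed explicitly; a hedged sketch where the 5-node cluster size is only an example:

```lua
-- For a 5-node cluster the majority is floor(5 / 2) + 1 = 3,
-- so at most one candidate can ever collect a quorum per term.
local cluster_size = 5
box.cfg({
    replication_synchro_quorum = math.floor(cluster_size / 2) + 1,
})
```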

Besides, it is necessary to take into account that being a leader is not the only requirement for being writable. A leader should have box.cfg{read_only = false}, and its connectivity quorum should be satisfied (box.cfg{replication_connect_quorum = <count>}) or disabled (box.cfg{replication_connect_quorum = 0}). Nothing prevents setting box.cfg{read_only = true}, but then the leader just won't be writable. The option does not affect the election process, though, so a read-only instance can still vote and become a leader.
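Putting these options together, a hedged sketch of a node that can both be elected and actually accept writes once elected (the quorum value is an example for a three-node cluster):

```lua
box.cfg({
    election_mode = 'candidate',       -- may become a leader
    read_only = false,                 -- writable once elected
    replication_connect_quorum = 0,    -- don't block on connectivity
    replication_synchro_quorum = 2,    -- majority of 3 nodes
})
```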

Monitoring

To see the current state of the node regarding leader election, there is box.info.election.

tarantool> box.info.election
---
- state: follower
  vote: 0
  leader: 0
  term: 1
...

It shows the node's state, term, vote in the current term, and the leader ID of the current term. The IDs in the info output are the replica IDs visible in the box.info.id output on each node and in the _cluster space. vote: 0 means the node didn't vote in the current term. leader: 0 means the node does not know who the leader is in the current term. The state can be follower, candidate, or leader. When election is enabled, the node is writable only in the leader state.
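For example, a hedged sketch of reading these fields from application code rather than the console:

```lua
-- Inspect the election state of the current instance.
local e = box.info.election
if e.state == 'leader' then
    -- this node won the election in term e.term and is writable
elseif e.leader == 0 then
    -- no leader is known in the current term yet
else
    -- e.leader holds the replica ID of the current leader
end
```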

The election implementation based on Raft logs all its actions with the 'RAFT:' prefix: new Raft message handling, state changes, votes, term bumps, and so on.

Important notes to keep in mind

Leader election won't work properly if the election quorum is set to <= cluster size / 2, because in that case a split brain can happen, when two leaders are elected. For example, assume there are 5 nodes and the quorum is set to 2. Node1 and node2 can both vote for node1, while node3 and node4 both vote for node5. Node1 and node5 then both win the election. When the quorum is set to the cluster majority, this can never happen.

This is especially relevant when adding new nodes. If the majority value is going to change, it is better to update the quorum on all the existing nodes before adding a new one.

Also, automated leader election won't bring many benefits in terms of data safety when used without synchronous replication. If, after a new leader is elected, the old leader is still active and thinks it is the leader, nothing stops it from accepting requests from clients and making transactions. Non-synchronous transactions will be successfully committed, because they are not checked against the quorum of replicas. Synchronous transactions will fail, because they won't be able to collect the quorum: most of the replicas will reject the old leader's transactions, since it is not the leader anymore.

Another issue to remember is that when a new leader is elected, it won't automatically finalize the synchronous transactions left from the previous leader. That must be done manually using the box.ctl.clear_synchro_queue() function. In the future this is going to be done automatically.
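The manual step described above might look like this on the newly elected leader (a hedged sketch of one way to call it):

```lua
-- On the new leader, after winning the election: finalize the
-- synchronous transactions left pending by the previous leader.
if box.info.election.state == 'leader' then
    box.ctl.clear_synchro_queue()
end
```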

Gerold103 commented 3 years ago

I updated the comment with a new option election_mode, which substituted the old options election_is_enabled and election_is_candidate.

veod32 commented 3 years ago

https://github.com/tarantool/doc/tree/veod32-gh-1541 (for reference)

veod32 commented 3 years ago

Follow-up https://github.com/tarantool/doc/issues/1652