rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Other
813 stars 96 forks source link

Non-voters and automatic promotion #375

Closed illotum closed 1 year ago

illotum commented 1 year ago

This is a draft.

My take on #44, mostly a continuation of design discussed there. Some remaining (low-stakes) questions are left in italics, but they will not change much.

Notably, a departure from previous designs, the feature is completely opt-in. Both server start and member addition need be extended according to this example:

UId = ra:new_uid(ClusterName),
New = #{id => Id, membership => promotable, uid => UId},
ok = ra:start_server(default, ClusterName, New, add_machine(), [ServerRef]),
{ok, _, _} = ra:add_member(ServerRef, New),

Voter status

Voter status is stored in the cluster map of the server state and is part of every $ra_cluster_update. Additionally, nodes store their own status at the top level for ease of matching.

Nodes also store their own satus in ra_state ETS table (breaking change), and present in overview.

Effect on Raft

Voters behave as before.

Non-voters ignore election timeouts and do not attempt to vote for others. Consensus calculation takes this into account, and non-voters have no effect on quorum size.

Transitions

ra:start_cluster is unchanged, for new cluster of nonvoters will never elect a leader.

ra:add_member and ra:start_server now accept a map of server id and desired voter status. Legacy calls default to adding new server as a voter.

On every #append_entries_reply leader may choose to promote non-voter by issuing new $ra_join with desired voter status. Currently, only one promotion condition is implemented: non-voter will be promoted when it reaches the leaders log index at the time of joining.

kjnilsson commented 1 year ago

This looks very promising. Initial thought is that we shouldn't need a new command for this and that this should be the behaviour of all ra_join commands. New commands, although ok, can't be understood by old members if changes are made during a live upgrade so it is preferable to avoid adding new API unless absolutely necessary.

illotum commented 1 year ago

Update after talking to @kjnilsson elsewhere. Removes voter promotion from the log, instead it is based on a single static target the nonvoter has to reach. Non-voters behave like regular followers, but are treated differently:

  1. Leader doesn't count non voters when it computes required majority to raise commit index. This also means the newly-added members do not block cluster from progressing as is visible in certain tests I had to edit.
  2. Pre vote requests from non-voters are ignored.
  3. Pre vote and election responses from non voters are allowed (there's no way to tell who the ack is from). For this reason they also count towards required majority.

ra:member_overview reports a list of "voters" if requested from the leader. I kept ra_voters() type to allow for future extension. The tuple key, {matching, Target} is subject to discussion ofc.

luos commented 1 year ago

Thank you @illotum for working on this. Seems we will be able to add our stuff on top of it when time comes. :)

One thing not clear for me, is why is there a need for automatic promotion? For a RabbitMQ use case, I would imagine the benefits to be not that big as the cluster does not change very often?

illotum commented 1 year ago

@luos, at large enough scale nodes are lost regularly. I want to change the "quorum critical" report to acknowledge that the newly joined nodes are not ready, and restarting another member (out of three) will block quorum progress.


After talking to Karl I am reverting the "minimalistic" version of the feature. It is still incomplete, but will be continued with the original design of tracking voter status in the log.

illotum commented 1 year ago

Taking the WIP off. It is unlikely polished, but deserves a review.

Some notes and concerns:

Voter state is part of the Raft log, within the cluster membership record. It is also tracked at a higher level purely for convenience in handling RPCs. This is a fairly superficial improvement compared to the effort and risk of confusion. Totally willing to forgo it.

Non-voting followers do not attempt to get elected, and ignore calls for election. All done within ra_server limits. Certain election timeouts are still kicked off in ra_server_proc, which is a waste.

Their match_index is not included in commit computation, yet non-voters are listed in cluster members. This may be confusing to consumers, e.g. when newly added server doesn't block progress at first.

Voter status is reported in overview. My biggest concern, as RabbitMQ instead directly reads ra_state table when it reports quorum critical. Issuing overview RPCs feels prohibitively expensive in comparison.

Overall, most of the issues above would be addressed if I move this change into ra_server_proc and add new state(s). This comes with its own tradeoffs, e.g. code duplication with the follower state. I am prototyping this on a separate branch.


Updated transition diagram:

      ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
─────▷  voter => {no, Condition} │
      └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
                    │                                                  
                    │                                                  
                    ▽                    ▷ $ra_join (now can be used to update voter status)
      ┌──────────────────────────┐       ▶ initial_cluster_members                   
─────▶│       voter => yes       │
      └──────────────────────────┘                                     
kjnilsson commented 1 year ago

@illotum hi - I think there are both test and dialyzer failures but the checks pass as the tests aren't actually run atm. Would you mind rebasing from main which should pick up the relevant config to allow the PR ci suite to actually run.

michaelklishin commented 1 year ago

@illotum @kjnilsson and I seem to agree that using the "regular" (existing) Ra node identity for non-voting nodes should be fine. We can reuse it or even consider dropping the extra/new record field.

illotum commented 1 year ago

Rebased and pushed the uid change. It is one of those small things that cascade widely, but a sum PR diff became saner IMO. Notably, local state and ETS now only hold non_voter :: boolean(), and that's what various reporting tools expose.

It is briefly tested with ra-2.7.0 branch of rabbit. Will deal with dialyzer tomorrow.