Closed illotum closed 1 year ago
This looks very promising. Initial thought is that we shouldn't need a new command for this and that this should be the behaviour of all ra_join
commands. New commands, although ok, can't be understood by old members if changes are made during a live upgrade so it is preferable to avoid adding new API unless absolutely necessary.
Update after talking to @kjnilsson elsewhere. Removes voter promotion from the log, instead it is based on a single static target the nonvoter has to reach. Non-voters behave like regular followers, but are treated differently:
ra:member_overview
reports a list of "voters" if requested from the leader. I kept ra_voters() type to allow for future extension. The tuple key, {matching, Target}
is subject to discussion ofc.
Thank you @illotum for working on this. Seems we will be able to add our stuff on top of it when time comes. :)
One thing not clear for me, is why is there a need for automatic promotion? For a RabbitMQ use case, I would imagine the benefits to be not that big as the cluster does not change very often?
@luos, at large enough scale nodes are lost regularly. I want to change the "quorum critical" report to acknowledge that the newly joined nodes are not ready, and restarting another member (out of three) will block quorum progress.
After talking to Karl I am reverting the "minimalistic" version of the feature. It is still incomplete, but will be continued with the original design of tracking voter status in the log.
Taking the WIP off. It is unlikely polished, but deserves a review.
Some notes and concerns:
Voter state is part of the Raft log, within the cluster membership record. It is also tracked at a higher level purely for convenience in handling RPCs. This is a fairly superficial improvement compared to the effort and risk of confusion. Totally willing to forgo it.
Non-voting followers do not attempt to get elected, and ignore calls for election. All done within ra_server
limits. Certain election timeouts are still kicked off in ra_server_proc
, which is a waste.
Their match_index is not included in commit computation, yet non-voters are listed in cluster members. This may be confusing to consumers, e.g. when newly added server doesn't block progress at first.
Voter status is reported in overview
. My biggest concern, as RabbitMQ instead directly reads ra_state
table when it reports quorum critical. Issuing overview RPCs feels prohibitively expensive in comparison.
Overall, most of the issues above would be addressed if I move this change into ra_server_proc
and add new state(s). This comes with its own tradeoffs, e.g. code duplication with the follower state. I am prototyping this on a separate branch.
Updated transition diagram:
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
─────▷ voter => {no, Condition} │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│
│
▽ ▷ $ra_join (now can be used to update voter status)
┌──────────────────────────┐ ▶ initial_cluster_members
─────▶│ voter => yes │
└──────────────────────────┘
@illotum hi - I think there are both test and dialyzer failures but the checks pass as the tests aren't actually run atm. Would you mind rebasing from main which should pick up the relevant config to allow the PR ci suite to actually run.
@illotum @kjnilsson and I seem to agree that using the "regular" (existing) Ra node identity for non-voting nodes should be fine. We can reuse it or even consider dropping the extra/new record field.
Rebased and pushed the uid change. It is one of those small things that cascade widely, but a sum PR diff became saner IMO. Notably, local state and ETS now only hold non_voter :: boolean()
, and that's what various reporting tools expose.
It is briefly tested with ra-2.7.0 branch of rabbit. Will deal with dialyzer tomorrow.
This is a draft.My take on #44, mostly a continuation of design discussed there. Some remaining (low-stakes) questions are left in italics, but they will not change much.
Notably, a departure from previous designs, the feature is completely opt-in. Both server start and member addition need be extended according to this example:
Voter status
Voter status is stored in the cluster map of the server state and is part of every
$ra_cluster_update
. Additionally, nodes store their own status at the top level for ease of matching.Nodes also store their own satus in
ra_state
ETS table (breaking change), and present in overview.Effect on Raft
Voters behave as before.
Non-voters ignore election timeouts and do not attempt to vote for others. Consensus calculation takes this into account, and non-voters have no effect on quorum size.
Transitions
ra:start_cluster
is unchanged, for new cluster of nonvoters will never elect a leader.ra:add_member
andra:start_server
now accept a map of server id and desired voter status. Legacy calls default to adding new server as a voter.On every
#append_entries_reply
leader may choose to promote non-voter by issuing new$ra_join
with desired voter status. Currently, only one promotion condition is implemented: non-voter will be promoted when it reaches the leaders log index at the time of joining.