Closed langmartin closed 2 years ago
Background
Raft protocol 3 is a requirement for autopilot, so we should default to it in 1.0. This protocol upgrade is unlikely to be repeated, so we should avoid putting too much effort into managing it. We should still put in enough effort that upgrading to 1.0 is not unusually manual or dangerous.
The Bug
1. B is stopped
2. C is stopped
3. B' is restarted on version 1.0
4. A removes B. The future returns when the configuration change is committed
5. A adds B'
6. A is stopped
7. C' is restarted. C' only knows about servers {A, B}, both of which are now gone
The Fix
The bug cannot occur if the rolling upgrade waits until the previous membership change has been committed and then upgrades a machine whose log is up to date. For simplicity of tooling, we could just wait until all cluster members have caught up before marking the cluster safe to upgrade.
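To make the failure concrete, here is a toy Python model of the sequence in The Bug and of the wait described in this section. It is a sketch, not Nomad or raft code: ToyCluster and its methods are hypothetical, the model ignores quorum, elections, and log contents, and it assumes a restarted server can rejoin only if at least one peer from its last-seen membership is still running.

```python
class ToyCluster:
    """Tracks which membership each server last saw (hypothetical model)."""

    def __init__(self, members):
        self.members = frozenset(members)
        self.running = set(members)
        self.known = {name: self.members for name in members}

    def stop(self, name):
        self.running.discard(name)

    def start(self, name):
        """Start a server; return True if it can reach a peer it knows about."""
        self.running.add(name)
        peers = self.known.get(name, frozenset()) - {name}
        reachable = bool(peers & (self.running - {name}))
        if reachable:
            # A reachable peer brings the server up to date.
            self.known[name] = self.members
        return reachable

    def change_membership(self, remove=(), add=()):
        """Apply a configuration change; only running members observe it."""
        self.members = (self.members - frozenset(remove)) | frozenset(add)
        for name in self.running & self.members:
            self.known[name] = self.members

    def safe_to_upgrade(self):
        """True when every member is running and has seen the latest membership."""
        return all(
            name in self.running and self.known.get(name) == self.members
            for name in self.members
        )


def swap_b(cluster):
    cluster.stop("B")
    cluster.start("B'")  # new identity, not yet a member
    cluster.change_membership(remove=["B"])
    cluster.change_membership(add=["B'"])


def buggy_rolling_upgrade():
    cluster = ToyCluster(["A", "B", "C"])
    cluster.stop("C")  # C goes down before B's change is replicated to it
    swap_b(cluster)
    cluster.stop("A")
    return cluster.start("C")  # C' still believes its peers are {A, B}


def careful_rolling_upgrade():
    cluster = ToyCluster(["A", "B", "C"])
    swap_b(cluster)
    if not cluster.safe_to_upgrade():
        return False
    cluster.stop("C")  # C has already seen the membership containing B'
    cluster.stop("A")  # stopped here only to mirror the buggy sequence
    return cluster.start("C")


if __name__ == "__main__":
    print("buggy upgrade, C' can rejoin:", buggy_rolling_upgrade())
    print("careful upgrade, C' can rejoin:", careful_rolling_upgrade())
```

In the buggy sequence C' comes back knowing only servers that no longer exist. In the careful one C was still running when B' was added, so it can find B' after it restarts.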
There are two implementation options: provide operator tooling to allow the cluster administrator to safely perform a manual rolling upgrade, or handle the protocol upgrade internally so that no additional operator burden is imposed.
Operator Tooling
We could bundle a tool with 1.0 that checks the commit index of every machine in the cluster to determine whether it's safe to perform the next upgrade. This could just be a query that hits every server in the cluster to ensure that the index changing the configuration has been accepted on all of them. In an ordinary rolling upgrade, this check will pass in less time than it takes for stats dashboards to catch up, so a typical operator would just run a fairly simple but necessary check at every step of the upgrade process.
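A minimal sketch of such a check in Python, assuming the tool has already collected each server's last log index (by some query not shown here) and knows the index of the most recent configuration change. The function name, its arguments, and the convention that None means "no response" are assumptions for illustration, not part of Nomad's API.

```python
def check_safe_to_upgrade(config_change_index, last_index_by_server):
    """Return (safe, lagging) for a hypothetical preflight check.

    config_change_index: log index of the latest membership change.
    last_index_by_server: server name -> last log index, or None if the
    server did not answer the query.
    """
    lagging = sorted(
        name
        for name, index in last_index_by_server.items()
        if index is None or index < config_change_index
    )
    return (len(lagging) == 0, lagging)


if __name__ == "__main__":
    safe, lagging = check_safe_to_upgrade(120, {"A": 125, "B": 120, "C": 118})
    if safe:
        print("safe to upgrade the next server")
    else:
        print("wait, servers still behind:", ", ".join(lagging))
```

Treating an unreachable server as lagging keeps the check conservative: the operator is told to wait rather than to proceed on partial information.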
This check could be generalized to be a new requirement for the upgrade process, where it may be a good place to ship data version/consistency preflight checks in the future.
The main risk of operator tooling is that this 1.0 release, which communicates stability, will require a new manual upgrade step and could cause a cluster to need to be recovered if upgraded too quickly.
Automatic Upgrade
1. ServersMeetMinimumVersion shows that all servers are running 1.0
2. UpgradeProtocol RPC on a follower
3. RemoveServer configuration change for that follower
More notes:
- Addr
- raft.AddStaging, but that method is a stub that just does an AddVoter (there's a TODO comment in the raft library). Checking the instance's log index is the only way to ensure that it's caught up.
- RemoveServer from decreasing the quorum size. We want to keep the quorum size constant while upgrading.
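The version gate and the per-follower steps named under Automatic Upgrade can be modeled roughly as follows. This Python sketch only plans the order of the steps named above; it does not perform RPCs or raft operations, and everything that would follow RemoveServer is left out because the text does not specify it. The function names, the version parsing, and the choice to skip the leader are assumptions for illustration.

```python
MIN_VERSION = (1, 0, 0)


def parse_version(text):
    """Parse 'X.Y.Z' (optionally with a '-pre' or '+meta' suffix) into a tuple.

    Assumption: pre-release and build suffixes are ignored.
    """
    core = text.split("-", 1)[0].split("+", 1)[0]
    parts = [int(piece) for piece in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '1.0' to (1, 0, 0)
    return tuple(parts[:3])


def servers_meet_minimum_version(versions, minimum=MIN_VERSION):
    """True when every server reports at least the minimum version."""
    return all(parse_version(v) >= minimum for v in versions.values())


def plan_protocol_upgrade(versions, leader):
    """Return the ordered steps for the followers, or [] if the gate fails."""
    if not servers_meet_minimum_version(versions):
        return []
    plan = []
    for name in sorted(versions):
        if name == leader:
            continue  # only followers are covered by the steps in the text
        plan.append(("UpgradeProtocol", name))
        plan.append(("RemoveServer", name))
    return plan


if __name__ == "__main__":
    for step in plan_protocol_upgrade(
        {"A": "1.0.0", "B": "1.0.0", "C": "1.0.1"}, leader="A"
    ):
        print(step)
```

Returning an empty plan while any server is below 1.0 means nothing starts until the whole cluster can take part in the protocol change.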