hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Default to raft protocol 3, handling the upgrade from protocol 2 #7208

Closed: langmartin closed this issue 2 years ago

langmartin commented 4 years ago

Background

Raft protocol 3 is a requirement for autopilot, so we should default to it in 1.0. We should avoid putting too much effort into managing this protocol upgrade, since it's unlikely to be repeated, but enough that upgrading to 1.0 is not unusually manual or dangerous.

The Bug

  1. Follower B is stopped
  2. Follower C is stopped
  3. B' is restarted on version 1.0
  4. Its serf tags are gossiped to the leader A
  5. A removes B. The future returns when the configuration change is committed
  6. A adds B'
  7. A is stopped
  8. C' is restarted. C' only knows about servers {A, B}, both of which are now gone
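
For concreteness, here is a minimal sketch of the leader-side behavior behind steps 5 and 6, using the hashicorp/raft membership API; the function and its wiring are illustrative, not Nomad's actual code. Nothing in it waits for the stopped follower C to replicate either configuration change, which is what leaves C' with a stale member list.

```go
package sketch

import (
	"github.com/hashicorp/raft"
)

// reconcileUpgradedMember shows, in simplified form, what the leader does in
// steps 5 and 6: remove the old entry for B, then add B' back under its new
// identity. Both futures resolve once a quorum of the currently running
// servers commits the change, so the stopped follower C never sees B' in any
// configuration it has stored.
func reconcileUpgradedMember(r *raft.Raft, oldID, newID raft.ServerID, addr raft.ServerAddress) error {
	// Step 5: commit the RemoveServer change for B.
	if err := r.RemoveServer(oldID, 0, 0).Error(); err != nil {
		return err
	}
	// Step 6: commit the AddVoter change for B'.
	return r.AddVoter(newID, addr, 0, 0).Error()
}
```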

The Fix

The bug won't happen if a rolling upgrade waits until the previous membership change has been committed and then only upgrades a machine that has an up-to-date log. For simplicity of tooling, we could just wait until all cluster members have caught up before marking the cluster safe to upgrade.

There are two implementation options: provide operator tooling to allow the cluster administrator to safely perform a manual rolling upgrade, or handle the protocol upgrade internally so that no additional operator burden is imposed.

Operator Tooling

We could bundle a tool with 1.0 that checks the commit index of every machine in the cluster to determine whether it's safe to perform the next upgrade. This could just be a query that hits every server to ensure that the index of the configuration change has been accepted on all of them. In an ordinary rolling upgrade of the servers, this check will pass in less time than it takes for stats dashboards to catch up, so a typical operator would just experience a fairly simple but necessary check at every step of the upgrade process.
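
As a sketch of what that check could compute, the function below compares each server's last log index against the index at which the current raft configuration was written. It is written as if it ran with access to a server's raft handle; a real tool would fetch the same numbers over an API. The fetchLastIndex callback stands in for however the tool would query a server's last log index (an RPC or agent stats endpoint) and is an assumption, not an existing Nomad API.

```go
package sketch

import (
	"fmt"

	"github.com/hashicorp/raft"
)

// safeToUpgradeNext reports whether every server in the current raft
// configuration has accepted the log index at which that configuration was
// written. fetchLastIndex is a placeholder for the per-server query.
func safeToUpgradeNext(r *raft.Raft, fetchLastIndex func(raft.ServerAddress) (uint64, error)) (bool, error) {
	future := r.GetConfiguration()
	if err := future.Error(); err != nil {
		return false, err
	}
	// Index of the most recent membership (configuration) change.
	configIndex := future.Index()

	for _, srv := range future.Configuration().Servers {
		last, err := fetchLastIndex(srv.Address)
		if err != nil {
			return false, fmt.Errorf("querying %s: %w", srv.ID, err)
		}
		if last < configIndex {
			// This server has not replicated the membership change yet;
			// taking another server down now risks the scenario above.
			return false, nil
		}
	}
	return true, nil
}
```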

This check could be generalized into a new requirement for the upgrade process, which may also be a good place to ship data version/consistency preflight checks in the future.

The main risk of operator tooling is that the 1.0 release, which is meant to communicate stability, would require a new manual upgrade step and could leave a cluster needing recovery if it is upgraded too quickly.

Automatic Upgrade

  1. upgrade all the servers to 1.0
  2. the leader waits until ServersMeetMinimumVersion shows that all servers are running 1.0
  3. the leader calls a new UpgradeProtocol RPC on a follower
  4. the leader's raft library commits to raft a RemoveServer configuration change for that follower
  5. that follower receives the RPC and
    1. disconnects
    2. upgrades its raft instance
    3. rejoins the cluster
  6. the leader adds the protocol 3 instance to the raft cluster
  7. the leader waits until the follower's log is up to date
  8. the leader repeats 3-7 for all remaining protocol 2 followers
  9. the leader removes itself from the cluster and upgrades locally

More notes:

  1. the raft protocol 2 instance must leave the cluster first to avoid the illegal state of two raft servers having the same Addr
  2. the raft protocol 3 instance is added to the raft cluster with raft.AddStaging, but that method is a stub that just does an AddVoter (there's a TODO comment in the raft library). Checking the instance's log index is the only way to ensure that it's caught up.
  3. waiting for the follower to become integrated with the cluster before moving on to the next server is necessary to prevent RemoveServer from decreasing the quorum size. We want to keep the quorum size constant while upgrading; a rough sketch of this loop follows.
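
A rough sketch of the per-follower loop (steps 3-7), from the leader's point of view, might look like the following. Everything other than the hashicorp/raft calls (RemoveServer, AddVoter, LastIndex) is an assumption: the UpgradeProtocol RPC does not exist yet, and the callback fields stand in for the version check, the rejoin wait, and the follower stats query. Step 9, where the leader removes itself and upgrades, would run after this loop has converted every follower.

```go
package sketch

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// coordinator wires together what the leader would need for the automatic
// upgrade. The callback fields are assumptions for this sketch, not existing
// Nomad functions.
type coordinator struct {
	r *raft.Raft

	serversMeetMinimumVersion func() bool                                                     // wraps the existing serf/version check (step 2)
	sendUpgradeProtocol       func(raft.ServerID) error                                       // the proposed UpgradeProtocol RPC (step 3)
	waitForRejoin             func(raft.ServerID) (raft.ServerID, raft.ServerAddress, error)  // blocks until the upgraded follower rejoins, possibly under a new ID
	followerLastIndex         func(raft.ServerID) (uint64, error)                             // the follower's last log index, e.g. from server stats
}

// upgradeFollower walks a single follower through steps 3-7. The caller
// (step 8) repeats it for every remaining protocol 2 follower, one at a
// time, so the quorum size never shrinks.
func (c *coordinator) upgradeFollower(id raft.ServerID) error {
	// Step 2: every server must already be running 1.0.
	if !c.serversMeetMinimumVersion() {
		return fmt.Errorf("not all servers are running 1.0 yet")
	}

	// Step 3: tell the follower to upgrade; it acts once it has been removed.
	if err := c.sendUpgradeProtocol(id); err != nil {
		return err
	}

	// Step 4: commit the RemoveServer configuration change. The protocol 2
	// instance must leave first so two raft servers never share an Addr.
	if err := c.r.RemoveServer(id, 0, 0).Error(); err != nil {
		return err
	}

	// Step 5: the follower disconnects, upgrades its raft instance to
	// protocol 3, and rejoins at a (possibly new) address.
	newID, addr, err := c.waitForRejoin(id)
	if err != nil {
		return err
	}

	// Step 6: add it back as a voter. AddVoter stands in for AddStaging,
	// which today is a stub that just does an AddVoter anyway.
	if err := c.r.AddVoter(newID, addr, 0, 0).Error(); err != nil {
		return err
	}

	// Step 7: wait for the follower's log to catch up before the next
	// follower is touched, keeping the effective quorum size constant.
	for {
		last, err := c.followerLastIndex(newID)
		if err != nil {
			return err
		}
		if last >= c.r.LastIndex() {
			return nil
		}
		time.Sleep(time.Second)
	}
}
```
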
github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.