We've been seeing some flakiness around node promotion demotion, so this writeup of how it works, along with some possible issues.

Node promotion

When a node is promoted, the desired state of the node is set to manager in the control api.

The roleManager (github.com/docker/swarmkit/manager/role_manager.go) is a service running on the leader which watches for updates to nodes, and reconciles the desired role with the observed role. When it has gotten an update about a promotion, it just automatically updates the a node's observed role to manager, if the node's desired or observed roles or the node's existence at all hasn't already changed in the meantime.

However, the raft membership isn't updated yet. The node is added to the raft cluster membership when it makes a call to another manager node's raft API (which gets forwarded to the leader) and requests to join. The leader will:

observe the new node's IP address
attempt to contact it back, otherwise not bother adding it to raft
generate a raft ID for it, one that is guaranteed to be unique
send a conf change to all the other nodes adding the new raft member
send the confirmation back to the original joining node, with the raft ID

We don't want to pre-add the raft node to the cluster, because:

we may not know the originating IP
there may be a configuration issue that will prevent it from being able to join, and automatically adding it may break quorum if so
we need to make sure that the raft ID has never been used before

Node demotion

When a node is demoted, before we allow a demotion we do a couple sanity checks:

the last manager of the cluster cannot be demoted
we won't demote a node that's not in the raft memberlist
we won't demote if doing so would cause a loss of quorum

The roleManager, when it gets an update about a demotion, will attempt to reconcile the role in the following manner:

if it's not a member of the raft cluster, or was successfully removed, the node is updated to reflect "worker" as the observed role
if it's a member of the raft cluster, we do not remove immediately if doing so would break quorum
if the leader is being demoted, it attempts to transfer leadership so the new leader can apply the conf change

Node removal

When a node is removed, we do some sanity checks before we remove it:

if its desired role is a manager, and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal
if the node is down, we refuse to remove it unless the force flag is supplied

Once a node is removed, its node ID is added to a list of blacklisted certs - no node with this ID will be allowed to connect with another manager again (we have a custom GRPC authorizer that checks to see if the node ID is in the blacklist). If the cluster has a stored certificate for this node, an expiry date is added so that the list of blacklisted certs can be cleaned up after the last valid cert for the node expires. It's a blacklist rather than a whitelist because if a blacklist fails to propagate in time, a node will, for a time, be able to connect to a cluster when it shouldn't. If the whitelist fails to propagate in time, a node won't be able to connect when it should, and this could destabilize the cluster.

Renewing the TLS certs/manager startup/shutdown after a demotion or promotion

When a node's desired role changes (or the certificate status is IssuanceStateRotate), the dispatcher pushes node changes down to the agent. The agent, upon seeing a node change, including the desired role change, will renew the certificate, and keep trying to renew until it gets a certificate with the expected (desired) role.

The CA server running on the leader only issues certs for the observed node role (so if the node role hasn't been reconciled yet, the CA server will issue a cert for the previous desired node role).

When a node gets a cert for a new role, only then does it officially either start up the manager if it's been promoted, or shut down the manager if it's been demoted. This makes sense for promotions, because there is no point in starting up manager services without a manager cert, since the node will not be able to perform its role as a manager without the manager cert.

Edge cases

@anshulpundir found this in https://github.com/docker/swarmkit/issues/2558: If a node is promoted, but never joins the raft cluster (perhaps it goes down before being able to do so), then it can't ever be demoted. But it can be (force-)removed.
Documented in https://github.com/docker/swarmkit/issues/2548: If a node is demoted, and then immediately removed before the node reconciler can remove it from the raft cluster, and then there's a leader election, the raft node is never removed from the raft cluster and can cause quorum issues.
If node was promoted, then immediately removed before the raft join conf change goes through (which we allow), then it's possible that the raft node gets added to the cluster and never removed. The node itself will be unable to contact the cluster since it will be blacklisted.

Manager quick demotion-promotion

This is I think not an edge case, but it is involved and kinda quirky, and if I'm wrong it's possible the manager will be removed from the cluster but try to rejoin with the same raft ID and that could cause issues where the cluster thinks there might be 2 leaders (see https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration).

If a manager node is demoted and then promoted before the reconciler can successfully demote (possibly due to quorum issues), the reconciler should see that the node's desired state matches it's current state, and do nothing (not demote). This all works because there is a single event loop, so two reconciliations can't happen at the same time (assuming there's a single role manager running at any given time).

If the reconciler managed to remove the node from the raft consensus, but hadn't gotten around to updating the observed role yet, it will never update the observed role. The node will probably stop trying to get a new cert. The node's raft node will detect the conf change about it being removed from raft, and the manager will shut down with a error that the node was removed from raft, and wipe out all data.

The github.com/docker/swarmkit/node/node.go's superviseManager function try to get a worker cert, because the manager was evicted, but will time out after a little while because it will fail to get a worker cert, and then restart the manager.

Worker quick promotion-demotion

A manager can't be demoted if it's not part of the raft cluster, so the demotion would fail until it successfully joined the raft cluster, at which point the demotion logic should kick in.

I think edge cases (2) and (3) can be solved by https://github.com/docker/swarmkit/pull/2551 if I add watching for node deletion events. I don't think updating the removal logic to disallow removal of a manager node before demotion has completed will fix (3), although it will prevent (2), because we can't tell if a node has completed demotion or if it has just failed to complete promotion.

I think edge case 3 should not a possible scenario.

If node was promoted, then immediately removed before the raft join conf change goes through (which we allow), then it's possible that the raft node gets added to the cluster and never removed. The node itself will be unable to contact the cluster since it will be blacklisted.

I think this will not happen since in RemoveNode, we check for the desired state as NodeRoleManager. So, we'll fail that node remove if the node was promoted, but the promotion has not completed yet.

I think this will not happen since in RemoveNode, we check for the desired state as NodeRoleManager. So, we'll fail that node remove if the node was promoted, but the promotion has not completed yet.

From IRL discussion with @anshulpundir, it will fail to remove the desired state is NodeRoleManager and if it's a member of the raft cluster. It's not a member of the cluster yet (promotion not yet completed), the removal is allowed to go through.

Maybe there is room simplify this workflow ? Some questions/ideas:

can we get rid of the desired state to simplify this workflow ?
In demotion: if it's a member of the raft cluster, we do not remove immediately if doing so would break quorum

In this case, we can just fail the demote instead of keep retrying.

Basically, I think we should try to reduce the number of invariants as much as possible by disallowing operations until reconciliation is complete. Thoughts ?

From our previous IRL discussion, I mentioned that the desired role as added later - I misremembered actually, it was the observed role that was added later: https://github.com/docker/swarmkit/pull/1829. The desired role has been there since near the beginning I think: https://github.com/docker/swarmkit/pull/690.

can we get rid of the desired state to simplify this workflow

We'd still have to provide backwards compatibility, so while we can delete it for future versions, we'd still have to support checking it and reconciling it. Also, removing it I think breaks the pattern for everything else in swarm of having the spec be the desired state that the user would like, and the rest of the object storing the current state of the object. Possibly if we wanted to remove one state though, it'd be the observed state.

In this case, we can just fail the demote instead of keep retrying.

We do fail the demote if we can detect if it breaks quorum - there's a quorum check in the control API, and another in the reconciliation loop. https://github.com/docker/swarmkit/pull/1829 mentions that the GRPC call for demote can happen concurrently, meaning that both calls can happen at the same time, both will check to see if demoting will break quorum - both those checks will pass (allowing demotion), and then demoting 2 nodes will break quorum. As per IRL discussion just a second ago though, you mentioned that maybe we could just grab a lock to serialize demotion calls, which could fix this issue.

It looks like previously, prior to https://github.com/docker/swarmkit/pull/1829, the DesiredRole was updated and then the raft member was removed, and that could fail, leaving the cluster in a weird state where it was part of the raft cluster, but in a worker role. I'm wondering if we can just reverse the order of operations - remove the raft member, then "demote". If the leader dies in between the two operations, it's possible this could leave the node again in a weird state where it's been booted out of the cluster, but I think it might be able to recover because it will go into the same logic of the rapid demote-promote mentioned above. That logic flow is also complicated though, and probably not very well tested.

Ah wait, we definitely will hit the failure case I think when demoting the current leader. When we demote the leader, we transfer leadership, and once that's done we won't be able to write the store update to change the role, which means that users will effectively have to demote a node twice if it's the leader. The mismatch between the desired vs observed state I guess then functions sort of as a message to the next leader to say "hey, finish demoting me now".

can we get rid of the desired state to simplify this workflow ?

I agree that the workflow would be simpler to understand as a serial operation, and I will continue to think about it, but at the moment I'm not sure we can get cluster membership correct that way either. The raft cluster membership cannot be immediately synced with the node list on control API changes - they're 2 different things, so they cannot be changed atomically, so we have to account for a failure in the middle of the two changes causing inconsistent state. We can try to roll back, but rollback has the same issue. Retrying to converge to a consistent view of the world seems to be the best strategy; we'd have to work through all of these cases anyway if we remove the reconciliation loop, and advise the user what to do in each case (because they would have to be the ones that would retry, or have to re-initialize the cluster if they accidentally break quorum).

Basically, I think we should try to reduce the number of invariants as much as possible by disallowing operations until reconciliation is complete. Thoughts ?

Just thinking through all the cases here out loud:

Disallow demote while promote is still in progress: promotion happens immediately, but joining the raft cluster does not necessarily happen immediately. We already disallow demotion unless the node is a member of the raft cluster, so this case is covered I think.
Disallow promote while demote is still in progress: we do zero checks when promoting. This allows us to get into the "Manager quick demotion-promotion" scenario, which I think should be correct, but it is really complex to understand as it involves many components all happening to interact in a good way, and could have pretty bad failure consequences. I think it would be good to disallow promote until the observed role becomes "worker", and reconciliation is done. Possibly, to be absolutely safe, we should add an additional state in the "promotion/demotion" workflow; currently there is desired state, reconciled state, which I've been calling observed state, but it's not really observed. We don't actually record the role of the certificate the node is currently acting as. I actually consider a demotion unfinished until the node gets a new cert, or reports the role it is acting as.
Disallow remove while a promotion is in progress: If we disallowed this, then we can't recover from the scenario where a node was promoted and reconciliation finished, but the node could not connect. Because we can't demote that node, either. So that node either has to come online, or it will be stuck in the cluster forever. Maybe we can change (1) instead, but one of the two has to be allowed to account for this failure case. Maybe we require --force in order to remove if the promotion is in progress? Or we require waiting until the node is reported as being down first? Not entirely sure about this.
Disallow remove while a demotion is in progress: currently, we don't remove a node from the raft membership if doing so would break quorum. However, if we allow a node remove to happen before it's removed from the raft membership, then its node ID can get blacklisted and it won't be able to function as a manager, thus breaking quorum anyway. So I think disallowing the remove while a demotion is in progress is a probably a good idea:
- If the node is down already, it shouldn't be counted as a member of the quorum - removing it would not affect quorum, so the fact that it's down should not be affecting whether it can be removed.
- If the node is up, and removing it will break quorum, we should allow it to continue acting as a manager until some other node comes back online or the user intercedes in some other manner to ensure quorum.

Possibly we can provide better UI or introspection into the promote/demote progress - currently users promote or demote a node, and we have no indication that a promotion or demotion is in progress, or what might be blocking it. Users just demote a node, but don't see that anything is happening.

Discussion with @anshulpundir IRL:

We probably cannot make role reconciliation atomic or serial - there needs to be a reconciliation loop. But the reconciliation loop is small and self contained and easy to understand in and of itself, so long as you understand the difference between the desired and observed roles, and not terrible to test in isolation, so it's probably not worth trying to eliminate.
We need better abstraction and documentation around the demotion and promotion workflows - something that encapsulates the state changes (e.g. demotion started, demotion in progress, demotion finished), which wraps the desired/observed states/raft membership. We also need to surface this information in the API better.
We may still need the "observed role", which is the role that the CA uses to issue certificates.
- when a demotion happens, the node needs to first be removed from the raft cluster before it can be demoted. So while the desired state is a worker, the CA should not issue the node a worker certificate, otherwise the node will not be able to connect to the raft cluster anymore, effectively negating us trying to cleanly remove it from raft first before demoting
- when a promotion happens, the node needs to be able to get a manager certificate immediately, otherwise it will not be able to connect to the raft cluster and complete the promotion Perhaps this field should be named "CertificateRole" instead or something, to indicate that this is the role that the CA should be using to issue certificates. The CA could, instead, synthesize this information from the membership lists + desired state: e.g. desired state = manager -> the CA should issue a manager cert, but desired state = worker and node is in raft membership -> CA should not issue a worker cert. However, it may be better to encapsulate a node's role lifecycle in the role manager code, rather than spread it across the CA and the role manager.
We may need to add another field like RoleChangeInProgress to indicate that a node is in the process of being demoted or promoted. The managers know whether a node is in the middle of a change, because they can inspect the raft membership, but it would be good to surface this information up to the user, and this would be way of doing so without returning both the node lists and the membership lists on a node ls. When the desired node state is changed, this field should also be set to true, and the role manager can set this field to false when an operation is done. This may mean (if we want "promotion" to not be finished until a node has joined the raft cluster) that the role manager will now also need to watch for changes in the raft membership.
We should disable promote while demote is still in progress, and disallow remove while demotion is in progress, as per the comment above. We'll return a GRPC error, the engine API should return an error, and the docker CLI should attempt to demote, and then poll node inspect info until RoleChangeInProgress is false.

Do the issues shown in this comment potentially relate to the race conditions mentioned, particularly around demotion and removal?

moby / swarmkit

Node Promotion/Demotion workflow review #2565