hashicorp / raft-autopilot

Raft Autopilot
Mozilla Public License 2.0

Detect leader via the delegate #7

Closed vishalnayak closed 3 years ago

vishalnayak commented 3 years ago

In Vault, autopilot's reliance on the leader's address to detect the leader's ID opens up a failure mode.

It is possible for the raft config and the running nodes to have different addresses. Vault hasn't yet implemented the piece where the addresses in the raft config get updated (possibly by re-adding the existing nodes with updated addresses). The main reason is that adding a node is not a straightforward procedure and requires unseal keys et al.

Anyway, if a customer restarts a node with a different cluster address, autopilot, which expects the addresses to match, will start erroring out, and the state API will skip returning some servers.

To get around this, the delegate can now optionally return an `IsLeader` flag as part of the known servers.

This doesn't affect Consul, since the old-style leader detection is still used as a fallback if detection via the delegate fails.
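For illustration, here is a minimal sketch of the detection order described above: trust the delegate's leader flag first, and fall back to address matching only when no server is flagged. The `Server` type, `detectLeaderID`, and `ErrNoLeader` are hypothetical names made up for this example (with plain strings standing in for raft's ID and address types); they are not necessarily how the library implements it.

```go
package main

import (
	"errors"
	"fmt"
)

// Server is a hypothetical, trimmed-down view of what a delegate might
// report for each known server. Only the fields this sketch needs are here.
type Server struct {
	ID       string
	Address  string
	IsLeader bool
}

// ErrNoLeader is returned when neither strategy identifies a leader.
var ErrNoLeader = errors.New("no leader found")

// detectLeaderID returns the ID of the leader among the known servers.
// leaderAddr is the leader address as reported by raft, which may be stale
// relative to the addresses the running nodes actually use.
func detectLeaderID(known map[string]*Server, leaderAddr string) (string, error) {
	// Strategy 1: trust the IsLeader flag supplied by the delegate.
	leaderID := ""
	for id, srv := range known {
		if srv == nil || !srv.IsLeader {
			continue
		}
		if leaderID != "" {
			return "", fmt.Errorf("delegate reported multiple leaders: %q and %q", leaderID, id)
		}
		leaderID = id
	}
	if leaderID != "" {
		return leaderID, nil
	}

	// Strategy 2 (old style): match the leader address against the
	// addresses of the known servers.
	if leaderAddr != "" {
		for id, srv := range known {
			if srv != nil && srv.Address == leaderAddr {
				return id, nil
			}
		}
	}
	return "", ErrNoLeader
}
```

With this ordering, a delegate that never sets the flag (the Consul case) behaves exactly as before, while a delegate that does set it (the Vault case) no longer depends on the raft config's address matching the address the node is actually running with.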

mkeeler commented 3 years ago

On a meta note, if the raft config doesn't contain the updated address yet, then how is raft working at all?

The addresses in the config are used to initiate replication, so it is possible that the leader's address doesn't have to be accurate, but all the others do. You may then want to consider how the addresses can be kept in sync to prevent outages when nodes are restarted.

Also, there is a distinction between general raft data, which is stored via the LogStore, and the raft configuration, which is stored via the StableStore. I haven't thought it through much, but does the stable store need to be guarded by the seal, or would it be sufficient to guard only the log store?

vishalnayak commented 3 years ago

Currently, when nodes are restarted, Vault expects nodes that are in the raft config to keep using the same address. This fix only covers the case where someone attempts to change the addresses during a restart.