hashicorp / raft

Golang implementation of the Raft consensus protocol
Mozilla Public License 2.0

How to bootstrap a raft cluster with a fixed node setting? #532

Closed leiless closed 1 year ago

leiless commented 1 year ago

Hi, All. I'm new to raft.

In my use case, I use a fixed node setting in a config file:

[raft]
bind_address = "localhost:11000"
nodes = [
    "localhost:11000",
    "localhost:11001",
]

All raft nodes will have the same nodes config, so they can start voting as soon as they start running. So if I have three nodes (A, B, and C), each node's config should have the same nodes setting in the config file (i.e. nodes = ["A", "B", "C"]).

The second time I bootstrapped the cluster, I got the bootstrap only works on new clusters error message. I know that bootstrapping should only be done once, but how can I support a fixed raft nodes config like this?

For example, If I change the above config into

[raft]
bind_address = "localhost:11000"
nodes = [
    "localhost:11000",
    "localhost:11002", # Changed
]

On the next raft run, raft should talk to 11000 and 11002 for data exchange, and 11001 should automatically be removed from the raft.Configuration. You can think of the raft nodes config as being tied to the raft.Configuration.

banks commented 1 year ago

Hey @leiless!

The bad news is that, in general, it's not possible to do what you asked for correctly with Raft. The good news is that this is because what you asked for would probably lead to violating all of Raft's consistency guarantees, so it's probably good that it's not easy to do!

The configuration of the set of servers in any consensus protocol is a core part of the correctness of the algorithm, because correctness depends on all correct servers agreeing on whether "consensus" has been reached by a "quorum" of servers - and that is only possible if all the servers agree on how many (and specifically which) other nodes are part of the group making decisions.

That's why you can bootstrap raft only once. After that, the raft.Configuration becomes part of the replicated state itself so that all peers agree. The Raft paper and algorithm actually specify exactly how this configuration change must happen - for example, you can only add or remove a single node at a time, etc. In this library, you can do that by calling AddVoter and RemoveServer on a running raft instance (specifically on the current leader).
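As a rough illustration of what such an online membership change looks like with this library (the node IDs, addresses, and timeout below are made up for the example, not taken from this issue):

package main

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// reconfigure sketches a one-node-at-a-time membership change.
// It must be run against the current leader's raft instance.
func reconfigure(r *raft.Raft) {
	// Add the new member as a voter first...
	if err := r.AddVoter(raft.ServerID("node-c"), raft.ServerAddress("localhost:11002"), 0, 10*time.Second).Error(); err != nil {
		log.Fatalf("add voter: %v", err)
	}
	// ...then remove the member that is no longer in the desired config.
	if err := r.RemoveServer(raft.ServerID("node-b"), 0, 10*time.Second).Error(); err != nil {
		log.Fatalf("remove server: %v", err)
	}
}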

If you just want a way to start over with your cluster in testing/dev, the simplest way is to just delete the raft state each time. If you want a way to "reset" the cluster members without losing the existing data, then you can use RecoverCluster to do this - read all the warnings carefully though, as this is not part of the Raft protocol and violates any correctness/durability guarantees. It's meant only as an emergency procedure if your cluster has lost quorum entirely and you just want to do the best you can with the one server you have left by rebuilding a new cluster but keeping the existing data on that one server.
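For reference, a rough sketch of what invoking RecoverCluster might look like - it has to run while raft is not running on the node, and everything here (stores, transport, the surviving server's ID and address) is illustrative:

package main

import (
	"log"

	"github.com/hashicorp/raft"
)

// recoverWithSurvivor sketches the emergency recovery path: rebuild the
// cluster configuration around the one server you have left, keeping its data.
func recoverWithSurvivor(conf *raft.Config, fsm raft.FSM, logs raft.LogStore,
	stable raft.StableStore, snaps raft.SnapshotStore, trans raft.Transport) {
	// The new desired member set; here just the surviving server (illustrative values).
	newConfig := raft.Configuration{
		Servers: []raft.Server{
			{Suffrage: raft.Voter, ID: "node-a", Address: "localhost:11000"},
		},
	}
	// Must be called before NewRaft on this node.
	if err := raft.RecoverCluster(conf, fsm, logs, stable, snaps, trans, newConfig); err != nil {
		log.Fatalf("recover cluster: %v", err)
	}
}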

For any production raft system, you'll need to handle adding and removing nodes online rather than just in a static config file for the reasons described above.

You just about could make it work by checking on startup whether you need to bootstrap or not (using raft.HasExistingState). If there is existing state, start raft normally, then check whether the on-disk configuration of peers matches the raft configuration (see GetConfiguration) and, if not, "automatically" do the steps needed to migrate. But you'd have to work out the safest way to do that - for example, you'd need to add and remove nodes in a sequence that maintains whatever guarantees are important to you. And this would only work on the leader, so you'd need to wait for a leader to be elected and then have it reconcile the config to raft in the background or something (this is more or less how Consul works, using Serf gossip information to decide if the set of servers is different to before).
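A very rough sketch of that startup flow, under the assumption that the desired member set comes from the config file and that the actual reconciliation is deferred to the leader (the function and variable names are made up for illustration):

package main

import (
	"log"

	"github.com/hashicorp/raft"
)

// startNode sketches the flow described above: bootstrap only if there is
// no existing state, otherwise start normally and compare configurations.
func startNode(conf *raft.Config, fsm raft.FSM, logs raft.LogStore,
	stable raft.StableStore, snaps raft.SnapshotStore, trans raft.Transport,
	desired raft.Configuration) (*raft.Raft, error) {

	hasState, err := raft.HasExistingState(logs, stable, snaps)
	if err != nil {
		return nil, err
	}
	if !hasState {
		// First ever start: bootstrap with the configured member set.
		if err := raft.BootstrapCluster(conf, logs, stable, snaps, trans, desired); err != nil {
			return nil, err
		}
	}

	r, err := raft.NewRaft(conf, fsm, logs, stable, snaps, trans)
	if err != nil {
		return nil, err
	}

	if hasState {
		// Existing cluster: compare the configured member set with what raft has.
		// Any reconciliation (AddVoter/RemoveServer) must happen on the leader,
		// one change at a time, so it would run in the background after election.
		future := r.GetConfiguration()
		if err := future.Error(); err != nil {
			return nil, err
		}
		log.Printf("current raft configuration: %+v, desired: %+v", future.Configuration(), desired)
	}
	return r, nil
}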

Changing Address Only

One extra wrinkle here is that if you want to change the network address of a peer without changing its identity (for example, if a VM has been rebooted and has the same disk state as before but for some reason got a new IP address), then that is possible - you need to specify a UUID or similar that is stored on disk (outside of raft, e.g. in config) on each server and pass that as the LocalID in the node's raft.Config. You'd also need to set the UUIDs of the other servers as well as their addresses in the raft.Configuration you use to bootstrap. Now if a node changes IP address, you still have to call AddVoter with the new address and old ID, and it still is a raft configuration change before the leader can start talking to that node on the new IP again (i.e. a quorum of other nodes needs to be reachable), but raft will realise that it's not a brand new node since it has the same ID as before, so it will still treat the cluster as a cluster of 3 etc. and there is no need to delete the old entry.
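A small sketch of that ID-based setup, with made-up IDs and addresses: each server entry in the bootstrap configuration carries a stable ID stored outside raft, and an address change later becomes an AddVoter call with the same ID on the leader:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// bootstrapWithIDs sketches bootstrapping with stable IDs that live outside
// raft (e.g. in each node's config file). conf.LocalID must be set to this
// node's own stable ID before NewRaft is called.
func bootstrapWithIDs(conf *raft.Config, logs raft.LogStore, stable raft.StableStore,
	snaps raft.SnapshotStore, trans raft.Transport) error {
	configuration := raft.Configuration{
		Servers: []raft.Server{
			{Suffrage: raft.Voter, ID: "node-a", Address: "10.0.0.1:11000"},
			{Suffrage: raft.Voter, ID: "node-b", Address: "10.0.0.2:11000"},
			{Suffrage: raft.Voter, ID: "node-c", Address: "10.0.0.3:11000"},
		},
	}
	return raft.BootstrapCluster(conf, logs, stable, snaps, trans, configuration)
}

// updateAddress sketches the address change: same ID, new address. Run on the leader.
func updateAddress(r *raft.Raft) {
	// node-b kept its disk state and ID but came back on a new IP.
	if err := r.AddVoter("node-b", "10.0.0.9:11000", 0, 10*time.Second).Error(); err != nil {
		log.Fatalf("update address: %v", err)
	}
}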

I know that's a lot to take in but I hope this helped!

leiless commented 1 year ago

Got it! Thanks for your elaborate explanation!

leiless commented 1 year ago

And this would only work on the leader so you'd need to wait for a leader to be elected and then have them reconcile the config to raft in the background or something

@banks How can I tell whether a certain node is the leader on raft startup with blocking I/O (e.g., an unbuffered chan)?

ncabatoff commented 1 year ago

And this would only work on the leader so you'd need to wait for a leader to be elected and then have them reconcile the config to raft in the background or something

@banks How can I tell whether a certain node is the leader on raft startup with blocking I/O (e.g., an unbuffered chan)?

If I understand your question correctly (you want a blocking call that will return the leader once one is elected), you can use LeaderCh to determine whether a given node becomes leader. Alternatively you could poll LeaderWithID to discover when an arbitrary node gains leadership.
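For example, a minimal sketch combining the two approaches (assuming r is this node's raft instance; the one-second poll interval is arbitrary):

package main

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// waitForLeadership blocks until this node gains leadership via LeaderCh,
// while also polling LeaderWithID to report whichever node did win.
func waitForLeadership(r *raft.Raft) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case isLeader := <-r.LeaderCh():
			if isLeader {
				log.Println("this node became leader; safe to reconcile membership here")
				return
			}
		case <-ticker.C:
			// Poll for cluster-wide leadership regardless of which node won.
			if addr, id := r.LeaderWithID(); id != "" {
				log.Printf("current leader is %s at %s", id, addr)
			}
		}
	}
}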

otoolep commented 1 year ago

This is an interesting discussion; I get a similar question about rqlite every so often, which is built on this Raft code. Let me check that the advice I give my users is correct:

  • changing one node's network identifier (but LocalID stays the same) is easy. Just restart the node, and have it rejoin the cluster as a voter. What rqlite actually does if it sees a node join with a node ID already in the config is first remove the node from the config, then add the node again.
  • what confuses folks is that if they shut down every node in their cluster, and then start every node again -- but each node comes up with a different network address -- they are stuck. They have no choice but to use RecoverCluster if they want to use the new network addresses.

This is all correct right? If all the network addresses change, even if the node IDs do not, RecoverCluster is the only option?

I think folks get confused because one node can be restarted with a new network address easily enough, but changing all at once is not possible.

wolfinch commented 1 year ago

This is an interesting discussion, I get a similar question about rqlite every so often, which is built on this Raft code. Let me check that the advice I give my users is correct:

  • changing one node's network identifier (but LocalID stays the same) is easy. Just restart the node, and have it rejoin the cluster as a voter. What rqlite actually does if it sees a node join with a node ID already in the config is first remove the node from the config, then add the node again.
  • what confuses folks is that if they shut down every node in their cluster, and then start every node again -- but each node comes up with a different network address -- they are stuck. They have no choice but to use RecoverCluster if they want to use the new network addresses.

This is all correct right? If all the network addresses change, even if the node IDs do not, RecoverCluster is the only option?

I think folks get confused because one node can be restarted with a new network address easily enough, but changing all at once is not possible.

How do you currently support the case where the cluster is deployed in k8s and the cluster restarts? I am currently building a similar solution and wondering how to handle the cluster-restart scenarios where pods come back with different addresses?

banks commented 1 year ago

This is all correct right? If all the network addresses change, even if the node IDs do not, RecoverCluster is the only option?

That is essentially correct with the default NetworkTransport/TCPTransport in this library, yes. This is because the leader will dial the address in its config, and you can't change that config if the leader can't dial the followers in the config in order to commit the change!

That said, I think it would be possible if you implement a custom StreamLayer for the NetworkTransport for your application to re-map the addresses. I.e. if your application somehow knows that the node with a given UUID now has IP address 10.10.10.10 but the config says 10.20.20.20, you could implement a custom dialler that, when it sees the address 10.20.20.20, actually dials 10.10.10.10. That would allow the nodes to start communicating again based on the current configuration even though all their addresses changed. If this works (I've not tested it but I don't see why it wouldn't), then you could immediately call AddVoter for each member to reset the addresses in config.
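To make that concrete, here is a hedged sketch of such a remapping StreamLayer; the resolve function is hypothetical and would consult whatever source of truth the application has (e.g. a UUID-to-current-IP table):

package main

import (
	"net"
	"time"

	"github.com/hashicorp/raft"
)

// remapStreamLayer wraps a plain TCP listener and remaps stale configured
// addresses to current ones before dialing. resolve is illustrative only.
type remapStreamLayer struct {
	net.Listener
	resolve func(raft.ServerAddress) raft.ServerAddress
}

func (s *remapStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	if s.resolve != nil {
		// e.g. config says 10.20.20.20 but the node now lives at 10.10.10.10.
		addr = s.resolve(addr)
	}
	return net.DialTimeout("tcp", string(addr), timeout)
}

// newRemapTransport wires the layer into the standard NetworkTransport.
func newRemapTransport(bind string, resolve func(raft.ServerAddress) raft.ServerAddress) (*raft.NetworkTransport, error) {
	ln, err := net.Listen("tcp", bind)
	if err != nil {
		return nil, err
	}
	layer := &remapStreamLayer{Listener: ln, resolve: resolve}
	return raft.NewNetworkTransport(layer, 3, 10*time.Second, nil), nil
}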

Alternatively, you could choose to configure Raft with "virtual" addresses from the start that never change in the raft config and map 1:1 to node UUIDs, and then have a custom StreamLayer that knows how to discover the actual IP address/port to dial for a given node, so you never have to change them. I don't believe there is even any hard requirement in Raft that the "address" string be a valid IP address, so you could treat this "address" as just another unique name (e.g. instance ID or K8s Pod ID) as long as your stream layer knows how to map it to a real address to dial!

None of HashiCorp's tools do this today but it seems like a way you could get complete control over addressing behaviour with the library as it exists now. The exact details of how you discover the right set of addresses would be something you'd need to figure out though.

Warning: there are probably a few ways you could break things badly if you try this and don't think through all the edge cases. For example, this is only OK if it really is just the IP that changed. If you ever accidentally map an address in the raft config to a logically different node (different UUID, different state), all bets are off 😄. There are probably similar cases if you actually lose the data as well as the node changing, or especially if you accidentally redirect an address in config to a node that was part of a different raft cluster before, etc. The UUID checks (assuming you configured LocalIDs) should prevent the worst outcomes of arbitrarily corrupting all the data, but it could get messy!

It would be possible for this library to add support for providing a mapping and then do the remapping for you, but given that the logic for that would all be implemented in the StreamLayer, which is already pluggable, I'm not sure how useful that would really be. I think every use of this library in HashiCorp's products already has a custom StreamLayer anyway, for example.

Does this help?

How do you currently support the cases where, the cluster is deployed in k8s and the cluster restarts ?

At least for HashiCorp tools this requires manual recovery using an explicit reconfiguration file we call peers.json. See the docs for details. Under the hood this file is just triggering a setup step that feeds the config into RecoverCluster in the raft lib.

otoolep commented 1 year ago

How do you currently support the case where the cluster is deployed in k8s and the cluster restarts? I am currently building a similar solution and wondering how to handle the cluster-restart scenarios where pods come back with different addresses?

You need to use a stateful set, which keeps the network identifiers stable. See https://rqlite.io/docs/guides/kubernetes/

banks commented 1 year ago

You need to use a stateful set, which keeps the network identifiers stable.

I didn't realise this. In that case I was wrong above about requiring recovery, at least for our official Helm installs, since we use StatefulSets. Thanks for responding!

otoolep commented 1 year ago

This is because the leader will dial the address in its config, and you can't change that config if the leader can't dial the followers in the config in order to commit the change!

Right, exactly.

That said, I think it would be possible if you implement a custom StreamLayer for the NetworkTransport for your application to re-map the addresses. i.e. if you application knows somehow that node with UUID now has IP address 10.10.10.10 but the config says 10.20.20.20, you could implement a custom dialler that when it sees the address 10.20.20.20 actually dials 10.10.10.10.

Something similar occurred to me. You could even have a command-line parameter that tells a given node to override the IP address it sees in its config with one you pass at the command line. But then you need to make sure what you pass to each node when you start it is the same... This whole approach is getting close to what peers.json actually does, so I'm not sure it's worth it.

wolfinch commented 1 year ago

Does it actually? For example, I have a 3-node cluster using a StatefulSet with 3 replicas and local-disk-backed PVs on each pod. Say I have the following pod allocation: node0:pod0, node1:pod1, node2:pod2, and raft is set up with the pod IDs as node IDs. Now, if I restart my 3-node cluster, does K8s guarantee the pod-to-node assignment when it comes back? If not, say we get a node0:pod2, node1:pod0 sort of assignment; when we bring raft up again from the previous state, pod2 will now have the state of the old pod0 - wouldn't that create a problem?

On another note, with a StatefulSet, do you use the FQDN or the pod IP as the network identity?


otoolep commented 1 year ago

I'm not a k8s expert, but the node must come up with the same combo of network identifier (e.g. IP address) and disk state each time. That is the definition of a node. If node state 0 comes up with network identifier 1, there will be a problem. But StatefulSets solve this problem -- at least it works for me, and I've received no reports of problems. You should read the k8s docs -- your question seems separate from rqlite, and is really about StatefulSets generally.

The rqlite k8s configuration uses IP addresses, resolved from the k8s DNS.

banks commented 1 year ago

I'm also not a K8s expert @ldmonko, but do you mean you are using NodePort to expose each member via its node's port? I don't think that will work.

Skimming https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id, it doesn't call this out explicitly, but I assume from the discussion of Kube DNS and stable names that it expects you to use ClusterIPs - at least I'm not aware of it being possible to point Kube DNS Pod addresses at Node IPs in the K8s network model (since that IP could be shared by many pods on different ports).

In general I don't think tying instances to nodes is going to play well with StatefulSets - they are designed to provide a stable pod identity and network address (i.e. clusterIP and DNS name) for each pod regardless of where it is scheduled.

banks commented 1 year ago

I'm going to close this issue as I think we've answered the original question here, and I don't think there is any actual change we can make to Raft that really improves this at this point. Open to continuing any discussion on deployment options here though, and if anyone has a proposal for improved documentation to capture more of this discussion, I suggest we open that as a separate PR.

Thanks folks.

otoolep commented 1 year ago

In general I don't think tying instances to nodes is going to play well with StatefulSets - they are designed to provide a stable pod identity and network address (i.e. clusterIP and DNS name) for each pod regardless of where it is scheduled.

Right, exactly. That's exactly what the recommended rqlite k8s service does. See https://github.com/rqlite/kubernetes-configuration/blob/master/service.yaml