lni / dragonboat

A feature complete and high performance multi-group Raft library in Go.
Apache License 2.0

Suggested operations for moving raft shards across replicas in cluster (v4) #377

Open danthegoodman1 opened 5 days ago

danthegoodman1 commented 5 days ago

I'm playing with using multi-raft in a range-partitioned DB, and one thing I am working on in particular is being able to move raft shards around the cluster to balance load. I have a couple of questions and am looking for any general advice on the topic, specifically on:

  1. Moving raft shards among the cluster
  2. Transferring the leader to the intended node
  3. Moving raft shards back to nodes they were previously moved away from

Moving raft shards

My assumption for moving raft shards is to:

  1. Join new replicas to the shard as voting replicas (e.g. join 3 more to the initial 3 in a worst-case move)
  2. Once all new replicas have successfully joined, start deleting the initial replicas
  3. Once all old replicas are removed, NodeHost.RequestLeaderTransfer on the intended new leader (least loaded replica)

Is this correct? Or is there a better way to do partial/full replica movement?

Transferring leadership

I notice that for NodeHost.RequestLeaderTransfer it is stated "There is no guarantee that such request can be fulfilled", and it is also not something you can wait on a channel for like other Request* operations. I'm assuming this is not guaranteed because it would not be able to select a replica that's not as caught up as the other replicas? Maybe there are other reasons too.

And so, how would you suggest leadership be transferred when a specific shard is being moved because of high load on the overall replica?

Maybe something like https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L1709 could be used to wait for leadership change, or time out and error/warn log (or retry)?

Moving back to nodes that were previously removed

Additionally, I notice that for NodeHost.RequestDeleteReplica it is stated "Once a node is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same node identity." Would the suggestion then be that node:shard relationships be dynamic? Otherwise this shard could never be moved back onto nodes that it previously lived on. A dynamic node name doesn't seem natural to me, but maybe it could be encoded like {count of how many times that shard has been on this node id} * 100_000 + {real node id}, and then the cluster is technically limited to 100k nodes? Or should a mapping of real node -> shard replica ID be tracked in a meta shard?
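The encoding floated above can be written out as a small sketch. The function names and the 100_000 multiplier constant are illustrative only, and rejecting a node ID of 0 is an assumption (that replica IDs must be non-zero), not something stated in this thread.

```go
package main

import "errors"

// maxNodes is the multiplier from the scheme above; it caps real node IDs at 99,999.
const maxNodes = 100_000

// EncodeReplicaID packs a per-node placement counter and the real node ID
// into a single replica ID: placements*maxNodes + nodeID.
// Hypothetical helper; not part of dragonboat.
func EncodeReplicaID(placements, nodeID uint64) (uint64, error) {
	// Assumption: 0 is not usable as an ID, so node IDs start at 1.
	if nodeID == 0 || nodeID >= maxNodes {
		return 0, errors.New("node ID must be in [1, 99999]")
	}
	// Guard against wrapping around uint64.
	if placements > (^uint64(0)-nodeID)/maxNodes {
		return 0, errors.New("placement counter overflows uint64")
	}
	return placements*maxNodes + nodeID, nil
}

// DecodeReplicaID reverses EncodeReplicaID.
func DecodeReplicaID(replicaID uint64) (placements, nodeID uint64) {
	return replicaID / maxNodes, replicaID % maxNodes
}
```

For example, the fourth placement (counter 3) of a shard on node 42 would get replica ID 300042, and decoding 300042 gives back counter 3 and node 42.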

danthegoodman1 commented 5 days ago

https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L2621-L2634 seems to conflict with the testing practices from the README as well

lni commented 5 days ago

  1. move raft shards

you may want to join new replicas as non-voting members first. this allows them some time to catch up with the progress of your shard without impacting the underlying consensus mechanisms. once they are up to speed (e.g. a ReadIndex read can be successfully completed on those non-voting members), you can promote them to voting members.

leadership transfer can be optional, unless you require a certain replica to be the leader
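A minimal sketch of that sequence for a single replica, not a definitive implementation. The Mover interface is a hypothetical stand-in whose method names and signatures are modeled on what dragonboat v4's NodeHost is assumed to expose (check the godoc before relying on them). The caughtUp callback is also hypothetical: it stands for whatever readiness check you run against the new replica, such as the ReadIndex read mentioned above. Passing 0 as the config change index is assumed to skip the membership ordering check.

```go
package main

import (
	"context"
	"fmt"
)

// Mover is a hypothetical interface covering the membership calls used below.
// Signatures are assumed to match dragonboat v4's NodeHost.
type Mover interface {
	SyncRequestAddNonVoting(ctx context.Context, shardID, replicaID uint64, target string, configChangeIndex uint64) error
	SyncRequestAddReplica(ctx context.Context, shardID, replicaID uint64, target string, configChangeIndex uint64) error
	SyncRequestDeleteReplica(ctx context.Context, shardID, replicaID uint64, configChangeIndex uint64) error
}

// MoveReplica replaces replica oldID with a new replica newID on target:
// join as non-voting, wait for catch-up, promote, then delete the old one.
func MoveReplica(ctx context.Context, m Mover, shardID, oldID, newID uint64,
	target string, caughtUp func(context.Context) error) error {
	if err := m.SyncRequestAddNonVoting(ctx, shardID, newID, target, 0); err != nil {
		return fmt.Errorf("add non-voting replica: %w", err)
	}
	if err := caughtUp(ctx); err != nil {
		return fmt.Errorf("wait for catch-up: %w", err)
	}
	// Promotion is assumed to be an AddReplica call with the same replica ID.
	if err := m.SyncRequestAddReplica(ctx, shardID, newID, target, 0); err != nil {
		return fmt.Errorf("promote replica: %w", err)
	}
	if err := m.SyncRequestDeleteReplica(ctx, shardID, oldID, 0); err != nil {
		return fmt.Errorf("delete old replica: %w", err)
	}
	return nil
}

// recorder is a fake Mover that only records the order of calls.
type recorder struct{ calls []string }

func (r *recorder) SyncRequestAddNonVoting(context.Context, uint64, uint64, string, uint64) error {
	r.calls = append(r.calls, "add-non-voting")
	return nil
}

func (r *recorder) SyncRequestAddReplica(context.Context, uint64, uint64, string, uint64) error {
	r.calls = append(r.calls, "promote")
	return nil
}

func (r *recorder) SyncRequestDeleteReplica(context.Context, uint64, uint64, uint64) error {
	r.calls = append(r.calls, "delete-old")
	return nil
}

// demoMove runs MoveReplica against the fake and returns the recorded steps.
func demoMove() ([]string, error) {
	r := &recorder{}
	err := MoveReplica(context.Background(), r, 1, 1, 4, "host4:63000",
		func(context.Context) error {
			r.calls = append(r.calls, "caught-up")
			return nil
		})
	return r.calls, err
}
```

Moving one replica at a time this way keeps the voting membership at its original size, which is one reason to prefer it over joining three extra voters up front.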

  2. leadership transfer

there are many exceptions that can happen during the leadership transfer, so such a request is always considered best effort. you can wait some time and then check who is the leader now. again, such a check doesn't mean too much, as the leadership can change any time after your check is done (even before you get the check result).

note that such issues are typical distributed system problems, there are simply no strong guarantees similar to those you'd expect from a single machine system.
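A hedged sketch of "request, wait, then check". LeaderAPI is a hypothetical interface; RequestLeaderTransfer and the four-value GetLeaderID are modeled on what v4's NodeHost is assumed to provide. A nil result only means the target was observed as leader at one moment, which, as noted above, may already be stale.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// LeaderAPI is a hypothetical subset of NodeHost; signatures are assumptions.
type LeaderAPI interface {
	RequestLeaderTransfer(shardID, targetReplicaID uint64) error
	GetLeaderID(shardID uint64) (leaderID, term uint64, valid bool, err error)
}

// ErrTransferNotObserved means the target was never seen as leader in time.
var ErrTransferNotObserved = errors.New("leader transfer not observed before deadline")

// TransferLeader requests a transfer and polls until the target is observed
// as leader or ctx is done. poll must be positive.
func TransferLeader(ctx context.Context, api LeaderAPI, shardID, target uint64, poll time.Duration) error {
	if err := api.RequestLeaderTransfer(shardID, target); err != nil {
		return err
	}
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		// Lookup errors are treated as "not known yet" and retried.
		if id, _, ok, err := api.GetLeaderID(shardID); err == nil && ok && id == target {
			return nil
		}
		select {
		case <-ctx.Done():
			return ErrTransferNotObserved
		case <-ticker.C:
		}
	}
}

// fakeLeader switches leader once pollsLeft reaches zero; a negative
// pollsLeft means the transfer never happens. Demo only.
type fakeLeader struct {
	leader, pending uint64
	pollsLeft       int
}

func (f *fakeLeader) RequestLeaderTransfer(_, target uint64) error {
	f.pending = target
	return nil
}

func (f *fakeLeader) GetLeaderID(uint64) (uint64, uint64, bool, error) {
	if f.pending != 0 && f.pollsLeft == 0 {
		f.leader = f.pending
	}
	if f.pollsLeft > 0 {
		f.pollsLeft--
	}
	return f.leader, 1, true, nil
}

// demoTransfer runs TransferLeader against the fake with a 50ms deadline.
func demoTransfer(pollsLeft int) error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return TransferLeader(ctx, &fakeLeader{leader: 1, pollsLeft: pollsLeft}, 1, 2, time.Millisecond)
}
```

On ErrTransferNotObserved the caller can log a warning, retry, or simply carry on, since leadership transfer is optional for the move itself.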

  3. Moving back to nodes that were previously removed

you are free to do whatever you want regarding those replica IDs, the only limitation here is not to have any duplication. with the replica ID value being uint64, a random value should be good enough assuming it is indeed reasonably random.
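One way to pick such an ID, as a sketch: draw 8 bytes from crypto/rand and retry on the unlikely collision with an ID you already know about. NewReplicaID and the used set are hypothetical names; skipping 0 assumes that 0 is not a valid replica ID.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
)

// NewReplicaID returns a random non-zero uint64 that is not already in used.
// Hypothetical helper; the caller is responsible for recording the result.
func NewReplicaID(used map[uint64]struct{}) (uint64, error) {
	var buf [8]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return 0, err
		}
		id := binary.BigEndian.Uint64(buf[:])
		// Assumption: 0 is reserved, so it is never handed out.
		if _, taken := used[id]; id != 0 && !taken {
			return id, nil
		}
	}
}
```

The used set only needs to cover IDs ever assigned to the shard in question, including replicas that have since been removed.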

kevburnsjr commented 4 days ago

  2. Transferring leadership

Rather than actively polling, you could pass a RaftEventListener in your NodeHostConfig. Each host's RaftEventListener will be automatically notified of all leadership changes on all shards for which the host has an active replica. From my testing, this hook is highly reliable.
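A rough sketch of the listener approach. LeaderInfo here is a local stand-in whose field names are assumed to mirror the v4 raftio type, and leaderListener and waitForLeader are hypothetical names. The callback hands updates to a buffered channel without blocking, on the assumption that the library expects it to return quickly.

```go
package main

import (
	"context"
	"time"
)

// LeaderInfo is a local stand-in for the library's leader update payload
// (field names are assumptions).
type LeaderInfo struct {
	ShardID, ReplicaID, Term, LeaderID uint64
}

// leaderListener forwards leader updates to a channel.
type leaderListener struct{ updates chan LeaderInfo }

// LeaderUpdated never blocks: if the buffer is full the update is dropped.
func (l *leaderListener) LeaderUpdated(info LeaderInfo) {
	select {
	case l.updates <- info:
	default:
	}
}

// waitForLeader returns nil once want is reported as leader of shardID,
// or ctx.Err() if the context ends first.
func waitForLeader(ctx context.Context, updates <-chan LeaderInfo, shardID, want uint64) error {
	for {
		select {
		case info := <-updates:
			if info.ShardID == shardID && info.LeaderID == want {
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demoListener feeds three updates by hand and waits for the wanted leader.
func demoListener(want uint64) error {
	l := &leaderListener{updates: make(chan LeaderInfo, 8)}
	l.LeaderUpdated(LeaderInfo{ShardID: 1, ReplicaID: 2, Term: 5, LeaderID: 1})
	l.LeaderUpdated(LeaderInfo{ShardID: 7, ReplicaID: 2, Term: 3, LeaderID: 2})
	l.LeaderUpdated(LeaderInfo{ShardID: 1, ReplicaID: 2, Term: 6, LeaderID: 2})
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	return waitForLeader(ctx, l.updates, 1, want)
}
```

Because a full buffer drops updates, a real consumer would want a generous buffer or a fallback poll rather than relying on seeing every event.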

  3. Moving back to nodes that were previously removed

I think there's a typo in the documentation for RequestDeleteReplica mixing v3 and v4 nomenclature that may be causing some confusion. In v4 nomenclature it should read:

"Once a replica is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same replica identity."

To re-join a host to a shard after its replica has been removed you just need to call RequestAddReplica from an active host with a new (unique) replica id and your new host as the target.

danthegoodman1 commented 4 days ago

Thanks for the details, folks! For a random replica id, I guess that would mean it now has to be persisted, since it's decoupled from a static node id and replicas need to be able to restart with the same replica id.

lni commented 3 days ago

@danthegoodman1

from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.

kevburnsjr commented 3 days ago

No need to persist the replica ID out of band. It is persisted by the host. Restarting a node/host looks like this:

  1. Instantiate the NodeHost
  2. Call GetNodeHostInfo
  3. Iterate returned ShardInfoList
  4. StartReplicas
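The loop in steps 3 and 4 might look roughly like this. Everything here is a hypothetical stand-in: ReplicaRef, Host, and both of its methods are placeholders for the host info lookup and the replica start call, not dragonboat's actual API, and the real start call also needs a state machine factory and a per-shard config.

```go
package main

import "fmt"

// ReplicaRef identifies one replica persisted on this host (hypothetical type).
type ReplicaRef struct{ ShardID, ReplicaID uint64 }

// Host is a hypothetical stand-in for the two NodeHost calls in the steps above.
type Host interface {
	PersistedReplicas() []ReplicaRef   // stand-in for reading the host info list
	StartReplica(ref ReplicaRef) error // stand-in for starting a replica with its persisted IDs
}

// RestartAll starts every replica the host reports, stopping at the first error.
func RestartAll(h Host) (started int, err error) {
	for _, ref := range h.PersistedReplicas() {
		if err := h.StartReplica(ref); err != nil {
			return started, fmt.Errorf("start shard %d replica %d: %w", ref.ShardID, ref.ReplicaID, err)
		}
		started++
	}
	return started, nil
}

// fakeHost is a demo-only Host that remembers what was started.
type fakeHost struct {
	persisted []ReplicaRef
	started   []ReplicaRef
}

func (f *fakeHost) PersistedReplicas() []ReplicaRef { return f.persisted }

func (f *fakeHost) StartReplica(ref ReplicaRef) error {
	f.started = append(f.started, ref)
	return nil
}

// demoRestart restarts two persisted replicas on the fake host.
func demoRestart() (int, []ReplicaRef, error) {
	h := &fakeHost{persisted: []ReplicaRef{{ShardID: 1, ReplicaID: 11}, {ShardID: 2, ReplicaID: 27}}}
	n, err := RestartAll(h)
	return n, h.started, err
}
```

The point of the sketch is only that the replica IDs come back from the host itself, so the restart path does not need an external ID store.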

danthegoodman1 commented 3 days ago

Oh fantastic, thanks for sharing!

> from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.

Yes, I was planning on having a "meta shard" that tracks the existence and range ownership of other groups (and makes decisions about balancing shards across the cluster, node joins, etc.)