Open danthegoodman1 opened 5 days ago
https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L2621-L2634 seems to conflict with the testing practices from the README as well
you may want to join new replicas as non-voting member first. this allow them some time to catch up with the progress of your shard without impacting the the underlying consensus mechanisms. once they are up to speed (e.g. a ReadIndex read can be successfully completed on those non-voting members), you can promote them to voting members.
leadership transfer can be optional, unless you require a certain replica to be the leader
there are many exceptions that can happen during the leadership transfer, such request is thus always considered as following a best effort approach. you can wait some time and then check who is the leader now. again, such check doesn't mean too much, as the leadership can just change any time after your check is done (before you get the check result).
note that such issues are typical distributed system problems, there is simply no such strong guarantees similar to those you'd be expecting from a single machine system.
you are free to do whatever you want regarding those replica IDs, the only limitation here is not to have any duplication. with the replica ID value being uint64, a random value should be good enough assuming it is indeed reasonably random.
- Transferring leadership
Rather than actively polling, you could pass a RaftEventListener in your NodeHostConfig. Each host's RaftEventListener will be automatically notified of all leadership changes on all shards for which the host has an active replica. From my testing, this hook is highly reliable.
- Moving back to nodes that were previously removed
I think there's a typo in the documentation for RequestDeleteReplica mixing v3 and v4 nomenclature that may be causing some confusion. In v4 nomenclature it should read:
"Once a replica is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same replica identity."
To re-join a host to a shard after its replica has been removed you just need to call RequestAddReplica from an active host with a new (unique) replica id and your new host as the target.
Thanks for the details folks! For a random replica id I guess that would mean it now has to be persisted since it’s decoupled from a static node id and they need to be able to restart with the same replica id
@danthegoodman1
from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.
No need to persist the replica ID out of band. It is persisted by the host. Restarting a node/host looks like this:
Oh fantastic, thanks for sharing!
from experience, quite often a production system will have a separate devops sub-system to keep monitoring those involved nodes and start repairing shards when any replica is dead. by doing that, it will have to somehow remember all replica IDs anyway.
Yes I was planning on having a "meta shard" that tracks the existence and range ownership of other groups (and makes decisions for balancing shards among the cluster, node join, etc.)
I'm playing with using multi-raft in a range-partitioned DB, and one thing I am working on in particular is being able to move raft shards around the cluster to balance load. I have a couple questions, and looking for any general advice on the topic, specifically for:
Moving raft shards
My assumption for moving raft shards is to:
NodeHost.RequestLeaderTransfer
on the intended new leader (least loaded replica)Is this correct? Or is there a better way to do partial/full replica movement?
Transferring leadership
I notice that for NodeHost.RequestLeaderTransfer it is stated
There is no guarantee that such request can be fulfilled
, and is not also something you can wait a channel for like other Request* operations. I'm assuming this is not guaranteed because it would not be able to select the replica that's not as caught up as other replicas? Maybe some other reasons.And so, how would you suggest that Transferring leadership is done when a specific shard is being moved because of high load for the overall replica?
Maybe something like https://github.com/lni/dragonboat/blob/master/nodehost_test.go#L1709 could be used to wait for leadership change, or time out and error/warn log (or retry)?
Moving back to nodes that were previously removed
Additionally, I notice that for NodeHost.RequestDeleteReplica it is stated
Once a node is successfully deleted from a Raft shard, it will not be allowed to be added back to the shard with the same node identity.
. Would the suggestion then be that node:shard relationships be dynamic? Otherwise this shard could never be moved back on to nodes that it previously lived on. I don't think a dynamic node name seems natural, but maybe it could be encoded like{count of how many times that shard has been on this node id} * 100_000 + {real node id}
, and then the cluster is technically limited to 100k nodes? Or should a mapping of real node -> shard replica ID be tracked in a meta shard?