Closed SimonUnge closed 1 year ago
Example implementation:
handle_status(leader, {ClusterName, _} = _Leader, Cluster, _State, Node, nodeup) ->
    %% Figure out if we should add or do nothing
    Conf = make_conf(ClusterName, {ClusterName, Node}, Cluster),
    [{add_member, Conf, {ClusterName, Node}, Cluster}];
handle_status(leader, {ClusterName, _} = _Leader, Cluster, _State, Node, nodedown) ->
    %% Figure out if we should remove or do nothing
    [{remove_member, {ClusterName, Node}, Cluster}];
handle_status(leader, _Leader, _Cluster, _State, _ServerId, {What, _Result})
  when What == add_member_result;
       What == remove_member_result ->
    %% do something with the result...
    [].
OK, so I have a few thoughts.
I think the ra_machine API should be specific to membership evaluation, so something like ra_machine:eval_members or similar. I think it should be called by a timer which by default has a very long interval, say 1hr+. Then we use the nodeup/nodedown handler to shorten the timer interval to some randomised value (say, less than 1 minute) to ensure more timely handling of membership changes when nodes join / leave.
We need to ensure that we don't trigger concurrent membership change tasks (which I believe is possible in the current PR), so we need to monitor the process that is spawned and not evaluate members whilst the task is running. Once the task finishes we can reissue the short timer until the member evaluation returns no changes, then we'll set the long timer again.
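To make the timer idea concrete, here is a rough sketch of the state transitions as a pure function that returns effects. This is not Ra code: the module name, the event shapes ({node_event, Node}, {eval_result, Changes}, {task_started, MRef}), the effect names (set_timer, eval_members, spawn_monitored_task) and both interval values are made up for illustration. Only the 'DOWN' tuple follows the standard Erlang monitor message shape.

```erlang
-module(eval_members_timer_sketch).
-export([init/0, handle/2]).

%% Hypothetical intervals: "1hr+" by default, "less than 1 minute" after a node event.
-define(LONG_INTERVAL_MS, 3600000).
-define(SHORT_MAX_MS, 60000).

%% State is a map; task holds the monitor reference of a running
%% membership change task, or undefined when none is running.
init() ->
    {#{task => undefined}, [{set_timer, ?LONG_INTERVAL_MS}]}.

%% A node joined or left: shorten the timer to a randomised value.
handle({node_event, _Node}, State) ->
    {State, [{set_timer, short_interval()}]};
%% Timer fired and no task is running: ask for a membership evaluation.
handle(timeout, #{task := undefined} = State) ->
    {State, [eval_members]};
%% Timer fired while a task is running: do not evaluate.
handle(timeout, State) ->
    {State, []};
%% Evaluation found nothing to do: fall back to the long timer.
handle({eval_result, []}, State) ->
    {State, [{set_timer, ?LONG_INTERVAL_MS}]};
%% Evaluation found changes: run them in a monitored process.
handle({eval_result, Changes}, #{task := undefined} = State) when is_list(Changes) ->
    {State, [{spawn_monitored_task, Changes}]};
%% The caller spawned the task and reports its monitor reference.
handle({task_started, MRef}, State) when is_reference(MRef) ->
    {State#{task := MRef}, []};
%% The monitored task finished: clear it and re-arm the short timer.
handle({'DOWN', MRef, process, _Pid, _Reason}, #{task := MRef} = State) ->
    {State#{task := undefined}, [{set_timer, short_interval()}]}.

%% Random interval in 1..59999 ms, i.e. strictly less than one minute.
short_interval() ->
    rand:uniform(?SHORT_MAX_MS - 1).
```

The point of keeping the task reference in the state is that a timeout arriving while a task is in flight becomes a no-op, so at most one membership change task runs at a time. The short timer keeps getting reissued after each task until an evaluation comes back empty.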
We may not want to call ra_machine:eval_members inside the Ra leader process because the code for discovering up nodes in RabbitMQ may be blocking and/or a bit slow. I think for now we can leave it in process.
Very much a DRAFT. @kjnilsson