rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Other
798 stars 93 forks source link

ra_server: Update leaderboard after each command if cluster changed #417

Closed dumbbell closed 4 months ago

dumbbell commented 5 months ago

Why

In a RabbitMQ test with a 5-node cluster where Erlang nodes are killed and restarted in a rolling fashion, I observed that the leaderboards might be inconsistent across the cluster.

Logs from node 1:

2024-02-06 12:29:15.792058+01:00 [debug] <0.251.0> Trying to restart local Ra server for store "rabbitmq_metadata" in Ra system "coordination"
2024-02-06 12:29:15.810508+01:00 [debug] <0.296.0> RabbitMQ metadata store: ra_log:init recovered last_index_term {60,1} first index 0
2024-02-06 12:29:15.813225+01:00 [debug] <0.296.0> RabbitMQ metadata store: post_init -> recover in term: 1 machine version: 0
2024-02-06 12:29:15.813291+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovering state machine version 0:0 from index 0 to 0
2024-02-06 12:29:15.813347+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovery of state machine version 0:0 from index 0 to 0 took 0ms
2024-02-06 12:29:15.813413+01:00 [info] <0.251.0> Waiting for Khepri leader for 30000 ms, 9 retries left
2024-02-06 12:29:15.813428+01:00 [debug] <0.296.0> RabbitMQ metadata store: recover -> recovered in term: 1 machine version: 0
2024-02-06 12:29:15.813484+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovered -> follower in term: 1 machine version: 0
2024-02-06 12:29:15.813520+01:00 [debug] <0.296.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:15.818609+01:00 [info] <0.251.0> Khepri leader elected
2024-02-06 12:29:15.820842+01:00 [debug] <0.296.0> RabbitMQ metadata store: leader call - leader not known. Command will be forwarded once leader is known.
2024-02-06 12:29:16.176876+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote election called for in term 1
2024-02-06 12:29:16.182866+01:00 [debug] <0.296.0> RabbitMQ metadata store: follower -> pre_vote in term: 1 machine version: 0
2024-02-06 12:29:16.183051+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote granted #Ref<0.2919399944.1500250113.30962> for term 1 votes 1
2024-02-06 12:29:16.183247+01:00 [debug] <0.296.0> RabbitMQ metadata store: election called for in term 2
2024-02-06 12:29:16.186666+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote -> candidate in term: 2 machine version: 0
2024-02-06 12:29:16.186815+01:00 [debug] <0.296.0> RabbitMQ metadata store: vote granted for term 2 votes 1
2024-02-06 12:29:16.186996+01:00 [notice] <0.296.0> RabbitMQ metadata store: candidate -> leader in term: 2 machine version: 0
2024-02-06 12:29:16.187220+01:00 [alert] <0.296.0> RA LEADER RECORD: ClusterName=rabbitmq_metadata Leader={rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} Members=[{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'}]
2024-02-06 12:29:16.190313+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.190648+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.190878+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.191393+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.200058+01:00 [debug] <0.296.0> RabbitMQ metadata store: enabling ra cluster changes in 2, index 61
2024-02-06 12:29:16.302892+01:00 [debug] <0.251.0> Khepri-based RabbitMQ metadata store ready

Logs from node 2:

2024-02-06 12:29:18.714368+01:00 [debug] <0.260.0> Trying to restart local Ra server for store "rabbitmq_metadata" in Ra system "coordination"
2024-02-06 12:29:18.732537+01:00 [debug] <0.310.0> RabbitMQ metadata store: ra_log:init recovered last_index_term {74,2} first index 0
2024-02-06 12:29:18.735899+01:00 [debug] <0.310.0> RabbitMQ metadata store: post_init -> recover in term: 2 machine version: 0
2024-02-06 12:29:18.735944+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovering state machine version 0:0 from index 0 to 60
2024-02-06 12:29:18.736488+01:00 [info] <0.260.0> Waiting for Khepri leader for 30000 ms, 9 retries left
2024-02-06 12:29:18.741696+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741784+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741852+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741969+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.743421+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovery of state machine version 0:0 from index 0 to 60 took 8ms
2024-02-06 12:29:18.743487+01:00 [debug] <0.310.0> RabbitMQ metadata store: recover -> recovered in term: 2 machine version: 0
2024-02-06 12:29:18.748414+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovered -> follower in term: 2 machine version: 0
2024-02-06 12:29:18.748445+01:00 [debug] <0.310.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:18.748591+01:00 [info] <0.260.0> Khepri leader elected
2024-02-06 12:29:18.748642+01:00 [debug] <0.310.0> RabbitMQ metadata store: leader call - leader not known. Command will be forwarded once leader is known.
2024-02-06 12:29:19.036638+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote election called for in term 2
2024-02-06 12:29:19.040483+01:00 [debug] <0.310.0> RabbitMQ metadata store: follower -> pre_vote in term: 2 machine version: 0
2024-02-06 12:29:19.040652+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote granted #Ref<0.772528231.2305294338.197063> for term 2 votes 1
2024-02-06 12:29:19.043116+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote -> follower in term: 2 machine version: 0
2024-02-06 12:29:19.043212+01:00 [debug] <0.310.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:19.044068+01:00 [debug] <0.310.0> RabbitMQ metadata store: enabling ra cluster changes in 2, index 61
2024-02-06 12:29:19.061918+01:00 [alert] <0.310.0> RA LEADER RECORD: ClusterName=rabbitmq_metadata Leader={rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} Members=[{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:19.062191+01:00 [info] <0.310.0> RabbitMQ metadata store: detected a new leader {rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} in term 2
2024-02-06 12:29:19.062378+01:00 [debug] <0.310.0> RabbitMQ metadata store: Leader node 'rmq-ct-rolling_kill_restart-1-22440@localhost' may be down, setting pre-vote timeout
2024-02-06 12:29:19.151724+01:00 [debug] <0.260.0> Khepri-based RabbitMQ metadata store ready

Logs on nodes 3, 4 and 5 are similar to node 2.

Note that messages starting with "RA LEADER RECORD" were added to ra_leaderboard:record/3 during debugging and are not committed to Ra.

We can see that node 1 is elected the leader on recovery before the ra_cluster_change commands are applied. It stays the leader afterwards but never update its leaderboard.

How

Instead of emitting an update_leaderboard effect after a cluster change command was appended or after some specific events, we check if the cluster changed after a command was applied and update the leaderboard if it did.

This check is performed on the leader and all followers. It is also performed once a Ra server finished recovery.

V2: This revision changes the approach. The first iteration updated the leaderboard after a ra_cluster_change command was applied in addition of the existing updates.

dumbbell commented 5 months ago

FTR, the testcase in RabbitMQ that is failing because of this:

gmake -C deps/rabbit ct-clustering_recovery t=cluster_size_5:rolling_kill_restart

It takes a few runs locally to hit the problem.