rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Other
798 stars 93 forks source link

Sync name.dets to disk #443

Closed visciang closed 1 month ago

visciang commented 1 month ago

Proposed Changes

I have seen names.dets being left not in sync with the system state (server directory name).

The operations that trigger the problem are:

  1. start the app
  2. create a raft server X (UID-123)
  3. leave_and_delete the server X
  4. re create the raft server X (UID-456)
  5. immediately stop the app (<<<=== before the DETS auto_sync the new registered name to the disk - 500ms)

So, in other words, we are "resetting" an existing ra server (3+4).

What I've seen is that names.dets refer to the ra server identifier created at step 2 - UID-123 (the first instance of the server X). When I restart the application, ra crashes while strarting with error enoent since the ra server directory UID-123 does not exist anymore (deleted by step 3).

Here a view of what I see - names.dets refer to the server "ELIXIRXR9LYKF78Z3R" (it's the one created on step 2) while on disk I see the server "ELIXIRHP06E8L3EYLT" (created on step 4).

image

I've been able to "fix" the problem introducing a dets:sync call after each change operation in ra_directory module, following what is already in place in the ra_log_metadata module.

@kjnilsson I have not jet created an ISSUE to link to this PR, would like to get a feedback from you first. Even if the change is small, and I think should not break existing stuff and lower down the performance, maybe there is a better architectural approach you can come to - since given the non atomicity of writing to a dets can always lead to a dets that has not been updated (ie erlang app killed after a dets:insert but before the dets:sync). Let me know if you need more information.

Types of Changes

Checklist

pivotal-cla commented 1 month ago

@visciang Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

pivotal-cla commented 1 month ago

@visciang Thank you for signing the Contributor License Agreement!

michaelklishin commented 1 month ago

@visciang it's fine not to file an issue first if you explain the reasoning behind the change and various design considerations (if any) in the PR. I think we have enough information here.

Thank you for taking the time to contribute!

michaelklishin commented 1 month ago

FTR, the functions affected should not be used on the hot code path, so this PR is very unlikely to introduce a throughput hit with reasonable workloads:

kjnilsson commented 1 month ago

This is likely to affect definition import performance in RabbitMQ.

Karl Nilsson

On Sun, 19 May 2024 at 05:06, Michael Klishin @.***> wrote:

@.**** approved this pull request.

— Reply to this email directly, view it on GitHub https://github.com/rabbitmq/ra/pull/443#pullrequestreview-2064987335, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJAHFD7AWX7ACNZT4VL3KTZDAQODAVCNFSM6AAAAABH5N2BM2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRUHE4DOMZTGU . You are receiving this because you were mentioned.Message ID: @.***>

kjnilsson commented 1 month ago

Cheers