rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Stop checkpoint validation when encountering a valid checkpoint #463

Closed: the-mikedavis closed this 1 month ago

the-mikedavis commented 2 months ago

@mkuratczyk noticed that with many QQs on the qq-v4 branch, each with many checkpoints, we spend a fair amount of effort reading the checkpoints during recovery. This is because ra_snapshot:find_checkpoints/1 uses the ra_snapshot:validate/1 callback to ensure that each checkpoint is valid. validate/1 is somewhat expensive in ra_log_snapshot since it fully reads and decodes the checkpoint, only to discard the result.
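For context, the cost per checkpoint looks roughly like the sketch below. This is a hypothetical illustration with made-up names, not the real ra_log_snapshot code or on-disk layout; it just mirrors the work visible in the profiles (read the whole file, verify a checksum, fully decode with binary_to_term/1, throw the result away).

```erlang
-module(checkpoint_validate_sketch).
-export([validate/1]).

%% Hypothetical sketch: the whole checkpoint file is read into memory, a
%% checksum is verified over the payload, and the payload is fully decoded,
%% even though the decoded term is discarded.
validate(File) ->
    case file:read_file(File) of
        {ok, <<Crc:32/unsigned, Payload/binary>>} ->
            case erlang:crc32(Payload) of
                Crc ->
                    %% Full decode of a potentially large checkpoint;
                    %% the result is thrown away.
                    _ = binary_to_term(Payload),
                    ok;
                _ ->
                    {error, checksum_mismatch}
            end;
        {ok, _} ->
            {error, invalid_format};
        {error, _} = Err ->
            Err
    end.
```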

Not all of this validation is necessary: we can stop validating checkpoints once we find the latest checkpoint that is valid. This is likely to be good enough. I've also updated find_checkpoints/1 to stop its search when it finds a checkpoint with a lower index than the current snapshot, since any checkpoint below the snapshot index won't be used for promotion and should be removed. For many QQs with many checkpoints each, this should save some I/O and memory.
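Roughly, the new search behaves like the sketch below (simplified, with made-up module and function names; not the actual find_checkpoints/1 code). It walks the checkpoints newest-first, does the expensive validate call only until the first valid one is found, and stops entirely once it reaches an index below the current snapshot, marking the remainder for deletion.

```erlang
-module(checkpoint_scan_sketch).
-export([find_checkpoints/3]).

%% Checkpoints is a list of {Index, Dir} pairs sorted newest-first, SnapIdx is
%% the index of the current snapshot, and Validate wraps the expensive
%% validate/1 callback. Returns {Keep, Delete}.
find_checkpoints(Checkpoints, SnapIdx, Validate) ->
    scan(Checkpoints, SnapIdx, Validate, false, [], []).

scan([], _SnapIdx, _Validate, _FoundValid, Keep, Delete) ->
    {lists:reverse(Keep), lists:reverse(Delete)};
scan([{Idx, _} | _] = Rest, SnapIdx, _Validate, _FoundValid, Keep, Delete)
  when Idx < SnapIdx ->
    %% Everything from here on is older than the snapshot and can never be
    %% promoted: stop searching and mark the remainder for deletion.
    {lists:reverse(Keep), lists:reverse(Delete, Rest)};
scan([Checkpoint | Rest], SnapIdx, Validate, true, Keep, Delete) ->
    %% A newer checkpoint already validated: assume this one is fine too and
    %% skip the expensive read-and-decode.
    scan(Rest, SnapIdx, Validate, true, [Checkpoint | Keep], Delete);
scan([{_Idx, Dir} = Checkpoint | Rest], SnapIdx, Validate, false, Keep, Delete) ->
    case Validate(Dir) of
        ok ->
            scan(Rest, SnapIdx, Validate, true, [Checkpoint | Keep], Delete);
        {error, _} ->
            %% Corrupt checkpoint: discard it and keep looking for a valid one.
            scan(Rest, SnapIdx, Validate, false, Keep, [Checkpoint | Delete])
    end.
```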

the-mikedavis commented 2 months ago

I took some rough measurements with tprof from OTP 27. The gist is that the time and memory savings look pretty good: 1.62s down to 0.28s, and ~178 million words of memory down to ~20 million, for ra_snapshot:init/6 on a QQ's checkpoint directory (from the qq-v4 branch) with 5 million messages.

Results:

Queue created with `perf-test -qq -u qq -x 1 -y 0 -C 5000000 -c 3000`

Measured with:

```erlang
tprof:profile(fun() -> ra_snapshot:init(<<"uuid">>, ra_log_snapshot, "./snapshots", "./checkpoints", undefined, 3) end, #{type => call_time}).
```

and `#{type => call_memory}` for the memory breakdowns.

This branch:

```
FUNCTION                              CALLS  TIME (μs)   PER CALL  [     %]
...
erlang:universaltime_to_localtime/1       6         69      11.50  [  0.02]
prim_file:close_nif/1                    17         91       5.35  [  0.03]
prim_file:list_dir_nif/1                  2         92      46.00  [  0.03]
prim_file:read_nif/2                     34        155       4.56  [  0.05]
file:file_name_1/2                     1037        192       0.19  [  0.07]
filename:join1/4                       1914        203       0.11  [  0.07]
prim_file:open_nif/2                     17        291      17.12  [  0.10]
erlang:crc32/1                            1      13109   13109.00  [  4.54]
prim_file:read_file_nif/1                 1      52475   52475.00  [ 18.18]
ra_log_snapshot:parse_snapshot/1          1      93138   93138.00  [ 32.26]
erlang:binary_to_term/1                  19     128266    6750.84  [ 44.43]
                                                288697             [ 100.0]
```

0.28s

```
FUNCTION                              CALLS      WORDS    PER CALL  [     %]
...
prim_file:internal_native2name/1         17       1122       66.00  [  0.01]
file:file_name_1/2                     1037       2040        1.97  [  0.01]
lists:reverse/2                          42       3726       88.71  [  0.02]
filename:join1/4                       1914       3758        1.96  [  0.02]
erlang:crc32/1                            1       7450     7450.00  [  0.04]
erlang:binary_to_term/1                  19   19871694  1045878.63  [ 99.89]
                                             19892656              [ 100.0]
```

---

main:

```
FUNCTION                              CALLS  TIME (μs)   PER CALL  [     %]
...
erlang:universaltime_to_localtime/1       6         71      11.83  [  0.00]
prim_file:list_dir_nif/1                  2         93      46.50  [  0.01]
prim_file:close_nif/1                    17        165       9.71  [  0.01]
file:file_name_1/2                     1037        168       0.16  [  0.01]
prim_file:read_nif/2                     34        190       5.59  [  0.01]
filename:join1/4                       2890        249       0.09  [  0.02]
prim_file:open_nif/2                     17        704      41.41  [  0.04]
erlang:crc32/1                           17     141498    8323.41  [  8.72]
prim_file:read_file_nif/1                17     154438    9084.59  [  9.52]
ra_log_snapshot:parse_snapshot/1         17     520271   30604.18  [ 32.08]
erlang:binary_to_term/1                  51     803040   15745.88  [ 49.51]
                                               1621975             [ 100.0]
```

```
FUNCTION                              CALLS       WORDS    PER CALL  [     %]
...
prim_file:internal_native2name/1         17        1122       66.00  [  0.00]
file:file_name_1/2                     1037        2040        1.97  [  0.00]
lists:reverse/2                          57        5552       97.40  [  0.00]
filename:join1/4                       2890        5678        1.96  [  0.00]
erlang:crc32/1                           17       66600     3917.65  [  0.04]
erlang:binary_to_term/1                  51   177721626  3484737.76  [ 99.95]
                                              177806402             [ 100.0]
```
kjnilsson commented 2 months ago

Other checkpoints we can validate during promotion and discard ones that fail.

For a quorum queue where consumers keep up with ingress, checkpoints are promoted very often. It would be nice not to have to do the validation work every time just because we optimised recovery. My thought was that once we'd found a valid checkpoint during recovery, we'd assume all prior checkpoints are also valid. That should be roughly as good as promoting any other checkpoint.

The most likely way a checkpoint would become corrupted is if the server hard-stopped during a write or fsync. Sure, there are other ways checkpoints could become corrupted, but at least we guard against the most likely one.