rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Incorrect index version pairs can lead to nodes taking unrecoverable snapshots #273

Closed tomyouyou closed 2 years ago

tomyouyou commented 2 years ago

To reproduce the issue:

  1. Build a 3-node cluster with rabbitmq-server-3.9.14-1.el7.noarch.rpm

  2. Create a quorum queue 'sq12'; its machine version is v1.

  3. Upgrade one of the nodes with rabbitmq-server-3.10.0-1.el8.noarch.rpm; call this node node-new-ver. The leader of 'sq12' is still an old-version node.
    A segment file of 'sq12' has been created on node-new-ver.

  4. Publish a message to 'sq12'. Create a consumer to receive and ack the message. This causes node-new-ver to take a snapshot because of the segment file. The machine state in the snapshot is v1, but the 'machine_version' in the snapshot meta is v2 instead of v1. Recovering from this invalid snapshot will therefore cause an exception.

  5. Restart the new version node. When the queue is recovered from the snapshot, an exception occurs:

2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** State machine '%2F_sq12' terminating
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** Last event = {internal,init_state}
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** When server state  = {recover,"ra_server_proc:format_status/2 crashed"}
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** Reason for termination = error:function_clause
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** Callback modules = [ra_server_proc]
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** Callback mode = [state_functions,state_enter]
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> ** Stacktrace =
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0> **  [{rabbit_fifo,state_enter,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>          [recover,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>           {rabbit_fifo,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {cfg,'%2F_sq12',
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   {resource,<<"/">>,queue,<<"sq12">>},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   {2048,2048},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   {rabbit_quorum_queue,dead_letter_publish,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                       [undefined,undefined,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                        {resource,<<"/">>,queue,<<"sq12">>}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   {rabbit_quorum_queue,become_leader,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                       [{resource,<<"/">>,queue,<<"sq12">>}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   drop_head,undefined,undefined,competing,undefined,undefined,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                   undefined,undefined,undefined,undefined},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {0,[],[]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               2,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {0,[],[]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               0,#{},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {rabbit_fifo_index,#{},undefined,undefined},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {0,[],[]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               #{{<<"msg_test_ctag_0">>,<0.837.0>} =>
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                     {consumer,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                         #{ack => true,args => [],prefetch => 0,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                           username => <<"guest">>},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>                         #{},1,2000,1,simple_prefetch,auto,up,0}},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {queue,[{<<"msg_test_ctag_0">>,<0.837.0>}],[],1},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               {0,[],0,[]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>               0,0,[],0,0,1651742717036,undefined,undefined}],
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>          [{file,"rabbit_fifo.erl"},{line,810}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {ra_machine,state_enter,3,[{file,"src/ra_machine.erl"},{line,284}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {ra_server,handle_state_enter,2,[{file,"src/ra_server.erl"},{line,1336}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {ra_server_proc,handle_enter,3,
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>          [{file,"src/ra_server_proc.erl"},{line,960}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {ra_server_proc,recover,3,[{file,"src/ra_server_proc.erl"},{line,320}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1203}]},
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
2022-05-05 17:35:34.588547+08:00 [erro] <0.507.0>
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>   crasher:
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>     initial call: ra_server_proc:init/1
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>     pid: <0.507.0>
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>     registered_name: '%2F_sq12'
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>     exception error: no function clause matching
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                      rabbit_fifo:state_enter(recover,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                              {rabbit_fifo,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {cfg,'%2F_sq12',
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                {resource,<<"/">>,queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                 <<"sq12">>},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                {2048,2048},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                {rabbit_quorum_queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                 dead_letter_publish,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                 [undefined,undefined,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                  {resource,<<"/">>,queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                   <<"sq12">>}]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                {rabbit_quorum_queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                 become_leader,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                 [{resource,<<"/">>,queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                   <<"sq12">>}]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                drop_head,undefined,undefined,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                competing,undefined,undefined,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                undefined,undefined,undefined,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                undefined},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {0,[],[]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               2,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {0,[],[]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               0,#{},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {rabbit_fifo_index,#{},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                undefined,undefined},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {0,[],[]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               #{{<<"msg_test_ctag_0">>,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                  <0.837.0>} =>
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                  {consumer,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                   #{ack => true,args => [],
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                     prefetch => 0,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                     username => <<"guest">>},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                   #{},1,2000,1,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                   simple_prefetch,auto,up,0}},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {queue,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                [{<<"msg_test_ctag_0">>,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                  <0.837.0>}],
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                                [],1},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               {0,[],0,[]},
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               0,0,[],0,0,1651742717036,
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>                                               undefined,undefined}) (rabbit_fifo.erl, line 810)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>       in function  ra_machine:state_enter/3 (src/ra_machine.erl, line 284)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>       in call from ra_server:handle_state_enter/2 (src/ra_server.erl, line 1336)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>       in call from ra_server_proc:handle_enter/3 (src/ra_server_proc.erl, line 960)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>       in call from ra_server_proc:recover/3 (src/ra_server_proc.erl, line 320)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>       in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1203)
2022-05-05 17:35:34.588980+08:00 [erro] <0.507.0>     ancestors: [<0.506.0>,ra_server_sup_sup,<0.307.0>,ra_systems_sup,ra_sup,
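
The mismatch described in step 4 can be illustrated with a small model. This is only a sketch under stated assumptions, not Ra's actual code: it assumes the snapshot meta version is looked up from a list of (index, version) pairs, and all names (`StateV1`, `StateV2`, `version_at`, `take_snapshot`, `recover`) are hypothetical. It shows how one incorrect pair makes a node label a v1 state as v2, so that recovery fails in a way analogous to the `function_clause` error above.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


# Hypothetical stand-ins for two state shapes; not the real rabbit_fifo records.
@dataclass
class StateV1:
    messages: list = field(default_factory=list)


@dataclass
class StateV2:
    messages: list = field(default_factory=list)
    dlx: dict = field(default_factory=dict)


def version_at(pairs, index):
    """Return the machine version effective at `index`.

    `pairs` is a list of (index, version) tuples sorted by index, each meaning
    "from this index onward the machine runs this version".
    """
    indexes = [i for i, _ in pairs]
    pos = bisect_right(indexes, index)
    if pos == 0:
        raise ValueError(f"no machine version known at index {index}")
    return pairs[pos - 1][1]


def take_snapshot(pairs, index, state):
    # The meta version comes from the pairs, not from the shape of `state`.
    return {
        "index": index,
        "machine_version": version_at(pairs, index),
        "state": state,
    }


def recover(snapshot):
    # Dispatch on the meta version, as a versioned state machine would.
    expected = {1: StateV1, 2: StateV2}[snapshot["machine_version"]]
    state = snapshot["state"]
    if not isinstance(state, expected):
        raise TypeError(
            f"snapshot meta says v{snapshot['machine_version']} "
            f"but the state is {type(state).__name__}"
        )
    return state


if __name__ == "__main__":
    v1_state = StateV1()

    # Correct pairs: the cluster is still effectively on v1.
    recover(take_snapshot([(0, 1)], 10, v1_state))

    # Incorrect pairs: the upgraded node assumes its own latest version.
    try:
        recover(take_snapshot([(0, 2)], 10, v1_state))
    except TypeError as err:
        print("recovery failed:", err)
```

In this model the snapshot itself is written without any error; the problem only surfaces on restart, which matches the order of events in steps 4 and 5.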

michaelklishin commented 2 years ago

Related to #256, #268.

michaelklishin commented 2 years ago

@tomyouyou this does not add any new tests. Do you think it would be realistic to add some?

michaelklishin commented 2 years ago

@kjnilsson should be back next week to review this.

michaelklishin commented 2 years ago

Ignore the OCI image publishing failure. This repo hasn't been updated to attempt publishing only when Actions has access to the credentials used (which depends on who submits the PR).

kjnilsson commented 2 years ago

Ok I've reviewed this change and I think it is good. Thank you @tomyouyou

Writing a test for it is possible but convoluted. We use meck to fake updated module versions, and AFAIK meck is node-global, so we'd have to use peer (slave) nodes to test a scenario where a snapshot for a lower version is taken by a member with a higher version and that member is then restarted.