GM - crash in calculate_activity

dcorbacho commented 8 years ago

Found while testing #944, using HA queues and autoheal (same testing as for #914).

=ERROR REPORT==== 23-Aug-2016::16:14:37 ===
** Generic server <0.7351.1> terminating
** Last message in was {'$gen_cast',
                           {'$gm',6,
                               {activity,
                                   {0,<32923.11493.0>},
                                   [{{0,<32920.9321.1>},[],[0,1,2,3]}]}}}
** When Server state == {state,
                            {1,<0.7351.1>},
                            {{0,<32923.11493.0>},#Ref<0.0.1.157223>},
                            {{0,<32923.11493.0>},#Ref<0.0.1.157224>},
                            {resource,<<"/">>,queue,<<"test_21">>},
                            rabbit_mirror_queue_slave,
                            {6,
                             [{{0,<32923.11493.0>},
                               {view_member,
                                   {0,<32923.11493.0>},
                                   [],
                                   {1,<0.7351.1>},
                                   {1,<0.7351.1>}}},
                              {{1,<0.7351.1>},
                               {view_member,
                                   {1,<0.7351.1>},
                                   [{0,<32920.9321.1>}],
                                   {0,<32923.11493.0>},
                                   {0,<32923.11493.0>}}}]},
                            0,
                            [{{0,<32923.11493.0>},{member,{[],[]},0,0}},
                             {{0,<32920.9321.1>},{member,{[],[]},3,3}},
                             {{1,<0.7351.1>},{member,{[],[]},0,0}}],
                            [<0.7350.1>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            false}
** Reason for termination ==
** {{badmatch,false},
    [{gm,last_ack,2,[{file,"src/gm.erl"},{line,1580}]},
     {gm,'-calculate_activity/5-fun-0-',7,[{file,"src/gm.erl"},{line,1422}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1347}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,865}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,645}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

** Reason for termination ==
** {{badmatch,false},
    [{gm,last_pub,2,[{file,"src/gm.erl"},{line,1584}]},
     {gm,'-calculate_activity/5-fun-0-',7,[{file,"src/gm.erl"},{line,1433}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1346}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,864}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,645}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

dcorbacho commented 8 years ago

As in #944, partial partitions allow several masters to coexist in the same cluster. When the nodes reconnect, each master exchanges messages with the existing slaves, expecting them to be newly started, but those slaves have just been synchronised by, or have received messages from, another master. As a result, the message queues get out of sync and the members' state no longer matches.
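
To illustrate the mismatch with the values from the first report: the incoming activity carries acks `[0,1,2,3]` for member `{0,<32920.9321.1>}`, while the local state already holds `{member,{[],[]},3,3}` for that member. The sketch below is a Python illustration only, not the code in `gm.erl`. It assumes that the last two fields of the member record are the last publish and last ack numbers seen, and that `last_ack` insists the newest incoming ack is strictly greater than the recorded one; both are assumptions inferred from the crash report, and the function body is hypothetical.

```python
def last_ack(acks, recorded_last_ack):
    """Return the newest ack number, insisting that it moves the record forward.

    Hypothetical stand-in for the check that seems to fail in gm:last_ack/2.
    """
    if not acks:
        # Nothing new was acknowledged, so the recorded value stands.
        return recorded_last_ack
    newest = acks[-1]
    # Assumed invariant: ack numbers only ever increase for a given member.
    if not newest > recorded_last_ack:
        raise AssertionError(
            f"newest ack {newest} does not advance recorded ack {recorded_last_ack}"
        )
    return newest


# Values taken from the first crash report.
incoming_acks = [0, 1, 2, 3]   # acks in the activity message
recorded = 3                   # last field of {member,{[],[]},3,3}

try:
    last_ack(incoming_acks, recorded)
    crashed = False
except AssertionError:
    # 3 > 3 is false, the analogue of {badmatch,false}.
    crashed = True
```

Under these assumptions the acks are replayed from 0 against a record that has already seen ack 3, which is what one would expect if the sender treats this node as a fresh slave while the node has already processed the same acks via another master.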

Avoiding the root cause requires an enhanced consensus algorithm.

michaelklishin commented 8 years ago

To make it clear, there are plans to at least evaluate Raft in a few places after the 3.7.0 release.

rabbitmq/rabbitmq-server #959