Crash on tree handling in rabbit_variable_queue

dcorbacho commented 8 years ago

Found while testing #944, using HA queues and autoheal (same testing as for #914).

=ERROR REPORT==== 23-Aug-2016::16:05:46 ===
** Generic server <0.32209.0> terminating
** Last message in was {'$gen_cast',{ack,"<E9>",<32920.259.1>}}
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"test_6">>},
                          false,false,none,[],<0.32209.0>,
                          [<32920.60.1>],
                          [<32920.60.1>],
                          ['rabbit@ubuntu-c2'],
                          [{vhost,<<"/">>},
                           {name,<<"ha-all">>},
                           {pattern,<<".*">>},
                           {'apply-to',<<"all">>},
                           {definition,
                            [{<<"ha-mode">>,<<"all">>},
                             {<<"ha-sync-mode">>,<<"automatic">>}]},
                           {priority,0}],
                          [{<32920.61.1>,<32920.60.1>},
                           {<0.32213.0>,<0.32209.0>}],
                          [],live},
                         none,true,rabbit_mirror_queue_master,
                         {state,
                          {resource,<<"/">>,queue,<<"test_6">>},
                          <0.32213.0>,<0.32212.0>,rabbit_priority_queue,
                          {passthrough,rabbit_variable_queue,
                           {vqstate,
                            {0,{[],[]}},
                            {0,{[],[]}},
                            {delta,undefined,0,undefined},
                            {0,{[],[]}},
                            {0,{[],[]}},
                            1030,
                            {0,nil},
                            {0,nil},
                            {0,nil},
                            {qistate,
                             "/tmp/rabbitmq-test-instances/rabbit/mnesia/rabbit/queues/9H1TFV3HQW3DAMFDY3SQ4VVIN",
                             {{dict,0,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                []},
                               {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                 []}}},
                              []},
                             undefined,0,32768,
                             #Fun<rabbit_variable_queue.2.97317634>,
....
** Reason for termination ==
** {function_clause,
       [{gb_trees,delete_1,[233,nil],[{file,"gb_trees.erl"},{line,407}]},
        {gb_trees,delete,2,[{file,"gb_trees.erl"},{line,403}]},
        {rabbit_variable_queue,remove_pending_ack,3,
            [{file,"src/rabbit_variable_queue.erl"},{line,1993}]},
        {rabbit_variable_queue,remove_pending_ack,3,
            [{file,"src/rabbit_variable_queue.erl"},{line,1980}]},
        {rabbit_variable_queue,ack,2,
            [{file,"src/rabbit_variable_queue.erl"},{line,643}]},
        {rabbit_priority_queue,ack,2,
            [{file,"src/rabbit_priority_queue.erl"},{line,316}]},
        {rabbit_mirror_queue_master,ack,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,378}]},
        {rabbit_amqqueue_process,'-ack/3-fun-0-',2,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,682}]}]}

dcorbacho commented 8 years ago

As in #944, partial partitions cause several masters to coexist in the same cluster. When the nodes are reconnected, the master exchanges messages with existing slaves, expecting them to be newly started slaves, but those have just been synchronised or have received messages from another master. Thus, message queues get out of sync and their state does not match.

Avoiding the root cause would require an enhanced consensus algorithm.

michaelklishin commented 8 years ago

To make it clear, there are plans to at least evaluate Raft in a few places after the 3.7.0 release.

dcorbacho commented 8 years ago

The root cause is not a change in master/slave status, as I originally thought, but the network partition causing remote channels (on a different node) to be removed from the queue. If a message has been delivered to a remote channel and the queue process receives a DOWN message from that channel immediately afterwards, the messages pending acknowledgment for that channel are requeued and later delivered to a different channel.

If the other node comes back shortly afterwards (within a few seconds in my tests), the queue might end up receiving two acknowledgments for the same ack tag: one from the channel considered down and one from the redelivery channel. The second ack causes the crash because the tag can no longer be found.
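
To make the sequence concrete, here is a minimal Python sketch of the double acknowledgment. It is a toy model, not RabbitMQ code: the `ToyQueue` class, its method names, and the channel names "remote" and "local" are all hypothetical, and a plain dict stands in for the pending-ack tree. The tag 233 is the key seen in the `gb_trees:delete_1` frame of the trace above. The strict `del` plays the role of `gb_trees:delete/2`, which assumes the key is present and fails otherwise.

```python
class ToyQueue:
    """Toy model of pending-ack bookkeeping (hypothetical, not RabbitMQ code)."""

    def __init__(self):
        self.ready = []    # ack tags of messages waiting to be delivered
        self.pending = {}  # ack tag -> channel the message was delivered to

    def publish(self, tag):
        self.ready.append(tag)

    def deliver(self, channel):
        # Hand the oldest ready message to a channel and track it as unacked.
        tag = self.ready.pop(0)
        self.pending[tag] = channel
        return tag

    def channel_down(self, channel):
        # Requeue every message the channel has not acknowledged yet.
        for tag in [t for t, ch in self.pending.items() if ch == channel]:
            del self.pending[tag]
            self.ready.append(tag)

    def ack(self, tag):
        # Strict removal: raises KeyError if the tag is unknown, in the same
        # spirit as gb_trees:delete/2 failing on a missing key.
        del self.pending[tag]


def replay():
    """Replay the scenario from the comment above and record what happens."""
    events = []
    q = ToyQueue()
    q.publish(233)
    tag = q.deliver("remote")        # delivered to a channel on another node
    q.channel_down("remote")         # partition: DOWN received, message requeued
    redelivered = q.deliver("local") # same message, same ack tag
    assert redelivered == tag
    q.ack(redelivered)               # ack from the redelivery channel
    events.append("first ack ok")
    try:
        q.ack(tag)                   # late ack from the channel considered down
    except KeyError:
        events.append("second ack failed: unknown tag")
    return events


if __name__ == "__main__":
    print(replay())
```

Swapping the order of the two acks gives the same outcome: whichever arrives second finds no entry for the tag. A tolerant removal (for example `self.pending.pop(tag, None)`) would ignore the stale ack in this toy model; whether that is the right fix for the real queue is not something this thread states.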

dcorbacho commented 8 years ago

Note that this is not exclusively related to HA queues, as it can happen without them.

rabbitmq / rabbitmq-server #960