Closed dcorbacho closed 8 years ago
As in #944, partial partitions allow several masters to coexist in the same cluster. When the nodes reconnect, each master exchanges messages with existing slaves, expecting them to be newly started slaves, but those slaves have just been synchronised with, or have received messages from, another master. As a result, the message queues get out of sync and their statuses do not match.
This requires an enhanced consensus algorithm to avoid the root cause.
To be clear, there are plans to at least evaluate Raft in a few places after the 3.7.0 release.
The root cause is not changes in master/slave status, as I originally thought, but the network partition causing remote channels (those on a different node) to be removed from the queue. If a message has been delivered to a remote channel and the queue process receives a DOWN message from that channel immediately afterwards, the messages pending acknowledgment for the channel are requeued and later delivered to a different channel.
If the other node comes back shortly afterwards (within a few seconds in my tests), the queue might end up receiving two acknowledgments for the same ack tag: one from the channel considered down and one from the redelivery channel. The second ack causes the crash because the tag can no longer be found.
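The sequence above can be replayed with a toy model. This is only a sketch in Python, not the broker's actual Erlang code: `ToyQueue`, `replay`, and the channel names are hypothetical. It assumes a requeued message keeps its ack tag and that the queue does not check which channel sent an ack.

```python
from collections import deque


class ToyQueue:
    """Toy model of a queue process tracking unacknowledged deliveries.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self):
        self.ready = deque()   # (ack_tag, body) waiting to be delivered
        self.pending = {}      # ack_tag -> (channel, body) awaiting an ack
        self.next_tag = 1

    def publish(self, body):
        self.ready.append((self.next_tag, body))
        self.next_tag += 1

    def deliver(self, channel):
        tag, body = self.ready.popleft()
        self.pending[tag] = (channel, body)
        return tag

    def channel_down(self, channel):
        # Requeue everything this channel has not acknowledged yet.
        # Assumption: the message keeps the same ack tag when requeued.
        tags = sorted(t for t, (ch, _) in self.pending.items() if ch == channel)
        for tag in tags:
            _, body = self.pending.pop(tag)
            self.ready.append((tag, body))

    def ack(self, tag):
        # Raises KeyError when the tag is unknown; this stands in for the crash.
        del self.pending[tag]


def replay():
    q = ToyQueue()
    q.publish("m1")
    tag_a = q.deliver("remote_channel")   # delivered to a channel on another node
    q.channel_down("remote_channel")      # partition: queue receives DOWN, requeues
    tag_b = q.deliver("local_channel")    # redelivered to a different channel
    q.ack(tag_a)                          # node comes back, original channel acks
    try:
        q.ack(tag_b)                      # second ack for the same tag
    except KeyError:
        return "second ack failed: unknown tag"
    return "no failure"
```

In this model, `replay()` reaches the `KeyError` branch because the first ack has already removed the only pending entry for the tag. No mirrored queue is involved, which matches the observation that the problem is not specific to HA queues.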
Note that this is not exclusively related to HA queues, as it can happen without them.
Found while testing #944, using HA queues and autoheal (same testing as for #914).