rabbitmq / rabbitmq-server


Quorum queue replica that was shut down can rejoin after a restart and queue deletion, re-declaration #12366

Open luos opened 1 month ago

luos commented 1 month ago

Describe the bug

Hi,

The issue below involves deleting and recreating a queue while a node is down, so most users will not be affected by it.

We've identified an issue with quorum queues where an out-of-date replica can come back as the leader again and resend past log entries, causing the (now follower) replica to reapply local effects, so the new consumer receives messages that were already processed.

This leads to duplicate message delivery, even though the messages were acknowledged and the queue processed the acks. The log is effectively replayed in its entirety, so messages processed days ago can reappear.

The effect of it is similar to https://github.com/rabbitmq/ra/issues/387.

This issue causes the queue to actually become broken in some scenarios, but that is expected due to the bad internal state.

We know that the proper solution is to not delete the queue, but ra should probably also have some built-in protection so that out-of-date members cannot rejoin the cluster, or at least cannot become leaders.

I think a potential solution would be to include a cluster ID in the pre_vote and request_vote_rpc messages. From my review, there is currently no shared cluster ID for ra clusters; there is a uid, but that is per server, not per cluster.
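To illustrate the idea, here is a rough sketch of the proposed check in Python pseudocode (this is not ra's actual Erlang implementation; the record fields and function names are assumptions):

```python
# Sketch of the proposal: vote requests carry a shared cluster_id, and a
# candidate from a different incarnation of the cluster is refused before
# it can win an election. Field and function names are illustrative only.
from dataclasses import dataclass

@dataclass
class RequestVote:
    term: int
    candidate_id: str
    cluster_id: str      # proposed new field, shared by all members of one queue incarnation
    last_log_index: int
    last_log_term: int

@dataclass
class MemberState:
    cluster_id: str
    current_term: int

def handle_request_vote(state: MemberState, rpc: RequestVote) -> bool:
    # Refuse the vote outright if the candidate belongs to a different
    # cluster incarnation, e.g. a stale replica of a deleted queue.
    if rpc.cluster_id != state.cluster_id:
        return False
    # ...the normal Raft term and log up-to-date checks would follow here...
    return rpc.term >= state.current_term
```

The same check would presumably apply to pre_vote, so a stale member could not even trigger an election.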

Reproduction steps

  1. Use a three-node cluster.
  2. Connect a client to "rmq1"
  3. Create a quorum queue named "test" on "rmq1"
  4. Create a consumer on "rmq1" for queue "test"
  5. Publish a single message with a unique identifier (e.g. the current time)
  6. Acknowledge the message on the consumer
  7. Shut down "rmq1"; the client is disconnected
  8. The client reconnects to one of the remaining nodes ("rmq2")
  9. The client deletes and re-declares the queue
  10. Create a consumer for queue "test"
  11. Restart the down node "rmq1"
  12. The queue starts up on "rmq1", which sees that it is more up to date than the newly created replica on "rmq2"; "rmq1" becomes the leader and the other nodes revert to followers
  13. "rmq1" notices that the (newly created) followers are missing some log indexes, so it:
    1. resends the append_entries for these log items
    2. logs "setting last index to 3, next_index 4 for…"
  14. The follower receives the entries and applies them on top of the bad (freshly created) initial state, so it delivers the old message to the current local consumer (a client-side sketch of steps 2-10 follows this list).
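For reference, a minimal client-side sketch of steps 2-10 above using the Python pika client. The host names rmq1/rmq2 and the quorum-queue arguments come from the steps; the connection details and the wait before fetching the message are assumptions, and the node shutdown/restart in steps 7 and 11 happen out of band (e.g. via rabbitmqctl).

```python
# Client-side sketch of the reproduction steps (steps 2-10); the node
# shutdown and restart (steps 7 and 11) are done out of band.
import time
import pika

def open_channel(host):
    # Connect to a single node of the three-node cluster.
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    return conn, conn.channel()

# Steps 2-6: declare the quorum queue on rmq1, publish one marked message, ack it.
conn1, ch1 = open_channel("rmq1")
ch1.queue_declare(queue="test", durable=True,
                  arguments={"x-queue-type": "quorum"})
ch1.basic_publish(exchange="", routing_key="test",
                  body=f"marker-{time.time()}")
time.sleep(1)  # give the quorum queue a moment to commit the publish
method, _props, body = ch1.basic_get(queue="test")
ch1.basic_ack(delivery_tag=method.delivery_tag)
print("consumed and acked:", body)

# Step 7: shut down rmq1 out of band; the connection above is lost.
input("Shut down rmq1 now (step 7), then press Enter to continue...")

# Steps 8-10: reconnect to rmq2, delete and re-declare the queue, consume again.
conn2, ch2 = open_channel("rmq2")
ch2.queue_delete(queue="test")
ch2.queue_declare(queue="test", durable=True,
                  arguments={"x-queue-type": "quorum"})

def on_message(ch, method, properties, body):
    # Step 14: after rmq1 is restarted (step 11) while this consumer runs,
    # the already-acked marker message is delivered here again.
    print("received:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

ch2.basic_consume(queue="test", on_message_callback=on_message)
ch2.start_consuming()
```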

Expected behavior

One or all of the following: :-)

Additional context

I can share some traces or debug output, but I'm not sure they would make sense without context.

Attached is the "restart sequence"; nothing special.

restart.sh.txt

kjnilsson commented 1 month ago

You'd have to start recording each member's assigned "UId" in the queue record and base the recovery of the member on whether the current UId for the given cluster name matches or not.
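A rough sketch of that recovery gate (the names, queue record layout, and local UId lookup below are assumptions for illustration, not RabbitMQ internals):

```python
# Hypothetical recovery check: the queue record remembers the UId assigned to
# each member, and a member is only recovered on node start-up if the UId it
# has locally still matches. Names here are illustrative, not RabbitMQ's.
def should_recover_member(queue_record_uids: dict, node: str, local_uid: str) -> bool:
    # queue_record_uids maps node name -> UId assigned when the member was created.
    expected_uid = queue_record_uids.get(node)
    if expected_uid is None:
        # This node is not a member of the current (re-declared) queue at all.
        return False
    # A stale replica left over from a previous incarnation carries an old UId.
    return expected_uid == local_uid
```

A member failing this check would presumably be cleaned up locally rather than rejoining and forcing an election.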

kjnilsson commented 1 month ago

Even so, you could reproduce a similar issue by partitioning a node, deleting and re-creating the queue on the majority side, then re-joining the partitioned node.

luos commented 1 month ago

I see; in the case you are proposing, it's more RabbitMQ's responsibility to recover (or not recover) the member if the uid changed.

I think it would be a bit more resilient if this were included in ra, i.e. pre_vote would check membership.

Though thinking more about it, both sides are needed, and probably more.

One for RabbitMQ to clean up (or not start) removed members on startup, one in ra to prevent a partitioned node from becoming a leader again, and another where RabbitMQ gets notified when a member with an out-of-date uid shows up, so it can do the cleanup.

kjnilsson commented 1 month ago

The uids aren't exchanged in the Raft commands, so changing Ra would not be easy to do. Where possible, it is better for the system running the members to take responsibility for ensuring the right members are running.