rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

At-least-once Quorum Queue dead-lettering #3100

Closed kjnilsson closed 2 years ago

kjnilsson commented 3 years ago

As explained in the dead-lettering Safety section, moving messages between queues using DLX has at-most-once delivery guarantees and thus cannot be considered safe. This violates the safety guarantees provided by quorum queues, as messages can be lost if anything goes wrong during the move.

Moving messages between queues using the shovel can be safe when both publisher confirms and consumer acknowledgements are used (as they are by default).
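To make the difference between the two delivery guarantees concrete, here is a minimal sketch in plain Python. It is not RabbitMQ or shovel code: the queues are modelled as lists, and the function names and the `publish_ok` flag (standing in for "the target confirmed the publish") are assumptions made for illustration only.

```python
def move_at_most_once(source, target, publish_ok):
    """DLX-style move: the message leaves the source before we know
    whether the target accepted it."""
    msg = source.pop(0)          # removed from the source first
    if publish_ok:
        target.append(msg)
    # If the publish failed, the message is now in neither queue.


def move_at_least_once(source, target, publish_ok):
    """Shovel-style move: publish, wait for the confirm, then ack."""
    msg = source[0]              # consumed but not yet acknowledged
    if publish_ok:
        target.append(msg)       # publisher confirm received
        source.pop(0)            # consumer ack removes it from the source
    # If the publish failed, the message stays in the source for a retry.
```

With a failed publish the first function leaves both lists empty, while the second leaves the message in `source`. The price of the second approach is possible duplicates, for example if the target stores the message but the confirm is lost and the move is retried.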

Given the above observations, we could model dead lettering inside quorum queues as a dedicated (QQ-internal) queue containing any discarded messages. This "discard queue" is consumed by a dedicated "discard consumer" that only receives messages that have been discarded. The consumer (which is a separate Erlang process) works a bit like a special shovel that consumes discarded messages and re-routes them according to the DLX configuration for the queue. Once the process has received all publisher confirms for a given message, it acks the consumed discard message, thus ensuring that the message isn't removed until it has been safely delivered to the DLX target queue(s).
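The bookkeeping the discard consumer would need can be sketched roughly as below. This is a plain Python model, not the Erlang implementation: the class name, method names, and the queue names in the usage example are all hypothetical, and it assumes every discarded message is routed to at least one target queue.

```python
from dataclasses import dataclass, field


@dataclass
class DiscardConsumer:
    """Hypothetical model: tracks which target queues still owe a
    publisher confirm for each discarded message."""

    # msg_id -> set of target queues that have not confirmed yet
    pending: dict = field(default_factory=dict)
    # msg_ids acked back to the source queue (now safe to remove there)
    acked: list = field(default_factory=list)

    def on_discarded(self, msg_id, targets):
        # The message is (notionally) published to every target queue;
        # remember that a confirm is expected from each of them.
        self.pending[msg_id] = set(targets)

    def on_confirm(self, msg_id, target):
        waiting = self.pending.get(msg_id)
        if waiting is None:
            return  # unknown or already acked: ignore late/duplicate confirms
        waiting.discard(target)
        if not waiting:
            # All confirms are in, so ack the discard message.
            del self.pending[msg_id]
            self.acked.append(msg_id)
```

For example, after `on_discarded(1, ["dlq.a", "dlq.b"])` a single confirm from `dlq.a` leaves message 1 unacked; only the confirm from `dlq.b` moves it to `acked`.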

Even with this approach, it is possible that a message never receives all the confirms needed to ack it and remove it from the source queue. For quorum queues this could cause excessive log growth, as the source queue needs to retain the discarded message until it has been acked. To handle this case, the forwarding process would still need to ack the message after a given time and/or number of retried deliveries. To ensure the message isn't completely lost, we could introduce a "trash can": a node-local stream where we write all messages that cannot be delivered to DLX target queues within some time frame.
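One possible shape for the give-up rule described above is sketched here in plain Python. The limits (`max_age`, `max_attempts`), the class and method names, and the use of a list as the "trash can" are assumptions for illustration; the proposal does not fix any of these details, and a real trash can would be a node-local stream.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    body: bytes
    first_seen: float   # time the message was discarded
    attempts: int = 0   # delivery retries so far


@dataclass
class Forwarder:
    max_age: float = 30.0      # hypothetical limit, in seconds
    max_attempts: int = 5      # hypothetical limit
    pending: dict = field(default_factory=dict)
    trash_can: list = field(default_factory=list)  # stands in for the stream
    acked: list = field(default_factory=list)

    def on_discarded(self, msg_id, body, now):
        self.pending[msg_id] = Pending(body, now)

    def on_all_confirms(self, msg_id):
        # Normal path: every target confirmed, so ack the source message.
        if self.pending.pop(msg_id, None) is not None:
            self.acked.append(msg_id)

    def on_retry_tick(self, msg_id, now):
        p = self.pending.get(msg_id)
        if p is None:
            return "done"
        p.attempts += 1
        if p.attempts >= self.max_attempts or now - p.first_seen >= self.max_age:
            # Give up: keep a copy in the trash can, then ack so the
            # source queue can truncate its log.
            self.trash_can.append((msg_id, p.body))
            del self.pending[msg_id]
            self.acked.append(msg_id)
            return "trashed"
        return "retry"
```

The ordering matters in the give-up branch: the message is written to the trash can before it is acked, so it is never in a state where neither the source queue nor the trash can holds it.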

To ensure availability of the consuming process, we can spawn it as a companion process to the QQ leader, thus ensuring that it is always available when there is a leader to process commands. If we don't want to add another process for each quorum queue, we could later pool these processes, but that may not be necessary as long as they do not set too large a prefetch and hibernate when idle.

edbyford commented 2 years ago

Duplicate of rabbitmq/data-plane#1. Keeping due to context in the tickets.

edbyford commented 2 years ago

Jepsen tests have not given us enough confidence that messages are always forwarded from the source QQ to the target QQ. It requires some workarounds as it stands.

Similar issues occur in some edge scenarios (on deletion and recreation of target queues).

kjnilsson commented 2 years ago

This was done in #3121