Consistent hash exchange bindings are changed and break when a node restarts

luos commented 3 years ago

Hi,

When a single node is restarted out of a cluster the consistent hash exchange will add existing bindings again into the ring.

Reproduction:

Create 3 node cluster
Create a consistent hash exchange
Bind a queue to it with any weight (for example 1)
Stop a node (with stop_app or anything)

Output:

root@rmq1:/# rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
[{chx_hash_ring,{resource,<<"/">>,exchange,<<"test-exchange">>},
                #{13 => {resource,<<"/">>,queue,<<"test-qq">>}},
                14}]
root@rmq1:/# rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'^C
root@rmq1:/# rabbitmqctl start_app^C
root@rmq1:/# rabbitmqctl stop_app && rabbitmqctl start_app
Stopping rabbit application on node rabbit@rmq1.monitored.host ...
Starting node rabbit@rmq1.monitored.host ...
root@rmq1:/# rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
[{chx_hash_ring,{resource,<<"/">>,exchange,<<"test-exchange">>},
                #{13 => {resource,<<"/">>,queue,<<"test-qq">>},
                  14 => {resource,<<"/">>,queue,<<"test-qq">>},
                  15 => {resource,<<"/">>,queue,<<"test-qq">>}},
                16}]
root@rmq1:/# rabbitmqctl stop_app && rabbitmqctl start_app
Stopping rabbit application on node rabbit@rmq1.monitored.host ...
Starting node rabbit@rmq1.monitored.host ...
root@rmq1:/# rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
[{chx_hash_ring,{resource,<<"/">>,exchange,<<"test-exchange">>},
                #{13 => {resource,<<"/">>,queue,<<"test-qq">>},
                  14 => {resource,<<"/">>,queue,<<"test-qq">>},
                  15 => {resource,<<"/">>,queue,<<"test-qq">>},
                  16 => {resource,<<"/">>,queue,<<"test-qq">>},
                  17 => {resource,<<"/">>,queue,<<"test-qq">>}},
                18}]
root@rmq1:/# rabbitmqctl stop_app && rabbitmqctl start_app
Stopping rabbit application on node rabbit@rmq1.monitored.host ...
Starting node rabbit@rmq1.monitored.host ...
root@rmq1:/# rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
[{chx_hash_ring,{resource,<<"/">>,exchange,<<"test-exchange">>},
                #{13 => {resource,<<"/">>,queue,<<"test-qq">>},
                  14 => {resource,<<"/">>,queue,<<"test-qq">>},
                  15 => {resource,<<"/">>,queue,<<"test-qq">>},
                  16 => {resource,<<"/">>,queue,<<"test-qq">>},
                  17 => {resource,<<"/">>,queue,<<"test-qq">>},
                  18 => {resource,<<"/">>,queue,<<"test-qq">>},
                  19 => {resource,<<"/">>,queue,<<"test-qq">>}},
                20}]

After this cleaning up the bindings and recreating them breaks the exchange:

Remove all bindings from the queue / exchange
Add a binding to a queue
Try to publish messages

rmq1.monitored.host | 2021-09-08 12:52:49.746842+00:00 [warn] <0.32348.1> Bucket 6 not found
rmq1.monitored.host | 2021-09-08 12:52:53.163576+00:00 [warn] <0.32375.1> Bucket 6 not found

michaelklishin commented 3 years ago

I only see two options:

Make hash ring additions idempotent
Assume that if the table can be sync, we can use its existing data (this may require using disk nodes, which I don't know how to make upgradeable)

Binding recovery will happen on node boot regardless of what we do. Besides bindings themselves, exchange state is a really rare edge case that's not really handled in RabbitMQ.

Ideas and PRs would be appreciated, our team doesn't have the capacity to redesign this plugin any time soon.

luos commented 3 years ago

I think it would be fine to not apply the binding if we already have weight number of entries - though it makes it harder that you can bind the queue multiple times.

Maybe as a quick fix just resolving the counter(?) which goes wrong with the buckets would be fine so at least there is a queue where the message goes even if it's not consistent.

michaelklishin commented 3 years ago

We can ignore the fact that you can bind a queue multiple times as that's clearly something the binding recovery process and such a stateful exchange type cannot handle (we don't know what other bindings are going to be recovered down the line).

For this exchange type, multiple bindings likely is a result of a mistake. We can assume that if you have enough ring entries ("weight") for the recovered binding, we can do nothing.

FalconerTC commented 3 years ago

Very interested in a resolution to this issue! I'm seeing the same issue, where node restarts cause the exchange to become unusable, though I'm not seeing the binding count increase.

FalconerTC commented 2 years ago

Turns out I didn't really resolve this in https://github.com/rabbitmq/rabbitmq-server/pull/3594... When bindings are copied over when a new node comes up, they aren't sorted anymore so remove_binding doesn't properly adjust

An example after node removal:

I have no name!@rabbitmq-6:/$ rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
 {chx_hash_ring,{resource,<<"/">>,exchange,<<"order-group-action">>},
                #{0 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-pzn4k">>},
                  1 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-kt7v5">>},
                  2 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-kt7v5">>},
                  3 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-pzn4k">>},
                  4 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-kt7v5">>},
                  5 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-pzn4k">>},
                  6 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-jkw6b">>}},
                7},

Which means the ring breaks when one of those bindings is removed

I have no name!@rabbitmq-6:/$ rabbitmqctl eval 'ets:tab2list(rabbit_exchange_type_consistent_hash_ring_state).'
 {chx_hash_ring,{resource,<<"/">>,exchange,<<"order-group-action">>},
                #{3 =>
                      {resource,<<"/">>,queue,
                                <<"order-group-action-panda-668d997c5-jkw6b">>}},
                4},

This also can happen by doing the following

Create binding1 with queue1
Create binding2 with queue2
Create binding3 with queue1 (with a different weight)
Delete binding1

I imagine this can be fixed with some sorting in remove_binding but as per

For this exchange type, multiple bindings likely is a result of a mistake

I wonder if it would be better to just reject new bindings with the same name? Any suggestion @michaelklishin ?

luos commented 2 years ago

I think that maybe rebinding should just ensure that there are as many buckets as the new binding and drop the old binding. What do you think?

I'd say rejecting a binding may be unexpected for some applications, though most probably for 99% of users it does not matter (and now it's broken anyway).

de1ayer commented 2 years ago

Hi all! I've experienced same issue last week ("Bucket not found" in log, no routing out of exchange) and wondering around for fix. #3594 is merged to 3.8.24 (we have 3.8.3 versioned cluster at the moment) so I may update to get partly fix.

What about adding new nodes which is the main way to scale cluster? Is it unsafe for 'x-consistent-hash' exchanges ?

caoheyang commented 2 years ago

Hi all! Is there have any updates?

Heyang

Regards

luos commented 2 years ago

Hi,

From my side, I'd prefer if the plugin rejects rebinding, that would simplify it, but it definitely needs some more thinking and implementation. :) I think node restarts then deleting the binding still break it. I have not tested but possibility it's OK when you do not delete bindings, keeping in mind, restarting will probably reshuffle the exchange.

luos commented 2 years ago

Hi,

I made some tests and here are my findings:

Restarting a node does not really break the exchange, however it adds more bindings as previously. This means that the messages will be reshuffled.
After a restart, deleting a binding will leave the exchange in an inconsistent state, losing messages.

I tested this on 3.9.13.

nunoguerreirorosa commented 2 years ago

Consistent hash exchange is the only option as far as I am aware in RabbitMQ to be able to maintain the order of dependent messages while scaling up the number of consumers. This plugin was also added in Amazon MQ but apparently with this particular bug it ends up rendering this plugin unusable. I would say it is quite a critical bug for anyone thinking on using this particular plugin, since a node restart within a cluster is quite a standard operation. Is there any way this can be prioritized in the roadmap @michaelklishin ?

stgolem commented 2 years ago

@FalconerTC is there any chance on fixing this issue? Or are there any known workarounds for this aside from full rebinding? We're stuck with this issue in our application also.

michaelklishin commented 2 years ago

@nunoguerreirorosa this is open source software, so you if want something "prioritized", dive in and provide a set of steps to reasonably reliably reproduce, then feel free to look into the root cause. Pressing others to "prioritize" something in a piece of software you get for free is not how open source software evolves.

Consistent hashing exchange, unlike every other "built-in" (or tier 1) exchange type, is stateful. Booting nodes would recover ring state while applications can already begin modifying the same state by binding or unbinding. Besides delaying client connection listener startup until after all plugins are enabled and some time passes, there is no way to avoid this that would not result in refused client operations.

@luos ring state recovery does not involve creating any bindings. The plugin uses a list of durable bindings to recompute the (transient) state and then adapts it as bindings are added or removed.

@stgolem we don't understand what the issue is or how to reproduce it. You are welcome to spend some time on that.

michaelklishin commented 2 years ago

I now see some investigations above 👍

Making binding addition idempotent sounds like the right thing to do (removal, of course, already is).

This is a behavior change so we either have to try to squeeze it into 3.10 or anger the SemVer lovers and ship it in a patch release (even though it would count as a bug fix to many).

michaelklishin commented 2 years ago

Other team members suggest that we cannot delay 3.10 over this, so it would have to go into a patch release at some point.

luos commented 2 years ago

I do not think that's a problem. It is a bug and now it is quite broken, I'd say it is not recommended to use in a clustered environment, so if it gets fixed at some point, it will be a bug fix.

stgolem commented 2 years ago

Thanks for your time. We have exactly same issue as others with this plugin in cluster environment and node restarts. We will gently wait for this issue to resolve in some next release. As a user i classify this as a bug.

SteamUpdate commented 2 years ago

Greetings! I have a case how to reproduce "bucket not found" error.

RMQ Version: 3.9.15
Erlang Version: 24.3.3

RMQ runs as a cluster of three nodes. One exchange and four queues has been created. 1 & 3 queues are located on node 2, 2 & 4 - on node 1. Four bindings with routing key "1" has been created also.

Check the ring status:

$ rabbitmq-diagnostics consistent_hash_exchange_ring_state test-hash
Inspecting consistent hashing ring state for exchange test-hash in virtual host '/'...
Ring index: 0, queue: 'test1'
Ring index: 1, queue: 'test2'
Ring index: 2, queue: 'test3'
Ring index: 3, queue: 'test4'

Send a message with routing key 101 in the exchange. The message has been routed to queue 'test1'.

Next, stop node 1 with rabbitmqctl stop_app command and check the ring status:

rabbitmq-diagnostics consistent_hash_exchange_ring_state test-hash
Inspecting consistent hashing ring state for exchange test-hash in virtual host '/'...
Ring index: 0, queue: 'test1'
Ring index: 1, queue: 'test3'

Next, start node 1 with rabbitmqctl start_app command and check the ring status:

rabbitmq-diagnostics consistent_hash_exchange_ring_state test-hash
Inspecting consistent hashing ring state for exchange test-hash in virtual host '/'...
Ring index: 0, queue: 'test1'
Ring index: 1, queue: 'test3'
Ring index: 2, queue: 'test1'
Ring index: 3, queue: 'test3'
Ring index: 4, queue: 'test2'
Ring index: 5, queue: 'test4'

Next, stop node 2 and check the ring status:

$ rabbitmq-diagnostics consistent_hash_exchange_ring_state test-hash
Inspecting consistent hashing ring state for exchange test-hash in virtual host '/'...
Ring index: 1, queue: 'test2'
Ring index: 2, queue: 'test4'

Next, start node 2 and check the ring status:

$ rabbitmq-diagnostics consistent_hash_exchange_ring_state test-hash
Inspecting consistent hashing ring state for exchange test-hash in virtual host '/'...
Ring index: 1, queue: 'test2'
Ring index: 2, queue: 'test4'
Ring index: 3, queue: 'test2'
Ring index: 4, queue: 'test4'
Ring index: 5, queue: 'test1'
Ring index: 6, queue: 'test3'

Finally, send a message with routing key 101 in the exchange.

Result: the message published, but not routed. Logs: 2022-04-20 13:20:04.078869+00:00 [warn] <0.24750.1> Bucket 0 not found

cc: @luos @michaelklishin

tronyx commented 2 years ago

Is this planned to be fixed in both 3.9.X and 3.10.X? Are there any recommendations on a quicker way to resolve this after container recreation other than simply removing and recreating the exchange?

michaelklishin commented 2 years ago

I have been looking at how we can make binding addition and deletion idempotent for this exchange type. I see only one simple way: to keep track of what (destination, type, routing key) have been added and if they are known, make the function do nothing.

This has one limitation. Code that repeatedly binds the same exchange and queue with different routing keys/weights is no longer going to work:

# in pseudocode
bind(chx, q1, weight: 1)
bind(chx, q1, weight: 3)

will lead to a single binding with "weight = 1". I find this acceptable but there may be some code that's going to break as a result, or at least the routing distribution will be affected. Quite frankly, I don't think the weights idea is worth the complexity anyway, and this may be our first step towards getting rid of it entirely for 3.11 or 4.0.

Trying to make hash ring updates idempotent is going to be too complex for me to agree to.

HoloRin commented 2 years ago

If it is the case that the effect of weights is indeterminate in the face of node restarts, then adopting a first wins approach does not seem unreasonable to me.

michaelklishin commented 2 years ago

The hash ring state record changes would require a schema database migration, which seems like an overkill here. Will try to make the function idempotent using the bindings/routes tables instead which are very stable.

michaelklishin commented 2 years ago

First attempt at relying on binding data to make hash ring state management idempotent failed. Because RabbitMQ will create bindings anyway, it would only create confusion.

A longer term solution would be to make this the first plugin that adopts Khepri, our Raft-based next generation schema data store.

In the short term I'll look into using the state of the ring itself and/or the good old distributed locking module in Erlang.

ansd commented 2 years ago

Thanks @SteamUpdate for the excellent reproduction steps in https://github.com/rabbitmq/rabbitmq-server/issues/3386#issuecomment-1103929292.

https://github.com/rabbitmq/rabbitmq-server/pull/5121 makes adding bindings idempotent and will therefore fix this issue.

As @michaelklishin already mentioned this comes with a slight breaking change for applications that previously relied upon adding duplicate bindings (i.e. bindings with the same source exchange and same destination queue but possibly different routing key).

rabbitmq / rabbitmq-server

Consistent hash exchange bindings are changed and break when a node restarts #3386