rabbitmq / rabbitmq-consistent-hash-exchange

RabbitMQ Consistent Hash Exchange Type
https://www.rabbitmq.com/
Other
208 stars 31 forks source link

Queue deletion and binding removal can affect hash ring state consistency #40

Closed lukas-phaf closed 5 years ago

lukas-phaf commented 5 years ago

In RabbitMQ 3.7.9, when I use certain bindings weights (e.g. 34, 100) when binding a queue, routing of message can fail, and I get the following type of errors in the RMQ log:

2019-01-07 10:22:38.768 [warning] <0.699.0> Bucket 6 not found

How to reproduce

The problem can be reproduced with a slightly modified version of Python tutorial 5 (https://www.rabbitmq.com/tutorials/tutorial-five-python.html). The only change is that the exchange_type is changed to 'x-consistent-hash' in both emit and receive code. I will add the full code at the bottom of this post.

To reproduce, do the following in the 'receive' shell:

python receive_logs_topic.py 34
<Ctrl-C>
python receive_logs_topic.py 34

So, on the receive side, we start the receive program (which binds an exclusive queue), stop the program, and restart (which binds a new exclusive queue). Notice that everything works fine if the receive program is not restarted. The deletion of the queue is what causes the problem. Now, on the 'emit' shell:

python emit_logs_topic.py key value

The receive shell does not receive the message, and a warning is produced in the log.

I have used a binding weight of 100 successfully in the past on older versions of RMQ (e.g. 3.5.7). I suspect the problem is related to the changes made for #37. The weight of 34 is the lowest weight that seems to give this problem.

Additional information

emit_logs_topic.py:

#!/usr/bin/env python
import pika
import sys

connection = pika.BlockingConnection(pika.ConnectionParameters(
        host='localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='topic_logs',
                         exchange_type='x-consistent-hash')

routing_key = sys.argv[1] if len(sys.argv) > 2 else 'anonymous.info'
message = ' '.join(sys.argv[2:]) or 'Hello World!'
channel.basic_publish(exchange='topic_logs',
                      routing_key=routing_key,
                      body=message)
print(" [x] Sent %r:%r" % (routing_key, message))
connection.close()

receive_logs_topic.py

#!/usr/bin/env python
import pika
import sys

connection = pika.BlockingConnection(pika.ConnectionParameters(
        host='localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='topic_logs',
                         exchange_type='x-consistent-hash')

result = channel.queue_declare(exclusive=True)
queue_name = result.method.queue

binding_keys = sys.argv[1:]
if not binding_keys:
    sys.stderr.write("Usage: %s [binding_key]...\n" % sys.argv[0])
    sys.exit(1)

for binding_key in binding_keys:
    channel.queue_bind(exchange='topic_logs',
                       queue=queue_name,
                       routing_key=binding_key)

print(' [*] Waiting for logs. To exit press CTRL+C')

def callback(ch, method, properties, body):
    print(" [x] %r:%r" % (method.routing_key, body))

channel.basic_consume(callback,
                      queue=queue_name,
                      no_ack=True)

channel.start_consuming()
michaelklishin commented 5 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).

We get at least a dozen of questions through various venues every single day, often light on details. At that rate GitHub issues can very quickly turn into a something impossible to navigate and make sense of even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.

Please post this to rabbitmq-users.

Thank you.

michaelklishin commented 5 years ago

The latest revision of this plugin has been pretty extensively tested. I will try to reproduce but as most questions around this plugin, they likely scenario-specific or stem from misunderstanding around how it works.

Weights of 34 an 100 are rarely necessary, as the docs explain.

RabbitMQ tutorials are intentionally very basic and most depend on application startup order (consumers must be started first). We do have a couple of scripts here but none of the rest of the information.

Please move this to the mailing list.

michaelklishin commented 5 years ago

The script above creates an exclusive queue and when it is stopped, the queue is deleted. This scenario was originally problematic in #37 due to transaction retries in the core that left this plugin no chance of updating its ring state. There are tests (in the integration suite and other) that do exactly that and they pass.

I managed to reproduce the warning with exclusive queues but it expectedly never happens with non-exclusive durable ones since there is no binding deletion concurrency. Weights or the implementation of this plugin as of #39 are not problematic. This comment explains what's really going on. We'll have to investigate first and 3.7.10 won't be delayed because of this.

michaelklishin commented 5 years ago

This test seems to do exactly what shutting down the consumer above does.

I'm adding a variation of that test that uses a higher weight (say, 50). In case @dcorbacho and I uncover something we will reopen this or file a new issue.

lukas-phaf commented 5 years ago

I tried with non-exclusive queue (removed the 'exclusive' when declaring the queue on line 12 of receive_logs_topic.py). The same problem occurs if you delete the queue manually.

The instruction to reproduce are now:

python receive_logs_topic.py 34
<Ctrl-C>
-> manually delete the queue using the RabbitMQ admin website.
python receive_logs_topic.py 34

A weight of 50 works fine, please do the tests with the numbers I have identified as problematic (34 or 100).

I understand that high weights are (no longer?) necessary, but I thought it was a good idea to report so that other people don't hit the same issue.

michaelklishin commented 5 years ago

Take a look at how the ring state is managed when a binding is removed. I find it hard to believe that there are magic weight values. We will add more cases.

michaelklishin commented 5 years ago

@lukas-phaf we have a candidate fix. The tests with weights of 34, 50, 100 and greater numbers seem to pass with it. If I upload a plugin build, would you be able to give it a shot to confirm? Simply replace the plugin's archive (.ez) in your RabbitMQ distribution (I suggest using a one-off binary build).

lukas-phaf commented 5 years ago

I am willing to give it a try, but not clear on what you are asking me to do. So far, I have only installed RabbitMQ from RPM. Can you point me to instruction how to do a one-off binary build?

michaelklishin commented 5 years ago

@lukas-phaf

  1. Download a binary build
  2. Start it with ./sbin/rabbitmq-server in the foreground
  3. Reproduce the issue
  4. Delete plugins/rabbitmq_consistent_hash_exchange-3.7.9.ez, copy a file I will upload shortly to the same directory (thus replacing the plugin).
  5. Stop the node, start it back
  6. Try to reproduce the issue
michaelklishin commented 5 years ago

Below is a zip-compressed .ez file (GitHub doesn't support .ez file uploads). Please give it a shot, @dcorbacho and I have concluded that it's a sufficient fix. Thank you!

rabbitmq_consistent_hash_exchange-40.g392d964.ez.zip

lukas-phaf commented 5 years ago

@michaelklishin

I confirm that this fixes it for me as well, for both the exclusive and non-exclusive queue case. I tried with weights of 34 and 100, and saw no issues.

Thanks for the quick response!

michaelklishin commented 5 years ago

@lukas-phaf OK, it will be included into 3.7.10 (due any time now) after we do some test suite refactoring. Thank you for reporting this with enough details to quickly reproduce.

michaelklishin commented 5 years ago

3.7.10 is out and includes a fix.