Closed lukas-phaf closed 5 years ago
Thank you for your time.
Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).
We get at least a dozen of questions through various venues every single day, often light on details. At that rate GitHub issues can very quickly turn into a something impossible to navigate and make sense of even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.
Please post this to rabbitmq-users.
Thank you.
The latest revision of this plugin has been pretty extensively tested. I will try to reproduce but as most questions around this plugin, they likely scenario-specific or stem from misunderstanding around how it works.
Weights of 34 an 100 are rarely necessary, as the docs explain.
RabbitMQ tutorials are intentionally very basic and most depend on application startup order (consumers must be started first). We do have a couple of scripts here but none of the rest of the information.
Please move this to the mailing list.
The script above creates an exclusive queue and when it is stopped, the queue is deleted. This scenario was originally problematic in #37 due to transaction retries in the core that left this plugin no chance of updating its ring state. There are tests (in the integration suite and other) that do exactly that and they pass.
I managed to reproduce the warning with exclusive queues but it expectedly never happens with non-exclusive durable ones since there is no binding deletion concurrency. Weights or the implementation of this plugin as of #39 are not problematic. This comment explains what's really going on. We'll have to investigate first and 3.7.10 won't be delayed because of this.
This test seems to do exactly what shutting down the consumer above does.
I'm adding a variation of that test that uses a higher weight (say, 50). In case @dcorbacho and I uncover something we will reopen this or file a new issue.
I tried with non-exclusive queue (removed the 'exclusive' when declaring the queue on line 12 of receive_logs_topic.py). The same problem occurs if you delete the queue manually.
The instruction to reproduce are now:
python receive_logs_topic.py 34
<Ctrl-C>
-> manually delete the queue using the RabbitMQ admin website.
python receive_logs_topic.py 34
A weight of 50 works fine, please do the tests with the numbers I have identified as problematic (34 or 100).
I understand that high weights are (no longer?) necessary, but I thought it was a good idea to report so that other people don't hit the same issue.
Take a look at how the ring state is managed when a binding is removed. I find it hard to believe that there are magic weight values. We will add more cases.
@lukas-phaf we have a candidate fix. The tests with weights of 34, 50, 100 and greater numbers seem to pass with it. If I upload a plugin build, would you be able to give it a shot to confirm? Simply replace the plugin's archive (.ez) in your RabbitMQ distribution (I suggest using a one-off binary build).
I am willing to give it a try, but not clear on what you are asking me to do. So far, I have only installed RabbitMQ from RPM. Can you point me to instruction how to do a one-off binary build?
@lukas-phaf
./sbin/rabbitmq-server
in the foregroundplugins/rabbitmq_consistent_hash_exchange-3.7.9.ez
, copy a file I will upload shortly to the same directory (thus replacing the plugin).Below is a zip-compressed .ez file (GitHub doesn't support .ez file uploads). Please give it a shot, @dcorbacho and I have concluded that it's a sufficient fix. Thank you!
@michaelklishin
I confirm that this fixes it for me as well, for both the exclusive and non-exclusive queue case. I tried with weights of 34 and 100, and saw no issues.
Thanks for the quick response!
@lukas-phaf OK, it will be included into 3.7.10 (due any time now) after we do some test suite refactoring. Thank you for reporting this with enough details to quickly reproduce.
3.7.10 is out and includes a fix.
In RabbitMQ 3.7.9, when I use certain bindings weights (e.g. 34, 100) when binding a queue, routing of message can fail, and I get the following type of errors in the RMQ log:
How to reproduce
The problem can be reproduced with a slightly modified version of Python tutorial 5 (https://www.rabbitmq.com/tutorials/tutorial-five-python.html). The only change is that the exchange_type is changed to 'x-consistent-hash' in both emit and receive code. I will add the full code at the bottom of this post.
To reproduce, do the following in the 'receive' shell:
So, on the receive side, we start the receive program (which binds an exclusive queue), stop the program, and restart (which binds a new exclusive queue). Notice that everything works fine if the receive program is not restarted. The deletion of the queue is what causes the problem. Now, on the 'emit' shell:
The receive shell does not receive the message, and a warning is produced in the log.
I have used a binding weight of 100 successfully in the past on older versions of RMQ (e.g. 3.5.7). I suspect the problem is related to the changes made for #37. The weight of 34 is the lowest weight that seems to give this problem.
Additional information
emit_logs_topic.py:
receive_logs_topic.py