[Regression in 4.0.0] RabbitMq consumer does not reconnect to non-durable queues when rabbit crashes

oscarfh commented 3 years ago

This is the result of further research of from https://github.com/eclipse-vertx/vert.x/issues/3726. At the end, there is the original post for completion sake.

After migrating to Vertx 4, a test that verified that the producers and consumers could reestablish connection after a RabbitMQ failure broke.

These are my findings:

This and this changes now require you to use the RabbitMQOptions#setReconnectAttempts method in order to enable retry, which was not the case before (before, using RabbitMQOptions#setAutomaticRecoveryEnabled was enough. This seems to be a breaking change that was not documented.
The new method RabbitMQClientImpl#restartConsumer changes the way reconnection happens. Unfortunately the new way only works with durable queues. If your queue is not durable, you will get a channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'test_topi2' in vhost '/', class-id=60, method-id=20). This seems to be a bug, since the older version would recreate the queue.

I tried to fix the issue myself, but it seems that we do not have the required information at hand in RabbitMQClientImpl#restartConsumer and that it requires a deeper change. @YaytayAtWork, I saw that you were involved in the changes in this method. Can you help me understand why this was created? What didn't work in the previous approach? Is there any way we can reuse the com.rabbitmq.client.impl.recovery.AutorecoveringChannel class capabilities in order to reconnect?

Original Post:

We had a test in our system that simulates a connection disruption between our system and rabbitMq.
After migrating to vertx 4, the test is broken.
It passes with 3.9.5, but fails with all 4.0.0.X versions.

Here is the reproducer: https://github.com/oscarfh/vertx-rabbit-regression
The first commit is vertx 4 with the failing test, the second is vertx 3.9.5 with the same testing passing.

Explaining the test:

    The test starts up a rabbitmq docker container
    The test sets up a consumer and a producer
    Both will be configured to reestablish the connection to rabbitmq in case of failures
    The producer will send messages periodically (every 1s)
    The consumer will receive the messages. Upon the reception of the first message, RabbitMq docker container is shutdown and then started up again.
    Upon the reception of a second message, which happens after rabbit was restarted, the test completes

Notice how the producer will complain about connection errors until you see the message Container rabbitmq:3.8.6-alpine started in XXX, then it will stop logging the error messages (but you can change the success message log level or add a CLI consumer to check that it is still sending) but the test will not complete.

Downgrade to 3.9.5 and see that the test completes successfully.

/cc @vietj @jekkel @pravussum

YaytayAtWork commented 3 years ago

There were circumstance in which I couldn't get the Autorecover Channels to bring things back to the correct state at the right time. I'm not sure what they were now, I think to do with multiple clients coming back to a replaced (clean) server introducing race conditions that I couldn't work around.

When I did my work on the rabbit publisher the default value for reconnect attempts was adequate to make it retry, but the RabbitMqOptions was subsequently changed to derive from NetClientOptions. I think that was a mistake, because I think the vertx generators are broken (#112). The workaround is to ensure that there is an adequately large value for the reconnect attempts.

The new way of doing reconnects requires the caller to use addConnectionEstablishedCallback to do whatever it wants to the connection before it is used. If you have to declare rabbit objects it has to be done in a connection established callback.

It should be possible to ignore the retrying that vertx adds (by having reconnectAttempts = 0 and automaticRecoveryEnabled = true). That should fall back to just using what the Java client provides. This may leave you with a race conditions when the auto recovery fails to reconnect, but should be equivalent to what vert.x 3 did. From what you are saying this isn't the case, so I agree that that's a regression.

In your test are you buffering the messages that you are trying to send whilst the server is down, or accepting that they are lost? There is a unit test (RabbitMQClientPublisherTest.java) that does effectively the same as your repro, but using the new RabbitMQPublisher (and thus using a connection established callback) to buffer messages client side until the server accepts them. There shouldn't be any need to use the RabbitMQPublisher, but that's what it's for.

YaytayAtWork commented 3 years ago

Been digging around a bit with your repro, and found two things:

My thought that you'd get the basic Java client behaviour to recreate your rabbit objects doesn't work because the RabbitMqClientOptions doesn't support TopologyRecovery. We should add that.
That means that I don't currently know any more how the v3 recovery worked.

I'll try to find time for more digging tomorrow.

YaytayAtWork commented 3 years ago

After a bit more digging:

The RabbitMQ java client library auto reconnect does not work if the RabbitMQ server is shut down nicely. In this situation the client detects the dropped TCP connection and disconnects even if auto recovery is enabled. Your test works around this by inserting the toxiproxy so the TCP connection from the client is not interrupted (i.e. if the proxy is removed your test fails on both 3.9.5 and 4.0.0).
When the client detects a network problem it makes a callback to inform clients, even if auto recovery would work. At this point the current 4.0.0 RabbitMQClient shoots itself in the head in order to do a clean reconnect, which clearly prevents auto recovery from doing anything useful.
I don't think there is any way to know from the callbacks whether auto recovery would work or not. These two points means that auto recovery is fundamentally incompatible with my restart code (which is more brutal) works in more circumstance than auto recovery, but needs more work from the user because you need to implement the connectionEstablishedCallback.
It's easy to enable the RabbitMQClient to work with auto-recovery instead of my restart (not as well as) by having it only do a full restart if the retry count is greater than 0.
The adocs need to cover this.

I'll do a PR (code done, need to update docs). I don't know what to do about notifying breaking changes, but also I think 4.0.1 should be compatible with 3.9.5, though 4.0.0 isn't.

oscarfh commented 3 years ago

@YaytayAtWork Thanks a lot! Only now I had time to read your messages. I think your questions from the first post are now outdated. Let me know if they are still important. One question from your PR: if I understood right, your approach for recovery requires the connectionEstablishedCallback to be declared so the queue is recreated and re-bound. Is this correct?

YaytayAtWork commented 3 years ago

Yes, and that is correct.

oscarfh commented 3 years ago

Ok, thanks a lot. I will wait for your PR to get merged.

oscarfh commented 3 years ago

@YaytayAtWork I just noticed that your PR was merged. Thanks a lot. What is the process in this project? When is this going to be available?

jekkel commented 3 years ago

To me it seems the PR got closed without a merge.

@YaytayAtWork has this been done by accident? Is the issue fixed due to other changes?

YaytayAtWork commented 3 years ago

You are right, I'll redo the PR.

oscarfh commented 3 years ago

@YaytayAtWork @vietj any chances this makes into 4.0.3? I just tried 4.0.1 (because it is marked as milestone 4.0.1) but it does not work.

jekkel commented 3 years ago

@oscarfh could you retry with your reproducer so we con confirm this ticket can be closed?

oscarfh commented 3 years ago

@jekkel I was waiting for the 4.0.3 release, but I will do that using the commit hash.

oscarfh commented 2 years ago

I am very sorry for the delay, I completely forgot it. I think this issue is still present (I had it in another project). I will update the reproducer.

oscarfh commented 2 years ago

@vietj Yes, I can confirm this is still an issue. Setting

rabbitMQOptions.setAutomaticRecoveryEnabled(false);
rabbitMQOptions.setReconnectAttempts(Integer.MAX_VALUE);
rabbitMQOptions.setReconnectInterval(500);

as per doc, will generate

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'test_topic' in vhost '/', class-id=60, method-id=20)
    at com.rabbitmq.client.impl.ChannelN.asyncShutdown(ChannelN.java:517)
    at com.rabbitmq.client.impl.ChannelN.processAsync(ChannelN.java:341)
    at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:182)
    at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:114)
    at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:739)
    at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:666)
    ... 1 common frames omitted

Trying to configure anything on rabbitMQClient.getDelegate().addConnectionEstablishedCallback() will make the consumer not to start.

ps: I updated the reproducer in https://github.com/oscarfh/vertx-rabbit-regression to show the issue

Yaytay commented 2 years ago

The reconnections are behaving as expected - with automatic recovery disabled and reconnects enabled you MUST use the connectionEstablishedCallback in order to recreate your objects. So the question is, what are you doing there that prevents the client starting? Can you provide an example?

oscarfh commented 2 years ago

@Yaytay The Rxified version of the client does not provide such method, so I have to use getDelegate(). Adding rabbitMQClient.getDelegate().addConnectionEstablishedCallback(event -> /* your code here, or empty */) will cause the rxStart not to complete.

~I updated the reproducer. If you run the test, you will see the message "Connection established", but not "Declaring exchange" (and the test is blocked waiting for the consumer to connect). Comment the addConnectionEstablished line and you will see the "Declaring Exchange" message along with tons of other log messages from the test.~

I got it to work. I was missing the event.complete() call. thanks a lot.

Yaytay commented 2 years ago

Excellent.

oscarfh commented 2 years ago

@Yaytay One question that rose within my team: is it expected that this method is not exposed by rxified API? Or should I open a bug in the generator?

Yaytay commented 2 years ago

It should definitely be exposed.

oscarfh commented 2 years ago

Ok, I will create the bug ticket, thanks

vert-x3 / vertx-rabbitmq-client

[Regression in 4.0.0] RabbitMq consumer does not reconnect to non-durable queues when rabbit crashes #116

Original Post: