rebus-org / Rebus.RabbitMq

:bus: RabbitMQ transport for Rebus
https://mookid.dk/category/rebus

RabbitMQ failover during startup #98

Closed dmeagor closed 2 years ago

dmeagor commented 2 years ago

When our server reboots, the application fails because Rebus tries to connect before RabbitMQ has started. This is a permanent failure that requires a developer to manually restart every app pool to get things working again. The issue did not crop up in testing, but now that it's on our production systems it has become a critical flaw, causing outages whenever the server is auto-patched or rebooted.

We are using Rebus through the ABP.io framework abstraction, which limits what we can do from our end to correct this. I did ask them for a fix, but they replied that this is a known Rebus issue, and they seem uninterested in creating a workaround.

Is there any way Rebus can be told to retry every 2 seconds for 2 minutes or something?

mookid8000 commented 2 years ago

It's actually quite intentional that Rebus lets exceptions bubble through if it's unable to start. Most times, it will be because of a wrong hostname in a connection string, a totally wrong connection string, a closed firewall, etc., and those are often things you want to know right away.

But I can also see how a little bit of robustness in some scenarios like yours could be beneficial. Wouldn't it be possible for you to enable some resilience in your hosting environment so that the process is started more than once? This is usually how I see people get over these things: By configuring their containers to restart unless explicitly stopped, or their Windows Services to automatically restart after having crashed, etc.

dmeagor commented 2 years ago

Hi

It's not a service or Docker container, just a standard IIS website/app. I'm not aware of any option to automatically recycle the app pool if an ASP.NET Core Startup class fails.

Can you confirm that this is something the people who develop the ABP framework should fix through a try/delay/retry loop or some other mechanism? I will link this issue to the commercial support ticket we have open with them.

mookid8000 commented 2 years ago

(...) the ABP framework (...)

I am not familiar with ABP, so I have no idea what it's using Rebus for, and how it's doing that. But yes, I would say it should either add some resiliency with a timeout or simply let errors bubble out, so whatever orchestrator is activating the app can restart it.
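As an illustration of "resiliency with a timeout", here is a minimal sketch of a retry helper that keeps calling a startup delegate at a fixed interval until it succeeds or a time limit is reached. `StartupRetry` and `RunAsync` are hypothetical names, not part of Rebus or ABP, and the delegate stands in for whatever call creates and starts the bus in your application.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Rebus: retries a startup action
// until it succeeds or the time limit would be exceeded.
public static class StartupRetry
{
    public static async Task<T> RunAsync<T>(
        Func<T> start,
        TimeSpan delay,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = Stopwatch.StartNew();

        while (true)
        {
            try
            {
                return start();
            }
            catch (Exception) when (stopwatch.Elapsed + delay < timeout)
            {
                // There is still time left: wait, then try again.
                await Task.Delay(delay, cancellationToken);
            }
            // When the filter above is false, the exception is not caught
            // and propagates to the caller, so the host still sees the failure.
        }
    }
}
```

With the numbers from the original question, a call could look like `await StartupRetry.RunAsync(() => CreateAndStartBus(), TimeSpan.FromSeconds(2), TimeSpan.FromMinutes(2))`, where `CreateAndStartBus` is a placeholder for your own startup code. Note the trade-off: because every exception is retried, a genuinely wrong connection string is also only reported after the timeout has passed.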

zlepper commented 1 year ago

We are having the same issue, and we are using Rebus directly. Even if we disable the automatic startup of the bus and try to handle it manually with retries on our side, the RabbitMQ transport still tries to connect to RabbitMQ to create the input queue, even if no workers are actually running, which makes it impossible for us to do any retrying on our side. And just like OP, we are stuck in IIS, which can't properly do retries at the hosting level. Can the transport be updated to actually respect delayed start, so we can do the retrying in our code?

mookid8000 commented 1 year ago

I've just released Rebus.RabbitMQ 7.4.4, which has the tiny tweak that it completely avoids creating a RabbitMQ connection during initialization, if there's no need for it.

This means that you can disable declarations by going

services.AddRebus(
    configure => configure
        .Transport(t => t.UseRabbitMq(connectionString, "whatever")
            .Declarations(
                declareExchanges: false,
                declareInputQueue: false,
                bindInputQueue: false
            ))
);

and then completely defer establishing a connection to RabbitMQ.

This of course puts a burden on you, as you will need to make the appropriate exchange and queue declarations, followed by the correct input queue binding. I recommend you let Rebus do it in a controlled manner (possibly locally), and then inspect the created RabbitMQ entities in the management console to see how it's supposed to look.
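For illustration, a rough sketch of what such manual declarations could look like with the synchronous API of RabbitMQ.Client 6.x. Everything specific here is an assumption to be checked against what Rebus actually creates: the exchange names `RebusDirect` and `RebusTopics`, the durability settings, the absence of queue arguments, and the use of the queue name as routing key. `RabbitMqTopology` is a hypothetical helper name.

```csharp
using System;
using RabbitMQ.Client;

// Hypothetical helper: declares the entities that Rebus would otherwise
// declare itself. Verify names and arguments in the management console.
public static class RabbitMqTopology
{
    public static void Declare(string connectionString, string inputQueue)
    {
        var factory = new ConnectionFactory { Uri = new Uri(connectionString) };

        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        // Assumed exchange names and types; copy the real ones from a
        // broker where Rebus was allowed to make its own declarations.
        channel.ExchangeDeclare("RebusDirect", ExchangeType.Direct, durable: true);
        channel.ExchangeDeclare("RebusTopics", ExchangeType.Topic, durable: true);

        // Assumed queue settings; Rebus may pass additional arguments.
        channel.QueueDeclare(inputQueue, durable: true, exclusive: false,
            autoDelete: false, arguments: null);

        // Bind the input queue so messages addressed to it are delivered.
        channel.QueueBind(inputQueue, "RebusDirect", routingKey: inputQueue);
    }
}
```

If the declared settings differ from the ones Rebus uses, RabbitMQ will reject a later redeclaration with different arguments, which is one more reason to copy the values from a broker where Rebus created the entities itself.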

This should work as a temporary solution. I'll create an issue for looking into putting the appropriate bits in a Lazy<> or something like that – either at the RabbitMQ level, or at the Rebus level.

I hope this will work out for you 🙂