rebus-org / Rebus.RabbitMq

:bus: RabbitMQ transport for Rebus
https://mookid.dk/category/rebus

Getting timeouts on publishing many events in parallel #113

Open Dylan-DutchAndBold opened 6 months ago

Dylan-DutchAndBold commented 6 months ago

We are having issues on our production systems where we utilise Rebus with RabbitMQ.

Problematic scenario

A connected third-party system posts events to our service over HTTP. For each request, our service publishes an event using Rebus.

The 3rd party system fires around 100 HTTP calls to our system at once, and unfortunately this results in timeout errors from Rebus/RabbitMQ.

This should not be an uncommon scenario.

The exception

[2023-12-08 13:47:20Z] fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.
System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell`1.WaitForValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_ChannelOpen(String outOfBand)
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateNonRecoveringModel()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateModel()
   at Rebus.RabbitMq.RabbitMqTransport.CreateChannel()
   at Rebus.Internals.WriterModelPoolPolicy.Create()
   at Rebus.Internals.ModelObjectPool.Get()
   at Rebus.RabbitMq.RabbitMqTransport.SendOutgoingMessages(IEnumerable`1 outgoingMessages, ITransactionContext context)
   at Rebus.Transport.AbstractRebusTransport.<>c__DisplayClass3_1.<<Send>b__1>d.MoveNext()

Sample project for reproduction

We have set up a sample project that reproduces this error. On my local system the test scenario needs a little more than 100 simultaneous requests to fail, so I have set it to 1000. Unfortunately, the failure only occurs in a scenario similar to our production system, meaning in the context of an HTTP call being handled by .NET.

We tried to reproduce the error in isolation, outside an HTTP context, but there it does not fail with the timeout. However, these tests still show that publishing 1000 messages in parallel takes a very long time to complete: about 40 seconds with Rebus, compared to about 2 seconds with a similar library (MassTransit).

https://github.com/Dylan-DutchAndBold/demonstrate-rebus-timeout-issue

Version information

Software               Version
Rebus                  9.0.1
Rebus.ServiceProvider  10.0.0
Rebus.RabbitMq         9.0.1
RabbitMQ               3.12.10
.NET                   7

mookid8000 commented 6 months ago

Thanks for your detailed report and repro. I will have time to check it out tonight. Meanwhile, could you tell me which delivery guarantee you are using with MassTransit when you get it to send 100 messages in 2 s? Does it use publisher confirms?

Dylan-DutchAndBold commented 6 months ago

> Thanks for your detailed report and repro. I will have time to check it out tonight. Meanwhile, could you tell me which delivery guarantee you are using with MassTransit when you get it to send 100 messages in 2 s? Does it use publisher confirms?

You're very welcome, thank you for taking the time!

We have left both Rebus and MassTransit at their defaults in the test cases. From what I can find, MassTransit has publisher confirms enabled by default: https://masstransit.io/documentation/configuration/transports/rabbitmq#host-configuration

The MassTransit variant is included in the demonstration project. You can run the unit test for MassTransit and compare it to the unit test for Rebus to see this timing difference. It is actually 1000 messages, because my local system needed a bit more than 100 to reproduce the problem.

The highlighted tests below are the ones that show this significant difference in timing when publishing in parallel. When publishing in a for loop, the difference is not as steep. The API integration test is the one that gives us the actual timeout and most closely resembles our production environment. There is also a MassTransit version of this test, which does not time out.

[image]
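For readers without the demo project at hand, the two publishing patterns being compared can be sketched roughly as follows. This is not the code from the repro repository: the class name, method names, and the `publish` delegate are hypothetical, and the delegate stands in for whatever call actually sends the message (for example a lambda wrapping Rebus's `IBus.Publish`).

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class PublishPatterns
{
    // "In parallel": every publish is started before any of them is awaited,
    // so all of them compete for channels and threads at the same time.
    public static Task PublishInParallel(int count, Func<int, Task> publish) =>
        Task.WhenAll(Enumerable.Range(0, count).Select(publish));

    // "In a for loop": each publish is awaited before the next one starts,
    // so at most one publish is in flight at any moment.
    public static async Task PublishInLoop(int count, Func<int, Task> publish)
    {
        for (var i = 0; i < count; i++)
        {
            await publish(i);
        }
    }
}
```

With the parallel variant, the number of simultaneous sends equals `count`, which is the situation the burst of HTTP calls creates in production.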

simongullberg commented 6 months ago

I'm not sure that this is the issue here, but we saw a performance improvement when we raised the minimum thread count of the .NET thread pool above its default. You can read more about SetMinThreads here: https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-8.0

The underlying RabbitMQ.Client library used by Rebus.RabbitMq runs on the .NET thread pool, so it is up to you to make sure you have enough threads to handle your load. RabbitMQ.Client is also synchronous and blocking, so threads just sit waiting while doing I/O.

The MaxWriterPoolSize setting might also come into play here. Maybe you should set it to something higher than the default value? https://github.com/rebus-org/Rebus.RabbitMq/blob/c4afc55891128aded5f61bb8a4c5c40bdb6e6aa1/Rebus.RabbitMq/Config/RabbitMqOptionsBuilder.cs#L234C12-L234C35
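As a minimal sketch of the thread-pool suggestion: the program below queues work items that each block a pool thread, as a stand-in for a synchronous broker call, and optionally raises the minimum worker-thread count first. The class name, the helper names, the 100 work items, and the 200 ms block are illustrative assumptions, not code from Rebus or the repro project. Only `ThreadPool.GetMinThreads` and `ThreadPool.SetMinThreads` are the real .NET APIs being demonstrated.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo class, for illustration only.
public static class ThreadPoolSketch
{
    // Raises the minimum worker-thread count and leaves the I/O
    // completion-port minimum unchanged. Returns false if the runtime
    // rejects the value (for example, if it exceeds the maximum).
    public static bool RaiseMinWorkerThreads(int minWorkerThreads)
    {
        ThreadPool.GetMinThreads(out _, out int minIoThreads);
        return ThreadPool.SetMinThreads(minWorkerThreads, minIoThreads);
    }

    // Queues `count` work items that each block a pool thread for
    // `blockFor`, and returns how long it took for all of them to finish.
    public static async Task<TimeSpan> RunBlockingWorkAsync(int count, TimeSpan blockFor)
    {
        var stopwatch = Stopwatch.StartNew();
        await Task.WhenAll(Enumerable.Range(0, count)
            .Select(_ => Task.Run(() => Thread.Sleep(blockFor))));
        return stopwatch.Elapsed;
    }

    public static async Task Main(string[] args)
    {
        // Optional first argument: the minimum worker-thread count to set.
        if (args.Length > 0 && int.TryParse(args[0], out int minWorkerThreads))
        {
            bool accepted = RaiseMinWorkerThreads(minWorkerThreads);
            Console.WriteLine($"SetMinThreads({minWorkerThreads}) accepted: {accepted}");
        }

        var elapsed = await RunBlockingWorkAsync(100, TimeSpan.FromMilliseconds(200));
        Console.WriteLine($"100 blocking work items took {elapsed.TotalMilliseconds:F0} ms");
    }
}
```

Run it once without arguments and once with an argument such as 200, in separate processes, to compare elapsed times. A pool that has already grown keeps its extra threads for a while, so measuring both cases in one process would skew the second result. A suitable minimum depends on the workload: raising it trades memory and scheduling overhead for lower latency under bursts of blocking work.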

Dylan-DutchAndBold commented 6 months ago

> I'm not sure that this is the issue here but we experienced a performance improvement when configuring minimum threads on the .NET Threadpool to a higher value than default. You can read more about SetMinThreads here. https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-8.0.
>
> The underlying RabbitMQ.Client library that is used in Rebus.RabbitMq is using the .NET Threadpool so it is up to you to make sure you have enough threads to handle your load. RabbitMQ.Client is also sync and blocking so threads are just waiting when doing I/O.
>
> Also, the setting MaxWriterPoolSize might also come in to play here. Maybe you should set it to something more than the default value? https://github.com/rebus-org/Rebus.RabbitMq/blob/c4afc55891128aded5f61bb8a4c5c40bdb6e6aa1/Rebus.RabbitMq/Config/RabbitMqOptionsBuilder.cs#L234C12-L234C35

Thanks for the suggestion, Simon, it's really appreciated. I tried doubling the defaults in the demo project, and it still fails with timeouts.

Also, since MassTransit.RabbitMQ uses the same RabbitMQ client library with its defaults and yet performs so differently from Rebus.RabbitMq, I do think (with some doubt) that the problem lies within Rebus.