rebus-org / Rebus

:bus: Simple and lean service bus implementation for .NET
https://mookid.dk/category/rebus
Other
2.26k stars 353 forks source link

_semaphore inside ParallelOperationsManager is not always released #1118

Closed oleggolovkov closed 7 months ago

oleggolovkov commented 9 months ago

I am using sagas with Mysql storage and transport with default concurrency setup (1 worker, 5 max parallelism), one of the steps of the saga is calling another service which never responds (e.g. has a breakpoint set or just does Thread.Sleep), after 30 seconds the call times out and is retried by Rebus, however after Unhandled exception 2 while handling message with ID the retries stop and it looks like nothing is going on. I then stopped the service running Rebus and saw this message Not all async tasks were able to finish within given timeout of 60 seconds! I found the conditions for this message in the code, took a memory dump of my service and inspected the ParallelOperationsManager:

rebus

Indeed the m_currentCount is 0 (healthy condition is when it is equal to m_maxCount) which I believe causes the service to stall and not be able to process anything or shutdown correctly

mookid8000 commented 9 months ago

Could it be that the call that times out blocks the thread forever? Because that is really the only situation I can imagine that would have this effect....

oleggolovkov commented 9 months ago

I don't have a minimal repro without external dependencies to figure it out but the saga step call that I'm doing (and that is never complete due to breakpoint) is actually invoking a grain using client from Microsoft Orleans like this await _clusterClient.GetGrain<IMyGrain>(walletId).CallSomeMethod() I don't know what exactly happens in that case in all details but I would very much expect some thread blocking in this case because Orleans has single-threaded execution guarantees all over the place. However I haven't noticed any signs of thread starvation in Orleans itself in similar timeout cases (been using it for like 5 years). Also I did not mention this in my initial description but when no timeouts happen the system operates without any issues with Rebus handling all incoming messages as expected

mookid8000 commented 8 months ago

Hi @oleggolovkov , did you figure out why it looked like the semaphore was not released when you had timeouts?

oleggolovkov commented 8 months ago

no, I was planning to send a PR with a fix to Rebus since I would expect this usage pattern to not cause any issues here since it works fine in controllers, minimal apis, console apps etc, however I was not able to quickly understand how and why this is happening, I was also not able to figure out the fix to https://github.com/rebus-org/Rebus.MySql/issues/27 and since my use case only actually needed a small subset of features I just reimplemented it myself

mookid8000 commented 7 months ago

Ok 🙂 I am not convinced that there actually is an issue here. The line responsible for releasing the semaphore reads

using (parallelOperation)

where parallelOperation is an IDisposable that releases it in its Dispose method. The only way I can think of where the semaphore would not be released, would be if the operation executed took forever (i.e. never returned). Please don't do that (and please provide appropriate timeouts when calling out to external things/implementing retries/etc)