Closed oleggolovkov closed 7 months ago
Could it be that the call that times out blocks the thread forever? Because that is really the only situation I can imagine that would have this effect....
I don't have a minimal repro without external dependencies to figure it out but the saga step call that I'm doing (and that is never complete due to breakpoint) is actually invoking a grain using client from Microsoft Orleans like this
await _clusterClient.GetGrain<IMyGrain>(walletId).CallSomeMethod()
I don't know what exactly happens in that case in all details but I would very much expect some thread blocking in this case because Orleans has single-threaded execution guarantees all over the place. However I haven't noticed any signs of thread starvation in Orleans itself in similar timeout cases (been using it for like 5 years).
Also I did not mention this in my initial description but when no timeouts happen the system operates without any issues with Rebus handling all incoming messages as expected
Hi @oleggolovkov , did you figure out why it looked like the semaphore was not released when you had timeouts?
no, I was planning to send a PR with a fix to Rebus since I would expect this usage pattern to not cause any issues here since it works fine in controllers, minimal apis, console apps etc, however I was not able to quickly understand how and why this is happening, I was also not able to figure out the fix to https://github.com/rebus-org/Rebus.MySql/issues/27 and since my use case only actually needed a small subset of features I just reimplemented it myself
Ok 🙂 I am not convinced that there actually is an issue here. The line responsible for releasing the semaphore reads
using (parallelOperation)
where parallelOperation
is an IDisposable
that releases it in its Dispose
method. The only way I can think of where the semaphore would not be released, would be if the operation executed took forever (i.e. never returned). Please don't do that (and please provide appropriate timeouts when calling out to external things/implementing retries/etc)
I am using sagas with Mysql storage and transport with default concurrency setup (1 worker, 5 max parallelism), one of the steps of the saga is calling another service which never responds (e.g. has a breakpoint set or just does Thread.Sleep), after 30 seconds the call times out and is retried by Rebus, however after
Unhandled exception 2 while handling message with ID
the retries stop and it looks like nothing is going on. I then stopped the service running Rebus and saw this messageNot all async tasks were able to finish within given timeout of 60 seconds!
I found the conditions for this message in the code, took a memory dump of my service and inspected the ParallelOperationsManager:Indeed the
m_currentCount
is0
(healthy condition is when it is equal tom_maxCount
) which I believe causes the service to stall and not be able to process anything or shutdown correctly