michac opened 1 year ago
This client is moving towards having a more, and eventually completely, asynchronous implementation of all key methods.
See #970, #1308, #1318, #1347.
That is fair enough, and as I said above I do have a workaround. This issue does seem to have been cropping up in the wild since 6.1, though (see #1078, #860, #1127), and I hope you'd at least consider a setting like "DedicatedWriteLoop" as a patch to 6.5, since the change could be isolated to a single file and turned off by default, then marked obsolete for 7.0. Of course, if you're not planning any minor releases before 7.0.0, it's a moot point.
Also, obligatory thank you for the work you all put in on this library!
I hope you'd at least consider a setting like "DedicatedWriteLoop" as a patch to 6.5 since the change could be isolated to a single file and turned off by default
Yes, please feel free to open a PR with this change.
@michac - I plan on releasing version 6.6.0 this month if you'd like to prepare a PR in time for me to review and test it. Thanks.
I have refactored my project to no longer create and initialize hundreds of channels simultaneously, so I don't personally need this change anymore. It's a nice addition in case anyone else runs across this problem, but it is a setting for an edge case that will have to be obsoleted and phased out in the future. I have submitted a pull request, but I definitely understand if you opt not to include it.
This is also an issue for us. We're using NServiceBus, and the NServiceBus.RabbitMQ project also creates more channels as concurrency increases. Point (c) from michac is totally spot on; this took three days to track down. Since the ThreadPool sets its minimum thread count low (8 in my testing), it only takes a relatively small amount of concurrency to cause ThreadPool starvation, which leads to timeouts in the RabbitMQ .NET Client and also blocks the ThreadPool for other async continuations. Basically, I really hope this fix gets released soon.
@brad-wright-512 -
Since the ThreadPool sets its minimum thread count low (8 in my testing), it only takes a relatively small amount of concurrency to cause ThreadPool starvation, which leads to timeouts in the RabbitMQ .NET Client and also blocks the ThreadPool for other async continuations. Basically, I really hope this fix gets released soon.
There is a workaround: increase the minimum size of the ThreadPool in your application.
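A minimal sketch of that workaround at application startup (the value 128 below is illustrative, not a recommendation; size it to the number of channels you expect to use concurrently):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Read the current minimums first so we never lower them.
        ThreadPool.GetMinThreads(out int workers, out int iocp);

        // Raise the worker-thread floor so blocking RabbitMQ calls cannot
        // exhaust the pool before it grows on its own (it injects extra
        // threads only slowly once past the minimum).
        int desired = Math.Max(workers, 128);
        bool ok = ThreadPool.SetMinThreads(desired, iocp);
        Console.WriteLine($"SetMinThreads({desired}, {iocp}) succeeded: {ok}");
    }
}
```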
@brad-wright-512 - what would help us immensely is if you can test @michac's enhancement in your environment - https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1392
@lukebakken I did a bit of testing and things are certainly a lot better. The blocking on the connection itself looks to be entirely gone. However, in my most intense test script I can still see issues due to ThreadPool starvation. It is only happening when trying to create new connections. I have some quick-and-dirty changes that fix the issue, but I'm struggling to format them neatly in a comment, and my company blocks GitHub access. I'll see about getting them somewhere accessible soon.
Until then you can reproduce the connection blocking with this script
@brad-wright-512 It's a shame GitHub is blocked. We really appreciate your efforts here!
I'm assuming that increasing the threadpool size in advance (i.e. the "usual workaround") does fix these issues in your tests... is that correct?
I think the test brad is running has diverged a little from the original issue. The problem I was trying to solve with the dedicated write loop was Rabbit operations artificially blocking thread pool threads: the request (a socket write) that would trigger the response they were blocking on couldn't be sent for lack of threads. Something the library needed to do was blocked by something else the library was doing, which warranted a library-level solution, like putting the write loop on a dedicated thread. In the test above, it looks like all the thread pool threads are being locked down before a Rabbit connection is even attempted, completely external to the library. In my opinion, whether the library should work smoothly when no thread pool threads are available for the full timeout window is a new question.
Well, there are still some places in the .NET RMQ client where sync and async code are mixed, and that causes the same sort of pressure on the thread pool. Admittedly, the circumstances where this occurs (creating new connections, recovering an existing connection, etc.) are significantly less likely than creating channels and publishing messages, but the problem does still exist. One interesting note: the write loop can still hang under thread pool contention even after these changes. To fix this, I had to make it truly fully synchronous:
private readonly AutoResetEvent _channelChangedEvent = new AutoResetEvent(false);

// ...

public void Write(ReadOnlyMemory<byte> memory)
{
    _channelWriter.TryWrite(memory);
    // Wake the write loop without requiring a thread pool thread.
    _channelChangedEvent.Set();
}

private void SynchronousWriteLoop()
{
    // Runs on a dedicated thread and never awaits, so it cannot be
    // starved by thread pool contention.
    while (WaitToRead(_channelReader, out ReadOnlyMemory<byte> memory))
    {
        // Block until the socket is writable (or the timeout elapses).
        _socket.Client.Poll(_writeableStateTimeoutMicroSeconds, SelectMode.SelectWrite);
        do
        {
            if (MemoryMarshal.TryGetArray(memory, out ArraySegment<byte> segment) &&
                segment.Array != null)
            {
                _writer.Write(segment.Array, segment.Offset, segment.Count);
                MemoryPool.Return(segment.Array);
            }
        }
        while (_channelReader.TryRead(out memory));
        _writer.Flush();
    }
}

private bool WaitToRead(ChannelReader<ReadOnlyMemory<byte>> reader, out ReadOnlyMemory<byte> memory)
{
    // Poll with a short wait rather than awaiting WaitToReadAsync: an
    // async continuation would be scheduled on the thread pool, which
    // is exactly what can stall. Checking TryRead before the completion
    // flag also ensures an item read just before completion is not lost.
    while (true)
    {
        if (reader.TryRead(out memory))
        {
            return true;
        }
        if (reader.Completion.IsCompleted)
        {
            memory = ReadOnlyMemory<byte>.Empty;
            return false;
        }
        _channelChangedEvent.WaitOne(TimeSpan.FromMilliseconds(50));
    }
}
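For context, a loop like this is meant to be started on a dedicated (non-pool) thread. A sketch of that startup, with a hypothetical host class and a stubbed loop body:

```csharp
using System.Threading;

// Hypothetical host class: starts the synchronous write loop on its own
// long-lived thread so it never competes with thread pool work.
class WriterHost
{
    private void SynchronousWriteLoop()
    {
        // Loop body as shown above; stubbed here for the sketch.
    }

    public void Start()
    {
        var writeThread = new Thread(SynchronousWriteLoop)
        {
            // Background so it does not keep the process alive on shutdown.
            IsBackground = true,
            Name = "RabbitMQ write loop"
        };
        writeThread.Start();
    }
}
```

A dedicated thread is immune to pool starvation by construction, at the cost of one extra thread per connection.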
To fix this, I had to make it truly fully synchronous.
So you're saying that PR #1392 is not complete? I can see the difference... I'm not thrilled about polling.
@brad-wright-512 - did you see my comment about the workaround to expand the size of the threadpool? It would be great to confirm that doing that solves your issues using the already-released 6.5.0 version of this library.
@lukebakken it's tricky. Yes, it does help, because increasing the number of threads makes the window where deadlock can occur smaller and smaller... but I wonder whether that itself comes at some cost. After a bit of fiddling, I managed to get my changes onto GitHub here: https://github.com/brad-wright-512/rabbitmq-dotnet-client/tree/Fix-thread-pool-contention
This shows all the places where mixing sync and async code can cause issues. (Although, as previously stated, the changes made by @michac do alleviate most of them.)
I am experiencing a similar problem related to thread pool contention and intermittent hangs in ASP.NET 4.8 applications. I do not have a reproduction at a smaller scale at the moment, apologies. Hoping to get some answers. Apart from the thread pool contention, it seems that a timer callback is capturing the ASP.NET context; there are multiple call stacks like the one below, holding on to megabytes of memory.
I am creating the connection and model synchronously and subscribing to events.
How can I avoid the context capture?
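One general mitigation in ASP.NET 4.8 (a sketch of the usual pattern, not specific to this client's internals) is to keep connection setup off the request context entirely, so no continuation or callback can capture AspNetSynchronizationContext:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs RabbitMQ setup on a pool thread where there
// is no ambient SynchronizationContext to capture.
static class RabbitSetup
{
    public static async Task InitializeAsync(Func<Task> connectAndSubscribe)
    {
        // Task.Run hops off the request thread; ConfigureAwait(false)
        // keeps this method's own continuation off the request context too.
        await Task.Run(connectAndSubscribe).ConfigureAwait(false);
    }
}
```

Library-style code that awaits internally should likewise use ConfigureAwait(false) throughout, so intermediate awaits don't re-capture the context.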
@marafiq please give the code in this PR a try - https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1392
(Also, note how I edited your comment to use <details> to make it much more readable.)
Describe the bug
CreateModel writes to the async write channel in the SocketFrameHandler, then blocks waiting for a response. If you queue up enough of these operations, they use up all the available threads in the thread pool. The thread pool will start provisioning more threads in response, but once you're over your minimum thread count it does so relatively slowly, and you can find yourself timing out before it has gotten around to processing the write queue.

I encountered this issue upgrading from 5.2 to 6.5. In my case I am using a library called Akka.NET to manage background jobs. I create an actor (a thread with a message pump) per job, and based on the documentation I figured I needed a channel per job. Over time, jobs accumulated in my test environment, and now when my app starts up it attempts to create over 100 channels simultaneously.

Note: this applies to all synchronous operations, not just CreateModel. I can reproduce it with QueueDeclare, for example.

In retrospect I would not have implemented my app this way, for several reasons, and in the medium term I plan to centralize my Rabbit interactions and not have so many threads interacting with the library directly. However, I do think this issue is worth fixing because:

a) I can see how someone could end up here.
b) The library is not being a good ThreadPool citizen by locking down threads and relying on thread pool headroom to get unlocked.
c) It's kind of a bear to figure out what's going on when you encounter the problem.
Reproduction steps
https://github.com/michac/RabbitThreadPoolTest
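The pattern in that repro can be boiled down to the following standalone sketch (all names are hypothetical; this does not touch RabbitMQ): many callers block a pool thread waiting for a result that can only be produced once another pool thread frees up, so past the pool's minimum thread count everything stalls until the pool slowly injects new threads.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Repro
{
    public static void Main()
    {
        ThreadPool.GetMinThreads(out int min, out _);
        Console.WriteLine($"min worker threads: {min}");

        // Simulate CreateModel: each call queues "server work" to the
        // pool, then blocks its own pool thread waiting on the result.
        var tasks = new Task[min * 2];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                var reply = Task.Run(() => 42); // stands in for the write loop
                // Blocking wait ties up this pool thread. With enough
                // concurrent callers, few threads remain to run `reply`,
                // and some waits may time out before the pool grows.
                if (!reply.Wait(TimeSpan.FromSeconds(2)))
                    Console.WriteLine("timed out (thread pool starved)");
            });
        }
        Task.WaitAll(tasks);
    }
}
```

Whether timeouts actually appear depends on the machine's minimum thread count and the pool's injection rate, which is why the symptom is so intermittent in real applications.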
Expected behavior
Operations like CreateModel are able to complete even if the ThreadPool is too busy to provide a second thread.
I have a few ideas for fixing the problem:
Additional context
Problems like this should go away once the library is async across the board (which sounds like the direction you are headed).