michac opened 1 year ago
This client is moving towards having a more, and eventually completely, asynchronous implementation of all key methods.
See #970, #1308, #1318, #1347.
That is fair enough, and as I said above I do have a workaround. This issue does seem to have been cropping up in the wild since 6.1, though (see #1078, #860, #1127), and I hope you'd at least consider a setting like "DedicatedWriteLoop" as a patch to 6.5, since the change could be isolated to a single file and turned off by default, then marked obsolete for 7.0. Of course, if you're not planning any minor releases before 7.0.0, it's a moot point.
Also, obligatory thank you for the work you all put in on this library!
I hope you'd at least consider a setting like "DedicatedWriteLoop" as a patch to 6.5 since the change could be isolated to a single file and turned off by default
Yes, please feel free to open a PR with this change.
@michac - I plan on releasing version 6.6.0 this month if you'd like to prepare a PR in time for me to review and test it. Thanks.
I have refactored my project to no longer create and initialize hundreds of channels simultaneously, so I don't personally need this change anymore. It's a nice addition in case anyone else runs across this problem, but it is a setting for an edge case that will have to be obsoleted and phased out in the future. I have submitted a pull request, but I definitely understand if you opt not to include it.
This is also an issue for us. We're using NServiceBus, and the NServiceBus.RabbitMQ project also creates more channels as concurrency increases. Point (c) from michac is totally spot on; this took three days to track down. Since the ThreadPool sets its minimum thread count low (8 in my testing), it only takes a relatively small amount of concurrency to cause ThreadPool starvation, which leads to timeouts in the RabbitMQ .NET Client and also blocks the ThreadPool for other async continuations. Basically, I really hope this fix gets released soon.
@brad-wright-512 -
Since the ThreadPool sets its minimum thread count low (8 in my testing), it only takes a relatively small amount of concurrency to cause ThreadPool starvation, which leads to timeouts in the RabbitMQ .NET Client and also blocks the ThreadPool for other async continuations. Basically, I really hope this fix gets released soon.
There is a workaround: increase the minimum size of the ThreadPool in your application.
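A minimal sketch of that workaround at application startup (the value 128 below is illustrative, not a recommendation; size it to the number of channels you expect to use concurrently):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Read the current minimums first so we never lower them.
        ThreadPool.GetMinThreads(out int workers, out int iocp);

        // Raise the worker-thread floor so blocking RabbitMQ calls cannot
        // exhaust the pool before it grows on its own (it injects extra
        // threads only slowly once past the minimum).
        int desired = Math.Max(workers, 128);
        bool ok = ThreadPool.SetMinThreads(desired, iocp);
        Console.WriteLine($"SetMinThreads({desired}, {iocp}) succeeded: {ok}");
    }
}
```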
@brad-wright-512 - what would help us immensely is if you can test @michac's enhancement in your environment - https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1392
@lukebakken I did a bit of testing and things are certainly a lot better. The blocking on the connection itself looks to be entirely gone. However, in my most intense test script I can still see issues due to ThreadPool starvation. It is only happening when trying to create new connections. I have some quick-and-dirty changes that fix the issue, but I'm struggling to format them neatly in a comment, and my company blocks GitHub access. I'll see about getting them somewhere accessible soon.
Until then you can reproduce the connection blocking with this script
@brad-wright-512 It's a shame GitHub is blocked. We really appreciate your efforts here!
I'm assuming that increasing the threadpool size in advance (i.e. the "usual workaround") does fix these issues in your tests... is that correct?
I think the test brad is running has diverged a little from the original issue. The problem I was trying to solve with the dedicated write loop was Rabbit operations artificially blocking thread pool threads: the request (a socket write) that would trigger the response they were blocking on couldn't be sent for lack of threads. Something the library needed to do was blocked by something else the library was doing, which warranted a library-level solution, like putting the write loop on a dedicated thread. In the test above, it looks like all the thread pool threads are being locked down before a Rabbit connection is even attempted, completely external to the library. In my opinion, whether the library should work smoothly when no thread pool threads are available for the full timeout window is a new question.
Well, there are still some places in the .NET RMQ client where sync and async code are mixed, and that causes the same sort of pressure on the thread pool. Admittedly, the circumstances where this occurs (creating new connections, recovering an existing connection, etc.) are significantly less likely than creating channels and publishing messages, but the problem does still exist. One interesting note: the write loop can still hang under thread pool contention even after these changes. To fix this, I had to make it truly fully synchronous:
private readonly AutoResetEvent _channelChangedEvent = new AutoResetEvent(false);

// ...

public void Write(ReadOnlyMemory<byte> memory)
{
    _channelWriter.TryWrite(memory);
    // Wake the write loop without requiring a thread pool thread.
    _channelChangedEvent.Set();
}

private void SynchronousWriteLoop()
{
    // Runs on a dedicated thread and never awaits, so it cannot be
    // starved by thread pool contention.
    while (WaitToRead(_channelReader, out ReadOnlyMemory<byte> memory))
    {
        // Block until the socket is writable (or the timeout elapses).
        _socket.Client.Poll(_writeableStateTimeoutMicroSeconds, SelectMode.SelectWrite);
        do
        {
            if (MemoryMarshal.TryGetArray(memory, out ArraySegment<byte> segment) &&
                segment.Array != null)
            {
                _writer.Write(segment.Array, segment.Offset, segment.Count);
                MemoryPool.Return(segment.Array);
            }
        }
        while (_channelReader.TryRead(out memory));
        _writer.Flush();
    }
}

private bool WaitToRead(ChannelReader<ReadOnlyMemory<byte>> reader, out ReadOnlyMemory<byte> memory)
{
    // Poll with a short wait rather than awaiting WaitToReadAsync: an
    // async continuation would be scheduled on the thread pool, which
    // is exactly what can stall. Checking TryRead before the completion
    // flag also ensures an item read just before completion is not lost.
    while (true)
    {
        if (reader.TryRead(out memory))
        {
            return true;
        }
        if (reader.Completion.IsCompleted)
        {
            memory = ReadOnlyMemory<byte>.Empty;
            return false;
        }
        _channelChangedEvent.WaitOne(TimeSpan.FromMilliseconds(50));
    }
}
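For context, a loop like this is meant to be started on a dedicated (non-pool) thread. A sketch of that startup, with a hypothetical host class and a stubbed loop body:

```csharp
using System.Threading;

// Hypothetical host class: starts the synchronous write loop on its own
// long-lived thread so it never competes with thread pool work.
class WriterHost
{
    private void SynchronousWriteLoop()
    {
        // Loop body as shown above; stubbed here for the sketch.
    }

    public void Start()
    {
        var writeThread = new Thread(SynchronousWriteLoop)
        {
            // Background so it does not keep the process alive on shutdown.
            IsBackground = true,
            Name = "RabbitMQ write loop"
        };
        writeThread.Start();
    }
}
```

A dedicated thread is immune to pool starvation by construction, at the cost of one extra thread per connection.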
To fix this, I had to make it truly fully synchronous.
So you're saying that PR #1392 is not complete? I can see the difference... I'm not thrilled about polling.
@brad-wright-512 - did you see my comment about the workaround to expand the size of the threadpool? It would be great to confirm that doing that solves your issues using the already-released 6.5.0 version of this library.
@lukebakken it's tricky. Yes, it does help, because increasing the number of threads makes the window where deadlock can occur smaller and smaller... but I wonder whether that itself comes at some cost. After a bit of fiddling, I managed to get my changes onto GitHub here: https://github.com/brad-wright-512/rabbitmq-dotnet-client/tree/Fix-thread-pool-contention
This shows all the places where mixing sync and async code can cause issues. (Although, as previously stated, the changes made by @michac do alleviate most of them.)
I am experiencing a similar problem related to thread pool contention and intermittent hangs in ASP.NET 4.8 applications. I do not have a reproduction at a smaller scale at the moment, apologies. Hoping to get some answers. Apart from the thread pool contention, it seems that a timer callback is capturing the ASP.NET context; there are multiple call stacks like the one below, holding on to megabytes of memory.
I am creating the connection and model synchronously and subscribing to events.
How can I avoid the context capture?
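One general mitigation in ASP.NET 4.8 (a sketch of the usual pattern, not specific to this client's internals) is to keep connection setup off the request context entirely, so no continuation or callback can capture AspNetSynchronizationContext:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs RabbitMQ setup on a pool thread where there
// is no ambient SynchronizationContext to capture.
static class RabbitSetup
{
    public static async Task InitializeAsync(Func<Task> connectAndSubscribe)
    {
        // Task.Run hops off the request thread; ConfigureAwait(false)
        // keeps this method's own continuation off the request context too.
        await Task.Run(connectAndSubscribe).ConfigureAwait(false);
    }
}
```

Library-style code that awaits internally should likewise use ConfigureAwait(false) throughout, so intermediate awaits don't re-capture the context.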
@marafiq please give the code in this PR a try - https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1392
(Also, note how I edited your comment to use <details> to make it much more readable.)
Describe the bug
CreateModel writes to the async write channel in the SocketFrameHandler, then blocks waiting for a response. If you queue up enough of these operations, they use up all the available threads in the thread pool. The thread pool will start provisioning more threads in response, but once you're over your minimum thread count it does so relatively slowly, and you can find yourself timing out before it has gotten around to processing the write queue.

I encountered this issue upgrading from 5.2 to 6.5. In my case I am using a library called Akka.NET to manage background jobs. I create an actor (a thread with a message pump) per job, and based on the documentation I figured I needed a channel per job. Over time, jobs accumulated in my test environment, and now when my app starts up it attempts to create over 100 channels simultaneously.

Note: this applies to all synchronous operations, not just CreateModel. I can reproduce it with QueueDeclare, for example.

In retrospect I would not have implemented my app this way, for several reasons, and in the medium term I plan to centralize my Rabbit interactions and not have so many threads interacting with the library directly. However, I do think this issue is worth fixing because:

a) I can see how someone could end up here.
b) The library is not being a good ThreadPool citizen by locking down threads and relying on thread pool headroom to get unlocked.
c) It's kind of a bear to figure out what's going on when you encounter the problem.
Reproduction steps
https://github.com/michac/RabbitThreadPoolTest
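The pattern in that repro can be boiled down to the following standalone sketch (all names are hypothetical; this does not touch RabbitMQ): many callers block a pool thread waiting for a result that can only be produced once another pool thread frees up, so past the pool's minimum thread count everything stalls until the pool slowly injects new threads.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Repro
{
    public static void Main()
    {
        ThreadPool.GetMinThreads(out int min, out _);
        Console.WriteLine($"min worker threads: {min}");

        // Simulate CreateModel: each call queues "server work" to the
        // pool, then blocks its own pool thread waiting on the result.
        var tasks = new Task[min * 2];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                var reply = Task.Run(() => 42); // stands in for the write loop
                // Blocking wait ties up this pool thread. With enough
                // concurrent callers, few threads remain to run `reply`,
                // and some waits may time out before the pool grows.
                if (!reply.Wait(TimeSpan.FromSeconds(2)))
                    Console.WriteLine("timed out (thread pool starved)");
            });
        }
        Task.WaitAll(tasks);
    }
}
```

Whether timeouts actually appear depends on the machine's minimum thread count and the pool's injection rate, which is why the symptom is so intermittent in real applications.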
Expected behavior
Operations like CreateModel are able to complete even if the ThreadPool is too busy to provide a second thread.
I have a few ideas for fixing the problem:
Additional context
Problems like this should go away once the library is async across the board (which sounds like the direction you are headed).