rabbitmq / rabbitmq-dotnet-client

RabbitMQ .NET client for .NET Standard 2.0+ and .NET 4.6.2+
https://www.rabbitmq.com/dotnet.html
Other
2.05k stars 575 forks source link

Consider re-trying opening connections in a specific scenario #1448

Closed lukebakken closed 6 months ago

lukebakken commented 6 months ago

Is your feature request related to a problem? Please describe.

Frequently, on Windows using .NET Framework 4.7.2, GitHub actions reports a System.Net.Sockets.SocketException. RabbitMQ reports that the "client closed the connection unexpectedly", so we can be reasonably sure that RabbitMQ did not close the connection. This doesn't eliminate the chance that this is a bug within Erlang/OTP on Windows, but the fact that this issue never occurs with .NET core on Windows contradicts that possibility.

  Passed Test.Integration.TestInitialConnection.TestBasicConnectionRecoveryWithHostnameListAndUnreachableHosts [12 s]
  Standard Output Messages:
 WARNING: CreateConnectionWithRetries retrying (1), caught exception: System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
    at System.Net.Sockets.Socket.BeginSend(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, AsyncCallback callback, Object state)
    at System.Net.Sockets.NetworkStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state)

In #1447, I added re-tries to the test suite for this specific exception, and the first re-try usually succeeds. Internet searches for this specific exception usually point to a firewall issue, but of course this isn't the case on GHA because the test suite and RabbitMQ are running on the same host.

Other info:

Should this specific re-try be done in the library itself?

cc @paulomorgado @danielmarbach @bording

Describe the solution you'd like

Consider re-trying in this specific case.

Additional context

Full stacktrace from this workflow run

Full stacktrace ``` 2023-12-18T22:25:47.5009472Z [xUnit.net 00:00:25.31] Test.Integration.TestInitialConnection.TestBasicConnectionRecoveryWithHostnameList [FAIL] 2023-12-18T22:25:47.5017063Z [xUnit.net 00:00:25.31] RabbitMQ.Client.Exceptions.BrokerUnreachableException : None of the specified endpoints were reachable 2023-12-18T22:25:47.5019058Z [xUnit.net 00:00:25.31] ---- System.IO.IOException : Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. 2023-12-18T22:25:47.5020530Z [xUnit.net 00:00:25.31] -------- System.Net.Sockets.SocketException : An existing connection was forcibly closed by the remote host 2023-12-18T22:25:47.5031137Z [xUnit.net 00:00:25.31] Stack Trace: 2023-12-18T22:25:47.5044600Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\api\ConnectionFactory.cs(692,0): at RabbitMQ.Client.ConnectionFactory.CreateConnection(IEndpointResolver endpointResolver, String clientProvidedName) 2023-12-18T22:25:47.5060098Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\api\ConnectionFactory.cs(537,0): at RabbitMQ.Client.ConnectionFactory.CreateConnection(IEnumerable`1 hostnames, String clientProvidedName) 2023-12-18T22:25:47.5064104Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\api\ConnectionFactory.cs(490,0): at RabbitMQ.Client.ConnectionFactory.CreateConnection(IEnumerable`1 hostnames) 2023-12-18T22:25:47.5068719Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\Test\Common\IntegrationFixtureBase.cs(214,0): at Test.IntegrationFixtureBase.CreateAutorecoveringConnection(IEnumerable`1 hostnames, TimeSpan requestedConnectionTimeout, TimeSpan networkRecoveryInterval) 2023-12-18T22:25:47.5073077Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\Test\Common\IntegrationFixtureBase.cs(202,0): at Test.IntegrationFixtureBase.CreateAutorecoveringConnection(IList`1 hostnames) 2023-12-18T22:25:47.5076097Z [xUnit.net 00:00:25.31] TestInitialConnection.cs(49,0): at Test.Integration.TestInitialConnection.TestBasicConnectionRecoveryWithHostnameList() 2023-12-18T22:25:47.5077958Z [xUnit.net 00:00:25.31] ----- Inner Stack Trace ----- 2023-12-18T22:25:47.5079553Z [xUnit.net 00:00:25.31] at System.Net.Sockets.NetworkStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state) 2023-12-18T22:25:47.5081906Z [xUnit.net 00:00:25.31] at System.IO.Stream.<>c.b__53_0(Stream stream, ReadWriteParameters args, AsyncCallback callback, Object state) 2023-12-18T22:25:47.5084704Z [xUnit.net 00:00:25.31] at System.Threading.Tasks.TaskFactory`1.FromAsyncTrim[TInstance,TArgs](TInstance thisRef, TArgs args, Func`5 beginMethod, Func`3 endMethod) 2023-12-18T22:25:47.5086740Z [xUnit.net 00:00:25.31] at System.IO.Stream.BeginEndWriteAsync(Byte[] buffer, Int32 offset, Int32 count) 2023-12-18T22:25:47.5088563Z [xUnit.net 00:00:25.31] at System.IO.Stream.WriteAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken) 2023-12-18T22:25:47.5090723Z [xUnit.net 00:00:25.31] at System.IO.Pipelines.StreamExtensions.WriteAsync(Stream stream, ReadOnlyMemory`1 buffer, CancellationToken cancellationToken) 2023-12-18T22:25:47.5092747Z [xUnit.net 00:00:25.31] at System.IO.Pipelines.StreamPipeWriter.d__39.MoveNext() 2023-12-18T22:25:47.5094673Z [xUnit.net 00:00:25.31] --- End of stack trace from previous location where exception was thrown --- 2023-12-18T22:25:47.5096138Z [xUnit.net 00:00:25.31] at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 2023-12-18T22:25:47.5097980Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 2023-12-18T22:25:47.5100142Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1.ConfiguredValueTaskAwaiter.GetResult() 2023-12-18T22:25:47.5103295Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\SocketFrameHandler.cs(293,0): at RabbitMQ.Client.Impl.SocketFrameHandler.d__37.MoveNext() 2023-12-18T22:25:47.5106019Z [xUnit.net 00:00:25.31] --- End of stack trace from previous location where exception was thrown --- 2023-12-18T22:25:47.5107553Z [xUnit.net 00:00:25.31] at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 2023-12-18T22:25:47.5109412Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 2023-12-18T22:25:47.5111575Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult() 2023-12-18T22:25:47.5114684Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\Connection.Commands.cs(86,0): at RabbitMQ.Client.Framing.Impl.Connection.d__4.MoveNext() 2023-12-18T22:25:47.5116920Z [xUnit.net 00:00:25.31] --- End of stack trace from previous location where exception was thrown --- 2023-12-18T22:25:47.5117765Z [xUnit.net 00:00:25.31] at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 2023-12-18T22:25:47.5118966Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 2023-12-18T22:25:47.5120196Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable.ConfiguredValueTaskAwaiter.GetResult() 2023-12-18T22:25:47.5121876Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\Connection.cs(233,0): at RabbitMQ.Client.Framing.Impl.Connection.d__89.MoveNext() 2023-12-18T22:25:47.5124324Z [xUnit.net 00:00:25.31] --- End of stack trace from previous location where exception was thrown --- 2023-12-18T22:25:47.5125660Z [xUnit.net 00:00:25.31] at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 2023-12-18T22:25:47.5129614Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\Connection.cs(252,0): at RabbitMQ.Client.Framing.Impl.Connection.d__89.MoveNext() 2023-12-18T22:25:47.5143053Z [xUnit.net 00:00:25.31] --- End of stack trace from previous location where exception was thrown --- 2023-12-18T22:25:47.5150051Z [xUnit.net 00:00:25.31] at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 2023-12-18T22:25:47.5153551Z [xUnit.net 00:00:25.31] at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 2023-12-18T22:25:47.5156030Z [xUnit.net 00:00:25.31] at System.Threading.Tasks.ValueTask`1.get_Result() 2023-12-18T22:25:47.5157989Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\TaskExtensions.cs(158,0): at RabbitMQ.Client.TaskExtensions.EnsureCompleted[T](ValueTask`1 task) 2023-12-18T22:25:47.5160323Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\Connection.cs(217,0): at RabbitMQ.Client.Framing.Impl.Connection.Open() 2023-12-18T22:25:47.5164866Z [xUnit.net 00:00:25.31] D:\a\rabbitmq-dotnet-client\rabbitmq-dotnet-client\projects\RabbitMQ.Client\client\impl\AutorecoveringConnection.cs(72,0): at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.Open() ```
paulomorgado commented 6 months ago

That would be nice to have, but I think the user should be in control of that.

Maybe some events to allow the user to act on it would be nice.

danielmarbach commented 6 months ago

When it comes to the generic topic of recovering connections I do believe having some convenience in the lib is a good thing.

I do question though tackling it from the perspective of this problem here. Sockets in Net Framework still heavily uses the old TPM Begin and End pattern and as such dispatches probably quite a bit over the worker thread pool if I'm not mistaken. Starting from NET 5 socket has undergone quite a few improvements in various areas including Net 6 ramping up improving async support. My hunch is that action runners suffer from noisy neighbors affects on the shared infrastructure host underneath which can lead to significant delays which might delay the execution of the begin/end pairs significantly to the point of making things fail.

Or to quote Andrey Akinshin

Two subsequent builds on the same revision can have ranges of 1.5..2 seconds and 12..36 seconds. CPU-bound benchmarks are much more stable than Memory/Disk-bound benchmarks, but the “average” performance levels still can be up to three times different across builds.

Based on this brief study, I do not recommend using the default GitHub Actions build agent pool for any kind of performance comparisons across multiple builds: such results can not be trusted in the general case. If you want to get a reliable set of performance tests, it’s better to have a dedicated pool of physical build agents with a unified hardware/software configuration using carefully prepared OS images.

Performance stability of GitHub Actions

I know this is about benchmarking but the same unreliability also applies to concurrency here too I would say.

Combined with the fact that 4.7. 2 is not the most current, net framework is a dying platform and we haven't seen it on NET I would argue the energy should be spent on other parts of the library.

lukebakken commented 6 months ago

Thanks everyone for your input!