rabbitmq / rabbitmq-stream-dotnet-client

RabbitMQ client for the stream protocol
https://rabbitmq.github.io/rabbitmq-stream-dotnet-client/stable/htmlsingle/index.html
Other
122 stars 41 forks source link

Unhandled exception (Operation is not valid due to the current state of the object in RabbitMQ.Stream.Client/Client.cs:line 976) #384

Closed mbaillargeon-ubi closed 5 months ago

mbaillargeon-ubi commented 5 months ago

Describe the bug

Hi, We are experiencing crashes on our servers (windows / net6) which internally use RabbitMQ.Stream.Client. It occurs when multiple clients are connecting at the same time. Here is the callstack we get in the windows event logs

CoreCLR Version: 6.0.3124.26714
.NET Version: 6.0.31
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AggregateException: One or more errors occurred. (Operation is not valid due to the current state of the object.)
 ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SignalCompletion()
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SetException(Exception error)
   at RabbitMQ.Stream.Client.ManualResetValueTaskSource`1.SetException(Exception error) in /_/RabbitMQ.Stream.Client/Client.cs:line 976
   at RabbitMQ.Stream.Client.Client.<>c__56`2.<Request>b__56_0(Object valueTaskSource) in /_/RabbitMQ.Stream.Client/Client.cs:line 500
   at System.Threading.CancellationTokenSource.Invoke(Delegate d, Object state, CancellationTokenSource source)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()

This crash is with version 1.8.6 but we initially had it with 1.7.4. I tried updating the package, but no luck it still crashes.

From what I could debug, the following is occurring. In the following code (from client.cs), the callback that gets called on timeout to do the SetException is called by a system timer (as we can see in the callstack of the crash). When it gets called, the task seems to sometime be in the completed state (I guess the task just completed). The SetException call is internally calling SetException on a ManualResetValueTaskSourceCore which throws an InvalidOperationException if the task is completed. Since the exception is not catched it makes our process crash.

await using (cts.Token.Register(
                             valueTaskSource =>
                             {
                                ((ManualResetValueTaskSource<TOut>)valueTaskSource).SetException(
                                    new TimeoutException());
                             }, tcs).ConfigureAwait(false))
            {
                var valueTask = new ValueTask<TOut>(tcs, tcs.Version);
                var result = await valueTask.ConfigureAwait(false);
                PooledTaskSource<TOut>.Return(tcs);
                return result;
            }

I tried the following code in ManualResetValueTaskSource to validate the status before doing the call to SetException and I no longer reproduce the crash. If you want I could do a pull request with that code.

public void SetException(Exception error)
{
   if (_logic.GetStatus(_logic.Version) == ValueTaskSourceStatus.Pending)
     _logic.SetException(error);
 }

Reproduction steps

  1. Have multiple clients being created at the same time and seeking in the stream ...

Expected behavior

The exception should either not occur or be handled since it occurs in a different thread and the users of the library cannot catch

Additional context

To give some context we have 2 servers load balanced. Whenever a client application connects to our server we create a client to stream rabbitmq messages exchanged on a stream. Whenever we shutdown one of the servers (to perform an update for instance), all the clients switch to the other server which will create a client and stream back to the last received message from the stream. This crash occurs when we shutdown one of our server that has lots of clients (about 100). So we get about 100 clients connecting to the other server at the same time and connecting to the stream.

Gsantomaggio commented 5 months ago

Hi @mbaillargeon-ubi, The clients receive the timeout because the server is handling a lot of connections/streams at the same time. You should also check the server logs to see if there is some timeout or error.

In this case, you should add some random sleep before connecting all the clients. It will prevent all the clients from connecting at the same time.

Gsantomaggio commented 5 months ago

Btw pr with you change is welcome 😀!

Gsantomaggio commented 5 months ago

@mbaillargeon-ubi Can you please check https://github.com/rabbitmq/rabbitmq-stream-dotnet-client/pull/385 ?

mbaillargeon-ubi commented 5 months ago

Hi @Gsantomaggio . Thanks for #385, I was going to create a pull request today :) . I like that you also exposed a way to configure the timeout. It looks good to me. I will be waiting for an official release with this fix to try it out. Until then, I will try your suggestion to limit the number of clients connecting back to the stream at the same time to see if it reduces the occurences of timeouts or of that actual crash.

Gsantomaggio commented 5 months ago

@mbaillargeon-ubi FYI: https://www.nuget.org/packages/RabbitMQ.Stream.Client/1.8.7 Please let me know

mbaillargeon-ubi commented 5 months ago

I tested with the package 1.8.7 and I could not repro the crash. I will go ahead and update our production servers with this version. Thanks for the quick release