rebus-org / Rebus.SqlServer

:bus: Microsoft SQL Server transport and persistence for Rebus
https://mookid.dk/category/rebus
Other
43 stars 42 forks source link

The server failed to resume the transaction #34

Closed jezza323 closed 5 years ago

jezza323 commented 5 years ago

I have come across this issue 2 days in a row. The root cause seems to be a poorly managed network (we have other network issues with this customer) however I would like to be able to recover from the issue more gracefully.

It appears the SQL connection in use has been closed unexpectedly but left in an invalid state and Rebus is attempting to reuse it continually (the final exception below repeats indefinitely until I restart the process).

Do you have any advice on where to start? I can't see much in the way of connection cleanup in the source but I may have missed it

My Rebus config is as follows, using the default connection provider, using the 4.0.0 .Net 4.5 package

            container.ConfigureRebus(c => c
                .Logging(l => l.Serilog(log))
                .Transport(t => t.UseSqlServer(Configuration.KalibreMessageQueueConnection, Configuration.KalibreMessageQueueTable, Configuration.KalibreMessageQueueInputQueue))
                .Routing(r => r.TypeBased()
                    ...
                )
                .Subscriptions(s => s.StoreInSqlServer(Configuration.KalibreMessageQueueConnection, Configuration.KalibreMessageQueueSubscriptionTable))
                .Options(o =>
                {
                    o.EnableDataBus().StoreInSqlServer(Configuration.KalibreMessageQueueConnection, Configuration.KalibreMessageQueueDataTable);
                    o.SetNumberOfWorkers(int.Parse(System.Configuration.ConfigurationManager.AppSettings["BusWorkers"]));
                    o.SetMaxParallelism(int.Parse(System.Configuration.ConfigurationManager.AppSettings["BusWorkers"]));
                    o.SetBackoffTimes(TimeSpan.FromSeconds(1));
                })
                .Start());
2018-09-21 04:28:25.754 +09:30 [Error] System.AggregateException: One or more errors occurred. ---> System.IndexOutOfRangeException: address
   at System.Data.ProviderBase.FieldNameLookup.GetOrdinal(String fieldName)
   at System.Data.SqlClient.SqlDataReader.GetOrdinal(String name)
   at System.Data.SqlClient.SqlDataReader.get_Item(String name)
   at Rebus.SqlServer.Subscriptions.SqlServerSubscriptionStorage.<GetSubscriberAddresses>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Rebus.Bus.RebusBus.<InnerPublish>d__32.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at Kalibre.Dashboard.Shared.Handlers.EventsUpdatedHandler.<Handle>d__2.MoveNext()
---> (Inner Exception #0) System.IndexOutOfRangeException: address
   at System.Data.ProviderBase.FieldNameLookup.GetOrdinal(String fieldName)
   at System.Data.SqlClient.SqlDataReader.GetOrdinal(String name)
   at System.Data.SqlClient.SqlDataReader.get_Item(String name)
   at Rebus.SqlServer.Subscriptions.SqlServerSubscriptionStorage.<GetSubscriberAddresses>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Rebus.Bus.RebusBus.<InnerPublish>d__32.MoveNext()<---

2018-09-21 04:28:26.098 +09:30 [Warning] An error occurred when attempting to receive the next message: "System.IndexOutOfRangeException: headers
   at System.Data.ProviderBase.FieldNameLookup.GetOrdinal(String fieldName)
   at System.Data.SqlClient.SqlDataReader.GetOrdinal(String name)
   at System.Data.SqlClient.SqlDataReader.get_Item(String name)
   at Rebus.SqlServer.Transport.SqlServerTransport.<Receive>d__27.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Rebus.Workers.ThreadPoolBased.ThreadPoolWorker.<ReceiveTransportMessage>d__17.MoveNext()"

2018-09-21 04:28:34.313 +09:30 [Warning] An error occurred when attempting to receive the next message: "System.Data.SqlClient.SqlException (0x80131904): The server failed to resume the transaction. Desc:48000008ac.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, String endMethod)
   at System.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Rebus.SqlServer.Transport.SqlServerTransport.<Receive>d__27.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Rebus.Workers.ThreadPoolBased.ThreadPoolWorker.<ReceiveTransportMessage>d__17.MoveNext()
ClientConnectionId:7c4a02b7-95ed-4523-9028-3918bf7be1c6
Error Number:3971,State:1,Class:16"
mookid8000 commented 5 years ago

wow.... Rebus doesn't really do anything as such to try and detect sick connections, because I assumed that stuff like that would be handled by .NET's SQL connection pool

When you say

(...) [it] repeats indefinitely (...)

how long did you observe it repeat?

The default SQL connection pool size is 100, so maybe it "just" (*) needed to cycle through all 100 of them (if your application was busy up until this problem occurred) to have the connection pool renew the bad connections.

It would be awesome, of course, if it was possible to somehow help the connection pool recover in situations like this, but I am slightly worried that it could mess up other things. I assume that the connection pool was made by people much smarter than me, at least when it comes to SQL connection, so I would actually also have assumed that they handled flaky networks better than this.


(*) "just" as in: Very inconvenient, but still able to recover within a finite timeframe.

jezza323 commented 5 years ago

It was running about 4 hours before I was notified and restarted the app. There are over 1100 instances of the error in the log. Interestingly the same app was still processing other messages. It seemed only messages of some types were having problems.

Perhaps its more the issue that the connection died, and the transaction was still in scope in the app, so was attempting to be used across the newly supplied connection. Hopefully I will have time to go through the source more fully next week

mookid8000 commented 5 years ago

Hi @jezza323 , have you figured out if it was because you were somehow keeping the sick SQL connections in scope yourself?

mookid8000 commented 5 years ago

Hi again @jezza323 , what's the status of this?

mookid8000 commented 5 years ago

Hi @jezza323 , could you try and update to Rebus 5.0.1? It guards against accidentally re-using an already-completed transaction context, so it might help you solve your problem if you upgrade to that.

Closing this issue for now – please feel free to come back to it, if your problem was not solved.