microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework
MIT License

Execution slowed by deadlocks when using slow Kubernetes cluster #17

Closed cgillum closed 3 years ago

cgillum commented 3 years ago

The following exceptions were seen in the Docker logs when running a 100-orchestration "HelloSequences" test on a Kubernetes cluster.

warn: DurableTask.SqlServer[308]
      20210413-111157-0000000000000019: A transient database failure occurred and will be retried. Current retry count: 0. Details: Microsoft.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 102) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
         at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
         at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
         at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
         at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader(Boolean isInternal, Boolean forDescribeParameterEncryption)
         at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
         at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
         at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
         at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
      --- End of stack trace from previous location where exception was thrown ---
         at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries) in /durabletask-mssql/src/DurableTask.SqlServer/SqlUtils.cs:line 420
      ClientConnectionId:1bfd1928-097b-4daf-8bff-59a0c90cd87c
      Error Number:1205,State:51,Class:13.
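
For readers unfamiliar with the pattern the warning describes ("A transient database failure occurred and will be retried"), here is a minimal sketch of retrying an operation when it is chosen as a deadlock victim. It is not the `SqlUtils.WithRetry` implementation from this repo. `TransientDbError`, `with_retry`, `make_flaky_operation`, and the backoff parameters are all hypothetical names and values chosen for illustration. Only the error number 1205 comes from the log above.

```python
import random
import time

DEADLOCK_ERROR_NUMBER = 1205  # SQL Server error number shown in the log above


class TransientDbError(Exception):
    """Hypothetical stand-in for a driver exception that carries an error number."""

    def __init__(self, number, message):
        super().__init__(message)
        self.number = number


def with_retry(func, max_retries=5, base_delay=0.01, sleep=time.sleep):
    """Call func(), retrying with jittered exponential backoff on deadlock errors.

    Assumption: only error 1205 is treated as retryable in this sketch.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientDbError as exc:
            # Anything that is not a deadlock, or the final attempt, propagates.
            if exc.number != DEADLOCK_ERROR_NUMBER or attempt == max_retries:
                raise
            # Jitter keeps competing replicas from retrying in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)


def make_flaky_operation(failures):
    """Return a callable that raises a deadlock error `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientDbError(DEADLOCK_ERROR_NUMBER, "chosen as deadlock victim")
        return state["calls"]  # number of calls it took to succeed

    return operation
```

Retries like this hide deadlocks from callers but do not remove their cost: each retry repeats the whole transaction, which is consistent with the slowdown described in the issue title.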

Deadlocks are not normally seen when running on fast hardware, which may be why this was missed during local testing. Apps in this cluster were running slowly, which caused a variety of other issues, and that slowness is likely a contributing factor here. The app was also scaled out to 5 replicas using KEDA.

Even on slow clusters, the sprocs should be designed so that simple scenarios like this one do not result in deadlocks.
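
As a general illustration of what "designed so that deadlocks do not occur" can mean, one widely used technique is to always acquire locks in the same global order. The sketch below uses in-process threads rather than SQL, and `Row`, `update_both`, and `run_workers` are hypothetical names. This issue does not say which approach the sprocs actually take, so treat this as an analogy only.

```python
import threading


class Row:
    """Hypothetical stand-in for a lockable resource such as a table row."""

    def __init__(self, key):
        self.key = key
        self.lock = threading.Lock()
        self.value = 0


def update_both(first, second):
    """Update two rows while holding both locks, acquired in key order."""
    # Sorting by key gives every caller the same acquisition order, so two
    # workers can never each hold one lock while waiting on the other.
    low, high = sorted((first, second), key=lambda row: row.key)
    with low.lock:
        with high.lock:
            first.value += 1
            second.value += 1


def run_workers(iterations=1000):
    """Run two workers that name the same rows in opposite orders."""
    a, b = Row("a"), Row("b")

    def worker(x, y):
        for _ in range(iterations):
            update_both(x, y)

    t1 = threading.Thread(target=worker, args=(a, b))
    t2 = threading.Thread(target=worker, args=(b, a))  # opposite argument order
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.value, b.value
```

If `update_both` instead locked `first` and then `second` in argument order, the two workers above could each take one lock and wait forever on the other. A database would resolve that cycle by picking a deadlock victim, as in the log.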