microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework
MIT License

Orchestration execution retry due to OperationCanceledException? #208

Closed gha-zund closed 9 months ago

gha-zund commented 10 months ago

We encountered a problem with our application when using durable entities with the MSSQL provider.

We have several durable entities which we use to synchronize some data processing work. Those entity operations do some CPU work but also make several I/O calls, for example inserting or deleting data in a SQL database (a different one than the one we use for the durable task hub). The metrics of that database show that CPU has been at 100% for hours and that the database scaled out to the configured maximum count of vCores, which causes pretty high cost. Query Performance Insight in the Azure portal gives us the hint that the load comes from a SQL statement which is executed by an entity operation.

Although we did not find an actual explanation for this behavior, it feels like the application is caught in a loop somehow. This is just an assumption so far.

In the logs of the function app, we found many exceptions and traces related to cancellation scenarios. We consider that normal since our function app scales out and in all the time (which is desired). Note: the UseGracefulShutdown setting of the Durable Task extension is set to "true".
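
For reference, this is roughly where that flag sits in host.json, assuming the standard Durable Functions layout (a minimal sketch, not our full configuration):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "useGracefulShutdown": true
    }
  }
}
```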

However, there was one trace which caught my attention. It says "The execution will be aborted and retried." Is this a hint at a never-ending retry loop? Is there a waiting period between the retries? The interesting thing is: why should this even be retried? The log says "OperationCanceledException". Why retry a cancellation? (Is there some cancellation-exception handling missing?)

Here is the log I'm referring to:

@calculator@G300L251734: Function 'calculator (Entity)' was aborted. Reason: An internal error occurred while attempting to execute this function. The execution will be aborted and retried. Details: System.OperationCanceledException: A severe error occurred on the current command. The results, if any, should be discarded.
Operation cancelled by user.
The statement has been terminated.
---> System.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command. The results, if any, should be discarded.
Operation cancelled by user.
The statement has been terminated.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
at System.Data.SqlClient.SqlCommand.EndExecuteNonQuery(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location ---
at Dapper.SqlMapper.ExecuteImplAsync(IDbConnection cnn, CommandDefinition command, Object param) in /_/Dapper/SqlMapper.Async.cs:line 662
at SmartCutterWorker.Common.Db.DbProvider.ExecuteAsync(String sql, Object parameters, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Common\Db\DbProvider.cs:line 51
ClientConnectionId:c39e6ed0-62e9-4d9a-abe2-8a01a4fe1c09
ClientConnectionId before routing:2611d667-b938-476e-b743-e34749ced1d9
Routing Destination:d6ef48d26a9c.tr25058.westeurope1-a.worker.database.windows.net,11044
--- End of inner exception stack trace ---
at SmartCutterWorker.Common.Db.DbProvider.ExecuteAsync(String sql, Object parameters, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Common\Db\DbProvider.cs:line 60
at SmartCutterWorker.Calculations.Commands.Calculation.PostprocessJobsCommand.ExecuteAsync(IDeviceInfo device, PreprocessOutput preprocessing, DateTimeOffset from, ILogger log, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\PostprocessJobsCommand.cs:line 85
at SmartCutterWorker.Calculations.Commands.Calculation.PostprocessJobsCommand.ExecuteAsync(IDeviceInfo device, PreprocessOutput preprocessing, DateTimeOffset from, ILogger log, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\PostprocessJobsCommand.cs:line 149
at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.<>c__DisplayClass11_0.<<DoCalculationsAsync>g__PostprocessJobsAsync|1>d.MoveNext() in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 211
--- End of stack trace from previous location ---
at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.<>c__DisplayClass11_0.<<DoCalculationsAsync>g__ProcessJobsAsync|0>d.MoveNext() in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 206
--- End of stack trace from previous location ---
at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.DoCalculationsAsync(IDeviceInfo deviceInfo, DateTimeOffset from, DateTimeOffset to, Boolean optimizeForInsert, ILogger log, CancellationToken ct, Boolean throwExceptions) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 280
at SmartCutterWorker.Calculations.DurableEntities.Calculator.Calculate() in D:\a\1\s\Source\SmartCutterWorker.Calculations\DurableEntities\Calculator.cs:line 376
at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableEntityContext.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableEntityContext.DispatchAsync[T](Object[] constructorParameters) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableEntityContext.cs:line 537
at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskEntityShim.ProcessOperationRequestAsync(RequestMessage request) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\Listener\TaskEntityShim.cs:line 472
at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableEntityContext.ThrowApplicationExceptionsIfAny() in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableEntityContext.cs:line 166
at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableTaskExtension.<>c__DisplayClass83_0.<<EntityMiddleware>b__0>d.MoveNext() in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\DurableTaskExtension.cs:line 1090
--- End of stack trace from previous location ---
at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1.<>c__DisplayClass7_0.<<TryExecuteAsync>b__0>d.MoveNext() in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:line 50
--- End of stack trace from previous location ---
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108. IsReplay: False. HubName: workerCalculationsDurableTaskHub. AppName: zuend-iot-test-fn. SlotName: Production. ExtensionVersion: 2.13.0. SequenceNumber: 111.
_Category: Host.Triggers.DurableTask
LogLevel: Warning
IsReplay: false_

Please note that "@calculator@G300L251734" is the ID of an entity in our application. Obviously, there was a SQL error response from the database. We have a catch clause which checks for this SQL error and wraps the SqlException in an OperationCanceledException (shouldn't we do this?). OperationCanceledException (or its derivatives like TaskCanceledException) is never caught in our application, since we want the function runtime to handle it (to retry it later, after scale-in, on another function app instance).
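
For illustration, the wrapping looks roughly like this (a simplified sketch, not the exact code from our DbProvider; the error check and the connection handling are placeholders):

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;
using Dapper;

public class DbProvider
{
    private readonly string connectionString;

    public DbProvider(string connectionString) => this.connectionString = connectionString;

    public async Task<int> ExecuteAsync(string sql, object parameters, CancellationToken ct)
    {
        using var connection = new SqlConnection(this.connectionString);
        await connection.OpenAsync(ct);
        try
        {
            return await connection.ExecuteAsync(
                new CommandDefinition(sql, parameters, cancellationToken: ct));
        }
        catch (SqlException ex) when (ex.Message.Contains("Operation cancelled by user"))
        {
            // Surface the server-side cancellation as a cancellation, so the caller
            // (and ultimately the Functions runtime) sees an OperationCanceledException.
            throw new OperationCanceledException(ex.Message, ex, ct);
        }
    }
}
```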

Together with this one we also find logs like this:

@calculator@D300L320111: Failed to renew orchestration work item: System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at DurableTask.SqlServer.SqlOrchestrationService.GetAndOpenConnectionAsync(CancellationToken cancelToken) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 87
   at DurableTask.SqlServer.SqlOrchestrationService.RenewTaskOrchestrationWorkItemLockAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 318
   at DurableTask.Core.TaskOrchestrationDispatcher.RenewUntil(TaskOrchestrationWorkItem workItem, IOrchestrationService orchestrationService, LogHelper logHelper, String dispatcher, CancellationToken cancellationToken) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 661
_Category: DurableTask.Core
EventName: RenewOrchestrationWorkItemFailed
LogLevel: Warning_

or:

@calculator@D300L320111: Function 'calculator (Entity)' failed 'Calculate' operation 15528046-fa5c-425f-a0f5-32bfdb29001b after 768689.3578ms with exception System.OperationCanceledException: A severe error occurred on the current command.  The results, if any, should be discarded.
Operation cancelled by user.
The statement has been terminated.
 ---> System.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
Operation cancelled by user.
The statement has been terminated.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQuery(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location ---
   at Dapper.SqlMapper.ExecuteImplAsync(IDbConnection cnn, CommandDefinition command, Object param) in /_/Dapper/SqlMapper.Async.cs:line 662
   at SmartCutterWorker.Common.Db.DbProvider.ExecuteAsync(String sql, Object parameters, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Common\Db\DbProvider.cs:line 51
ClientConnectionId:018f201f-a0c1-41f1-a679-5529d6cae712
ClientConnectionId before routing:f57ac0f3-ea30-42a5-8fa0-4d4652060dd1
Routing Destination:d6ef48d26a9c.tr25058.westeurope1-a.worker.database.windows.net,11044
   --- End of inner exception stack trace ---
   at SmartCutterWorker.Common.Db.DbProvider.ExecuteAsync(String sql, Object parameters, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Common\Db\DbProvider.cs:line 60
   at SmartCutterWorker.Calculations.Commands.Calculation.PostprocessJobsCommand.ExecuteAsync(IDeviceInfo device, PreprocessOutput preprocessing, DateTimeOffset from, ILogger log, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\PostprocessJobsCommand.cs:line 85
   at SmartCutterWorker.Calculations.Commands.Calculation.PostprocessJobsCommand.ExecuteAsync(IDeviceInfo device, PreprocessOutput preprocessing, DateTimeOffset from, ILogger log, CancellationToken ct) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\PostprocessJobsCommand.cs:line 149
   at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.<>c__DisplayClass11_0.<<DoCalculationsAsync>g__PostprocessJobsAsync|1>d.MoveNext() in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 211
--- End of stack trace from previous location ---
   at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.<>c__DisplayClass11_0.<<DoCalculationsAsync>g__ProcessJobsAsync|0>d.MoveNext() in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 206
--- End of stack trace from previous location ---
   at SmartCutterWorker.Calculations.Commands.Calculation.DoCalculationCommand.DoCalculationsAsync(IDeviceInfo deviceInfo, DateTimeOffset from, DateTimeOffset to, Boolean optimizeForInsert, ILogger log, CancellationToken ct, Boolean throwExceptions) in D:\a\1\s\Source\SmartCutterWorker.Calculations\Commands\Calculation\DoCalculationCommand.cs:line 280
   at SmartCutterWorker.Calculations.DurableEntities.Calculator.Calculate() in D:\a\1\s\Source\SmartCutterWorker.Calculations\DurableEntities\Calculator.cs:line 376
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableEntityContext.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableEntityContext.DispatchAsync[T](Object[] constructorParameters) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableEntityContext.cs:line 537
   at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskEntityShim.ProcessOperationRequestAsync(RequestMessage request) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\Listener\TaskEntityShim.cs:line 472. Input: (33112 bytes). IsReplay: False. HubName: workerCalculationsDurableTaskHub. AppName: zuend-iot-test-fn. SlotName: Production. ExtensionVersion: 2.13.0. SequenceNumber: 10.
_Category: Host.Triggers.DurableTask
LogLevel: Error
isReplay: false_

and many scale recommendations, e.g.:

Recommending a replica count change from 3 to 2.
_Category: DurableTask.SqlServer
EventName: ReplicaCountChangeRecommended
LogLevel: Information_

as well as errors from the scale monitor:

Failed to collect scale metrics sample for monitor 'DurableTask-SqlServer:workerCalculationsDurableTaskHub'.
_Category: Microsoft.Azure.WebJobs.Host.Scale.ScaleMonitorService
LogLevel: Error_

In the exceptions table in Application Insights I found (as expected) lots of TaskCanceledException occurrences from the SqlOrchestrationService:

_Type: System.Threading.Tasks.TaskCanceledException
method: DurableTask.SqlServer.SqlOrchestrationService+<GetAndOpenConnectionAsync>d__13.MoveNext
Formatted message: Failed to collect scale metrics sample for monitor 'DurableTask-SqlServer:workerCalculationsDurableTaskHub'.
Category: Microsoft.Azure.WebJobs.Host.Scale.ScaleMonitorService
LogLevel: Error_

(No clue how to get the stack trace as a simple string out of Application Insights.)

and of course the OperationCanceledException from our code (we log it and then re-throw it):

_type: System.Data.SqlClient.SqlException
outerType: System.OperationCanceledException
outerMessage: `A severe error occurred on the current command. The results, if any, should be discarded.
Operation cancelled by user.
The statement has been terminated.`
Category: Function.Calculator (<- this is our entity-trigger-function)
EventName: FunctionCompleted
LogLevel: Error
status: Failed_

One last thing: we assume that the issue (heavy load on the DB from the commands made by entities) has something to do with the MSSQL provider, since we have never experienced this behavior with the same application using the azureStorage provider or Netherite.

cgillum commented 10 months ago

There's a lot to sift through here. I'll respond to a couple of things you mentioned:

The interesting thing is: why should this even be retried? The log says "OperationCanceledException". Why retrying a cancellation? (is there some cancellation-exception handling missing?)

I'm not aware of any case in which Azure Functions or the Durable Task Framework handles OperationCanceledException from user code. We only handle that exception if it's raised by our internal code. I don't think you should expect the underlying framework to handle it any differently, because we can't make any assumptions about what caused it if it's not code that we control. For that reason, retrying the operation is the "right behavior" per the current design. If you don't want the operation to be retried, then you should not be rethrowing your SQL exception at all.
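
For example, catching the cancellation inside the entity operation keeps it from reaching the dispatcher, so the operation completes instead of being aborted and retried (a sketch only; the class and method names stand in for your own code):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public class Calculator
{
    private readonly ILogger logger;

    public Calculator(ILogger logger) => this.logger = logger;

    public async Task Calculate()
    {
        try
        {
            await DoCalculationsAsync();
        }
        catch (OperationCanceledException ex)
        {
            // Handle the cancelled SQL command here instead of rethrowing it.
            // Nothing propagates out of the operation, so the runtime does not
            // treat it as an internal failure and does not retry the execution.
            this.logger.LogWarning(ex, "Calculation was cancelled; not retrying.");
        }
    }

    // Placeholder for the application's actual calculation logic.
    private Task DoCalculationsAsync() => Task.CompletedTask;
}
```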

I suspect the other errors you're seeing are related to the 100% CPU usage.

We assume that the issue (heavy load on the DB, from the commands made by entites) has something to do with mssql-provider since we never experience this behavior with the same application with azureStorage provider or netherite.

It's certainly possible that the high CPU usage by your app code could negatively impact the behavior of the durabletask-mssql provider code, or vice versa. If you're maxing out on vCores, would it make sense to separate these into two different databases?
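
For reference, the task hub database used by the MSSQL provider is configured independently of any application database through the storageProvider section of host.json; the connection string name below is just an example:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "mssql",
        "connectionStringName": "SQLDB_Connection"
      }
    }
  }
}
```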

gha-zund commented 9 months ago

Hi @cgillum

The databases are already separated; they are both serverless Azure SQL databases, even on different server instances. The durable task database seems to be pretty relaxed, according to its metrics.

We already experimented with scaling out the affected database. When we increased the maximum count of vCores (to 10), the CPU metric dropped to a normal range. However, the available resources were not fully utilized, so we returned to the previous configuration (max 6 vCores). Then it happened again...

Well, you could say this means we simply have to scale out and everything is fine, but we are kind of startled. On our production system we have a maximum of 14 vCores with the azureStorage provider, even though the activity of the software there is nearly ten times what it is on our test system. The other thing is: with the same version of the software and the same infrastructure, but with the azureStorage provider instead of MSSQL, we have never seen such metrics on the database.

I opened the issue in hope for an explanation :)

cgillum commented 9 months ago

@gha-zund if the databases are separate, and the only difference is that you're using a different storage provider (MSSQL vs. Azure Storage), then another possible explanation is a difference in orchestration and activity throughput and/or concurrency that's causing the difference in vCore usage. For example, it could be that you're running more orchestrations and activities per second compared to your previous configuration, which could explain why your app database's vCore usage has increased.

If you want to throttle the orchestration executions, you can reduce the concurrency settings until you're able to maintain an acceptable range of database vCore usage for your app. This can be configured in your host.json file via the maxConcurrentActivityFunctions and maxConcurrentOrchestratorFunctions settings; see the host.json configuration reference for details.
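
For example (illustrative values only; tune them for your workload):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 4,
      "maxConcurrentOrchestratorFunctions": 4
    }
  }
}
```

Entity operations are dispatched alongside orchestrations, so maxConcurrentOrchestratorFunctions is the setting most likely to affect your entity throughput in this setup.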

gha-zund commented 9 months ago

Thanks for your input!

We already played a lot with the concurrency throttles. It seems we have not found a good combination of those settings and scaling yet.

Based on your answers, I understand that there is no misbehavior or malfunction caused by the MSSQL provider. Great to hear :)