microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework
MIT License

Heavy payload leads to stuck orchestration #126

Closed: bhugot closed this issue 1 year ago

bhugot commented 2 years ago

Hello, we ran into a problem with the SQL provider: the payload is stored in a table, and when it is too big it gets truncated. After that we get timeouts and nothing is processed. I think we should handle this the way the Azure Storage provider does and store large payloads in a separate file (blob) in that case.

cgillum commented 2 years ago

@bhugot what's the size of your payload? The SQL provider uses varchar(MAX) for storing payloads, which supports up to 2 GB. Do you have a payload that's larger than this? Azure Storage may not be a great choice either for payloads this large.

bhugot commented 2 years ago

After more analysis, it seems it's in fact a timeout in that case, but caused by the payload size. I will get the stack trace.

bhugot commented 2 years ago

```
TaskOrchestrationDispatcher-e8f908058dee4e5982ca5d1d14ad63e5-0: Unhandled exception with work item '04f96af8-39f6-4f63-8e15-e63936e2ea3d': Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location ---
   at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries) in /_/src/DurableTask.SqlServer/SqlUtils.cs:line 496
   at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries) in /_/src/DurableTask.SqlServer/SqlUtils.cs:line 503
   at DurableTask.SqlServer.SqlUtils.ExecuteSprocAndTraceAsync[T](DbCommand command, LogHelper traceHelper, String instanceId, Func`2 executor) in /_/src/DurableTask.SqlServer/SqlUtils.cs:line 438
   at DurableTask.SqlServer.SqlOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 375
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 551
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373
ClientConnectionId:a44b295e-5fe9-4952-8566-be45f2e88e22
Error Number:-2,State:0,Class:11
ClientConnectionId before routing:43d4b49d-060a-40a0-abe9-2a9df4786ec6
Routing Destination:b8c44c89e538.tr604.westeurope1-a.worker.database.windows.net,11044

Backing off for 1 seconds until 5 successful operations
```

bhugot commented 2 years ago

I have scaled up the database DTUs and it's OK now. I will try to find the reason for the timeout, but it may be a too-large insert.

bhugot commented 2 years ago

Hello, I'm back on this problem. We have some batch orchestrations that load massive amounts of data, on the order of 500K strings, and then call another orchestration for each one of them. This was working fine on Azure Storage, but with MSSQL it goes idle: the payload is inserted into the table, but nothing is done with it afterwards, I suppose because of the timeout. Maybe there should be a parameter that allows changing the command timeout; that would solve this issue.

bhugot commented 2 years ago

OK, by adding Command Timeout to the connection string I solved the problem, but maybe a change to the hard-coded default in the provider would be better.
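
For anyone hitting the same symptom, a minimal sketch of that workaround. `Command Timeout` is a Microsoft.Data.SqlClient connection string keyword (v2.1 and later); the server, database, credentials, and timeout value below are placeholders:

```csharp
// Raise the SQL command timeout (in seconds) via the connection string instead of
// relying on the 30-second SqlCommand default. Requires Microsoft.Data.SqlClient 2.1+.
// Server, database, and credential values are placeholders.
const string connectionString =
    "Server=myserver.database.windows.net;Database=DurableDB;" +
    "User ID=dt_user;Password=<secret>;Command Timeout=120";
```

In a Durable Functions app, the same keyword can simply be added to whatever connection string setting the SQL provider is configured to read.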

kevbry commented 1 year ago

@bhugot There's a massive performance issue in microsoft.data.sqlclient when reading large fields with async methods. If the payload is large it's probably tripping over that issue. The suggested workaround is to either use the streaming API if you want to stay async, or move anything that interacts with things like varchar(max) to sync. In either case that's a change in durabletask-mssql.

https://github.com/dotnet/SqlClient/issues/593
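
For illustration, a minimal sketch of the streaming workaround described in that issue, assuming a hypothetical table and column (not the actual durabletask-mssql schema): combine `CommandBehavior.SequentialAccess` with `GetTextReader` so the large column is streamed rather than buffered by the async reader.

```csharp
using System.Data;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class LargePayloadReader
{
    // Streams a large varchar(max) column instead of letting the async reader
    // buffer it in memory, which is what triggers dotnet/SqlClient#593.
    // Table and column names are hypothetical placeholders.
    public static async Task<string> ReadPayloadAsync(SqlConnection connection, long payloadId)
    {
        using var command = new SqlCommand(
            "SELECT PayloadText FROM dbo.Payloads WHERE PayloadID = @id", connection);
        command.Parameters.AddWithValue("@id", payloadId);

        // SequentialAccess avoids buffering the whole row before handing it back.
        using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
        if (!await reader.ReadAsync())
        {
            return null;
        }

        // GetTextReader streams the column; this path avoids the slow per-packet
        // accumulation that the buffering async reads hit on large values.
        using var textReader = reader.GetTextReader(0);
        return await textReader.ReadToEndAsync();
    }
}
```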

cgillum commented 1 year ago

Wow, thanks @kevbry for mentioning this! I'm very much open to using the sync overloads for the background* reads that we do from SQL. There really isn't a possibility for thread-pool starvation since parallelism is already throttled, so it feels like an easy performance win with little downside.

*By background, I'm talking about all SQL reads driven by IOrchestrationService methods. Anything called by IOrchestrationServiceClient should still use async since those calls are typically made by threads owned by the app.
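
A rough sketch of the shape such a change could take (hypothetical names, not the actual durabletask-mssql code): dispatcher-driven reads keep a Task-returning signature but do the read with the synchronous overload, while client-facing reads stay fully async.

```csharp
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class BackgroundSqlReads
{
    // Background (IOrchestrationService-driven) path: do the read with the
    // synchronous overload on a worker thread. Parallelism is already throttled
    // by the dispatcher, so blocking a handful of threads cannot starve the pool.
    // Table and column names are hypothetical placeholders.
    public static Task<string> ReadPayloadForDispatcherAsync(SqlConnection connection, long payloadId)
    {
        return Task.Run(() =>
        {
            using var command = new SqlCommand(
                "SELECT PayloadText FROM dbo.Payloads WHERE PayloadID = @id", connection);
            command.Parameters.AddWithValue("@id", payloadId);
            return (string)command.ExecuteScalar();
        });
    }
}
```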

kevbry commented 1 year ago

@cgillum I haven't done any profiling to see if it's actually the cause here, but seems possible. SqlUtil in this project seems to be using SqlClient in a similar way to the app we had that tripped over that same bug.

If there's anything at all you can do internally to raise the profile of the issue that'd probably be helpful (and appreciated by people like me who keep tripping over it). Reading the entire thread on 593 it sounds like the cause is understood, but the people who actually understand it can't get enough time allocated to fix it.

bhugot commented 1 year ago

Thanks @kevbry. Indeed, running the SqlClient performance tests, all the synchronous command methods are faster, and allocate less, than the asynchronous ones:

| Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| ExecuteReader | 48.60 ms | 4.094 ms | 2.141 ms | - | - | - | - | - | 8.13 KB |
| ExecuteReaderAsync | 80.30 ms | 6.355 ms | 4.204 ms | 873.0000 | - | 1500.0000 | - | - | 11194.37 KB |
| ExecuteScalar | 38.24 ms | 0.907 ms | 0.539 ms | - | - | - | - | - | 8.11 KB |
| ExecuteScalarAsync | 38.31 ms | 0.694 ms | 0.459 ms | 5.0000 | - | - | - | - | 10.8 KB |
| ExecuteNonQuery | 37.94 ms | 0.666 ms | 0.440 ms | - | - | - | - | - | 6 KB |
| ExecuteNonQueryAsync | 39.58 ms | 3.188 ms | 2.109 ms | 3.0000 | - | - | - | - | 7.42 KB |
| ExecuteXmlReader | 158.77 ms | 34.782 ms | 20.698 ms | - | - | - | - | - | 1445.5 KB |
| ExecuteXmlReaderAsync | 190.03 ms | 29.453 ms | 17.527 ms | 244.0000 | - | - | - | - | 474.11 KB |
| VARBINARY_MAX_ExecuteScalar | 33.64 ms | 0.750 ms | 0.446 ms | - | - | 500.0000 | 500.0000 | 500.0000 | 20481.99 KB |
| VARBINARY_MAX_ExecuteScalarAsync | 2,940.72 ms | 170.859 ms | 101.676 ms | 1320.5000 | - | 2000.0000 | 1000.0000 | 500.0000 | 31404.74 KB |

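(For anyone who wants to reproduce numbers in this ballpark without the SqlClient repo's own perf suite, here is a minimal BenchmarkDotNet sketch comparing the sync and async scalar read of a ~20 MB varbinary(max) value; the connection string is a placeholder.)

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Data.SqlClient;

public class LargeValueReadBenchmark
{
    // Placeholder connection string; point it at any test database.
    private const string ConnectionString =
        "Server=localhost;Database=BenchDb;Integrated Security=true;TrustServerCertificate=true";

    // Builds a ~20 MB varbinary(max) value on the server for each read.
    private const string Query =
        "SELECT CAST(REPLICATE(CAST('x' AS varchar(max)), 20000000) AS varbinary(max))";

    private SqlConnection connection;

    [GlobalSetup]
    public void Setup()
    {
        connection = new SqlConnection(ConnectionString);
        connection.Open();
    }

    [GlobalCleanup]
    public void Cleanup() => connection.Dispose();

    // Synchronous scalar read of the large value.
    [Benchmark]
    public byte[] VarbinaryMaxExecuteScalar()
    {
        using var command = new SqlCommand(Query, connection);
        return (byte[])command.ExecuteScalar();
    }

    // Same read through the async overload, where the slowdown shows up.
    [Benchmark]
    public async Task<byte[]> VarbinaryMaxExecuteScalarAsync()
    {
        using var command = new SqlCommand(Query, connection);
        return (byte[])await command.ExecuteScalarAsync();
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LargeValueReadBenchmark>();
}
```
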
kevbry commented 1 year ago

@bhugot do you still have the test suite you'd developed as part of this issue? They're looking for samples in the sqlclient issue I mentioned earlier.