microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework

datediff function resulted in an overflow #69

Closed · marcd123 closed this issue 2 years ago

marcd123 commented 2 years ago

We use the durabletask-mssql extension for Azure Functions in a multi-tenant database. We use user connection strings to separate our tenants, which are Live Dev Env, Live Test Env, and individual ones for our developers.

Yesterday our Live Dev Env stopped responding to requests, and our Azure Functions host reports the error below over and over again when it tries to start up. As a result, the host can never fully start up or accept any requests. This has not affected our developers' local environments (unique task hubs in the DB) or our Live Test Env (which also has its own unique task hub in the same DB). No changes unique to the dev environment were made that could have caused this.

The host reports that TaskActivityDispatcher failed to fetch a work-item because of a SQL exception: "The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart."

Because we don't interact with the MSSQL queue database manually, the error appears to originate in the SQL connection or the Durable Task extension of Azure Functions.

@cgillum FYI, we've been building this solution with Azure Durable Functions and the MSSQL Task Hub extension for over 6 months and we've been absolutely loving its capabilities <3 Especially when you add KEDA! We're moving this solution into production and scaling it globally very soon, so it is crucial that we get this resolved, understand the cause of the issue, and create a strategy for preventing/triaging it in the future. Any advice you can offer here is much appreciated.

Details:

Full Exception:

fail: DurableTask.Core[25]
      TaskActivityDispatcher-37b55ccda45246ec9acf45def0121d13-0: Failed to fetch a work-item: Microsoft.Data.SqlClient.SqlException (0x80131904): The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.
         at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
         at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
         at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
         at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
         at Microsoft.Data.SqlClient.SqlDataReader.ReadAsyncExecute(Task task, Object state)
         at Microsoft.Data.SqlClient.SqlDataReader.InvokeAsyncCall[T](AAsyncCallContext`1 context)
      --- End of stack trace from previous location where exception was thrown ---
         at DurableTask.SqlServer.SqlOrchestrationService.LockNextTaskActivityWorkItem(TimeSpan receiveTimeout, CancellationToken cancellationToken) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 332
         at DurableTask.Core.WorkItemDispatcher`1.DispatchAsync(WorkItemDispatcherContext context) in C:\source\durabletask\src\DurableTask.Core\WorkItemDispatcher.cs:line 262
      ClientConnectionId:5093d664-5c8b-4308-a4a1-3c863255c4c6
      Error Number:535,State:0,Class:16
      ClientConnectionId before routing:6ff9c21f-29a9-43f0-8a7d-72f5e7779702
      Routing Destination:bb08fcc51e7d.HS2.tr6036.centralus1-a.worker.database.windows.net,11005

Extensions.csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <WarningsAsErrors></WarningsAsErrors>
    <DefaultItemExcludes>**</DefaultItemExcludes>
  </PropertyGroup>

  <!-- Need to rename extensions.deps.json to function.deps.json for the native SQL dependencies to get loaded -->
  <Target Name="PostBuild" AfterTargets="PostBuildEvent">
    <Move SourceFiles="$(OutDir)/extensions.deps.json" DestinationFiles="$(OutDir)/function.deps.json" />
  </Target>

  <ItemGroup>
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.5.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.SignalRService" Version="1.5.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Script.ExtensionsMetadataGenerator" Version="1.2.3" />
    <PackageReference Include="Microsoft.DurableTask.SqlServer.AzureFunctions" Version="0.10.0-beta" />
  </ItemGroup>
</Project>
cgillum commented 2 years ago

@marcd123 thanks for reporting this! Let me investigate to see if I can determine the potential root cause and identify a temporary workaround. I'm guessing there may be a SQL stored procedure that's making bad assumptions about date/times, in which case we might be able to live-patch it to get things working again.

marcd123 commented 2 years ago

That's awesome, thanks @cgillum. And just as a reminder, this is only affecting one of our task hubs in the database. The affected task hub is definitely the one that gets the most usage. Here is a count of History table records by task hub; aksdev is the affected hub, and the others in the screenshot are working as normal:

[Screenshot: History table record counts by task hub]
cgillum commented 2 years ago

@marcd123 I did some research on this error, and it seems to me that this can happen if you have rows in either the dt.NewEvents table or the dt.NewTasks table that have a [Timestamp] value that's 24 days ago or more. This is because we try to calculate the "age" of a row in milliseconds when selecting new work-items, and 24+ days in milliseconds is larger than what can fit in an int. In your case, the callstack leads me to believe that you have rows in the dt.NewTasks table with old timestamps.
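For context, DATEDIFF returns a 32-bit int, so a millisecond difference greater than 2,147,483,647 ms (roughly 24.8 days) produces exactly this error. Below is a minimal sketch of the overflow and two generic ways to avoid it; this is just an illustration, not the provider's actual stored-procedure code:

-- A ~56-day span in milliseconds exceeds the int range, so the commented-out
-- expression would raise "The datediff function resulted in an overflow."
DECLARE @old datetime2 = '2021-12-17T22:01:33';
DECLARE @now datetime2 = '2022-02-11T00:00:00';

-- SELECT DATEDIFF(millisecond, @old, @now);   -- overflows (> 2,147,483,647)

-- Option 1: DATEDIFF_BIG (SQL Server 2016+ / Azure SQL) returns a bigint.
SELECT DATEDIFF_BIG(millisecond, @old, @now) AS AgeMs;

-- Option 2: a less precise datepart keeps the result well within int range.
SELECT DATEDIFF(second, @old, @now) AS AgeSeconds;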

Now, it's not exactly normal for a row to sit in that table for so long, so I'm curious to know why you're in this situation. I'm glad to hear only one of your tenants is impacted, which means it's likely not an issue where one tenant can bring down the others. However, clearly something must have happened in the dev tenant.

As the next step in the investigation, could you run the following query in your database to help identify which message(s) might be causing problems (you'll need to run it with an account that has permission to read from the tables directly)?

SELECT [TaskHub], [InstanceID], [Name], [Timestamp], [VisibleTime], [LockExpiration]
FROM dt.[NewTasks]
WHERE [Timestamp] < '2022-01-18' -- ~24 days or older

Assuming one or more rows come back, some quick mitigation strategies might include manually deleting those rows, updating those timestamps, or some other option we can think of. Subsequently I can work on a fix for the underlying root cause so that you can be confident that this won't happen in production.
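A rough sketch of those two manual mitigations might look like the following. Treat it as illustrative only: it assumes direct table access, assumes [Timestamp] is a plain updatable datetime2 column, limits the change to the affected task hub (aksdev), and reuses the same date cutoff as the query above. Back the rows up or wrap this in a transaction first.

-- Mitigation A: delete the stale messages outright.
DELETE FROM dt.[NewTasks]
WHERE [TaskHub] = 'aksdev' AND [Timestamp] < '2022-01-18';

-- Mitigation B: keep the messages but refresh their timestamps so the
-- age calculation no longer overflows.
UPDATE dt.[NewTasks]
SET [Timestamp] = SYSUTCDATETIME()
WHERE [TaskHub] = 'aksdev' AND [Timestamp] < '2022-01-18';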

Alternatively, we could try to go with a private code fix first so that you can test it against this live repro and validate that it correctly mitigates the problem.

marcd123 commented 2 years ago

Hey @cgillum

NewTasks was empty. NewEvents did have several records that old for the affected tenant aksdev, but there were events that old for other tenants/taskhubs that did not have the issue. I've included a sample of the records below.

I deleted these records, and the durable function app hosts are no longer reporting the SQL exception, but they are still not responding to any requests. The HTTP orchestrator client functions return 500 errors, and it doesn't seem like they're even receiving the events now. (I have these records backed up, btw, if you need them.)
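For reference, backing up and then deleting rows like these can be done with something along these lines (a sketch only, assuming direct table access and sufficient permissions; the backup table name is made up):

-- Copy the stale rows into an ad-hoc backup table before removing them.
SELECT * INTO dt.[NewEvents_Backup]
FROM dt.[NewEvents]
WHERE [TaskHub] = 'aksdev' AND [Timestamp] < '2022-01-18';

DELETE FROM dt.[NewEvents]
WHERE [TaskHub] = 'aksdev' AND [Timestamp] < '2022-01-18';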

Old Records Sample:

  {
    "TaskHub": "aksdev",
    "InstanceID": "0e56bd1dfdf04d6b9bbc992ed7dd61b9",
    "Name": null,
    "Timestamp": "2021-12-17T22:01:33.8171317",
    "VisibleTime": null
  },
  {
    "TaskHub": "aksdev",
    "InstanceID": "0e56bd1dfdf04d6b9bbc992ed7dd61b9",
    "Name": null,
    "Timestamp": "2021-12-17T22:01:40.1765132",
    "VisibleTime": null
  },
  {
    "TaskHub": "aksdev",
    "InstanceID": "0e56bd1dfdf04d6b9bbc992ed7dd61b9",
    "Name": null,
    "Timestamp": "2021-12-17T22:01:46.7390223",
    "VisibleTime": null
  },
  {
    "TaskHub": "aksdev",
    "InstanceID": "0e56bd1dfdf04d6b9bbc992ed7dd61b9",
    "Name": null,
    "Timestamp": "2021-12-17T22:01:50.3640409",
    "VisibleTime": null
  },
marcd123 commented 2 years ago

We had a separate issue with our AKS load balancer; now that it is resolved, we're able to make requests again.

For the old NewEvents and NewTasks rows, do you have any thoughts on how these events got stuck in the queue for so long and how we could prevent that?

I know the durable client context object in Python is able to clear some of the queue data, but I haven't been able to find good examples for Python. If you have any, please let me know.

cgillum commented 2 years ago

It's odd that your NewTasks table didn't have anything, because I don't see how you'd hit that particular call stack otherwise. In any case, I'm going to merge a PR that should prevent that particular error from happening again.

do you have any thoughts on how these events got stuck in the queue for so long and how we could prevent that?

For NewEvents, one possibility is that you're using the feature that allows scheduling the start of an orchestration far into the future; however, I don't believe this feature is currently exposed to Durable Functions. Another possibility is that these events are effectively "poison" events that can never be processed for some reason, though I'd expect that to have a pretty noticeable impact on the whole tenant, since these events are read in a deterministic order. Looking into the Durable Task Framework logs and enabling the DurableTask.SqlServer source like we do here could help you rule that one out. As I mentioned earlier, having an app disabled or deprovisioned for a long time and then bringing it back up is the other possibility I can think of.

I know the durable client context object in Python is able to clear some of the queue data but haven't been able to find good examples for Python, if you have any please let me know.

Unfortunately, we don't have any Python samples for doing something like this.

cgillum commented 2 years ago

@marcd123 I've merged a change that will prevent this particular error from occurring in the future. Happy to continue discussing this particular case with you, though.