microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework
MIT License
87 stars 32 forks source link

Timeout on PurgeOrchestrationInstanceHistoryAsync #137

Open lidong-zhao opened 1 year ago

lidong-zhao commented 1 year ago

It seems mssql provider uses default timeout of 30 seconds for PurgeOrchestrationInstanceHistoryAsync, how can we overwrite the default timeout value?

cgillum commented 1 year ago

Are you able to work around it by increasing the connection timeout in the SQL connection string that you're using? It's obviously not ideal, but it seems these APIs need to be updated to support longer timeouts.

lidong-zhao commented 1 year ago

Thanks, I could set command timeout in connection string, but that will affect all executions, can we add this support to these APIs?

cgillum commented 1 year ago

Yes, we can investigate whether it's possible to add a mechanism that allows you to control the timeout independent from the connection string timeout.

lidong-zhao commented 1 year ago

Other than the timeout, another option is to have transaction for each instance Id instead of having one big transaction for all isntance ids.

cgillum commented 1 year ago

Oh, I didn't realize we were using a single transaction for all the deletes! Yes, we had to make a similar fix in the Azure Storage provider for Durable Tasks. We should do something similar here. It probably won't be a transaction for each instance ID, but we can at least break it down into smaller batches so that you can at least complete the work by retrying until it's done.

lidong-zhao commented 1 year ago

Exactly! Thanks.

lidong-zhao commented 1 year ago

Another problem is that the following tables don't have index on TaskHub and InstanceId, the deletion will be very slow:

DELETE FROM NewEvents WHERE [TaskHub] = @TaskHub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs) DELETE FROM NewTasks WHERE [TaskHub] = @TaskHub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs) DELETE FROM Instances WHERE [TaskHub] = @TaskHub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs) DECLARE @deletedInstances int = @@ROWCOUNT DELETE FROM History WHERE [TaskHub] = @TaskHub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs) DELETE FROM Payloads WHERE [TaskHub] = @TaskHub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs)

cgillum commented 1 year ago

@lidong-zhao these operations should be using the clustered indexes, which are defined implicitly. For example, the NewEvents table defines the following primary key constraint, which includes both TaskHub and InstanceID:

CONSTRAINT PK_NewEvents PRIMARY KEY (TaskHub, InstanceID, SequenceNumber),

And here is the estimated query plan from SQL Server:

image

lidong-zhao commented 1 year ago

AH, most of them do have index on taskhub and instance, my mistake that I only checked dt.NewTasks and assumed the rest tables would be the same.