Open michaelplavnik opened 6 months ago
@cgillum - Would it be OK for me and Misha to attempt to fix this issue and propose a solution? Btw, Misha is my manager and we work together and are trying to use DTF for workflows.
@microrama yes, if you’re able to find and submit a solution for this, we would be more than happy to accept it. It would be great to have a fix for this issue.
Thanks @cgillum! This certainly is a bit more complicated but we will work with you for any approaches we have and of course test locally before and after hand as well.
This is what's happening: DurableTask.SqlServer.SqlUtils.WithRetry() gets called by SqlOrchestrationService.LockNextTaskOrchestrationWorkItemAsync() method. This has the DBCommand passed to it. The first run when the network is out, we get the SqlException which is checked by IsTransient() method and is logged in the method as traceHelper.TransientDatabaseFailure(). This is the first DurableTask.SqlServer error that we see as a warning. Now, while the retry loop is being executed, the second time around it appears we get an System.InvalidOperationException that is not evaluated as transient by the IsTransient() method. This exception is thrown back to the calling DurableTask.Core caller and that is logged as an error. Subsequent calls work great because the network blip worked out.
POC: With a simple console application, we were able to reproduce the InvalidOperationException by forcibly closing the SqlConnection before calling the ExecuteReaderAsync. It was hard to reproduce the first transient SqlException in the VM machine where we develop.
Bottomline: The first error is a SqlException error and is considered a transient but by the second attempt, it turns into a InvalidOperationException since the SqlConnection is closed at this point.
Fix: Pass the DBCommand to the WithRetry() method and let the method check the DBCommand connection state and open if needed before attempting executor again. If there is a SqlException opening the connection again, existing logic will retry or return fatal error back.
Next Steps: We will try out the code in our Sandbox environment and monitor.
The code change was tested in our cert environment and it appears the issue is now fixed. We will still see the Warning but the Error is gone since we check for SqlConnection State and open the connection on the command object as needed. This was tested in the CERT environment where on an average there is a network blip once or twice a day. Attaching screenshots:
Next steps - I will create a PR for this issue for your review. Thx @cgillum!
@cgillum - FYI. I created the PR few days back. Please review when you have time. https://github.com/microsoft/durabletask-mssql/pull/221
@cgillum - Greatly appreciate any help with reviewing the PR :-) https://github.com/microsoft/durabletask-mssql/pull/221
@microrama taking a look now - thanks!
Thanks @cgillum!!
SqlUtils retry logic is based on the idea that connection is live during retrying. But that is not always the case and it leads to the failure of retry loop.
Below is exception that triggered retry
And this is exception that caused retry to fail