await operator on non-OrchestrationContext tasks leaves Instance in perpetual running state

mthelen-taslar commented 1 year ago

Hi,

We recently wrote an orchestration that await'd tasks outside of the orchestration context's interface. A simple example that can reproduce this issue is the following:

    public class ParentOrchestration : TaskOrchestration<object, object>
    {
        public override async Task<object> RunTask(OrchestrationContext context, object input)
        {
            await Task.Delay(5_000);
            return Task.FromResult(new object());
        }
    }

We noticed that execution was returned back to the TaskOrchestrationExecutor before the orchestration ended, which appended an OrchestratorCompleted event to the system but the instance was left in a Running state. I am interpreting the behavior as 'any await operation on a function not wrapped by a task or sub-orchestration can leave an orchestration in a perpetual running state'. I understand an orchestration can't wait for days when worker hosts are bounced regularly, but this also strikes me as surprising behavior for our use case: an idempotent orchestration to guarantee a handful of network requests succeed. Is my understanding correct? And if so, is this a bug or expected behavior? Static analysis/linter rules could mitigate confusion for future adopters if this is the case.

We are using Microsoft.DurableTask.SqlServer version=1.1.1

The state of the dt.* tables after running one of these workflows is:

SELECT * FROM dt.Instances Order By CreatedTime DESC;

TaskHub InstanceID  ExecutionID Name    Version CreatedTime LastUpdatedTime CompletedTime   RuntimeStatus   LockedBy    LockExpiration  InputPayloadID  OutputPayloadID CustomStatusPayloadID   ParentInstanceID

dbo local:19a10861-1b1b-4c8e-9311-f14f084c740b  2ba4e9d7d71449f6b1522a9d01ddd8d1    CoreService.Workflow.Orchestrations.ParentOrchestration     2023-08-31 21:17:50.4612890 2023-08-31 21:17:53.4772917 NULL    Running DESKTOP-5J7VGCN,9524    NULL    NULL    NULL    NULL    NULL

SELECT * FROM dt.NewTasks;

(no results)

SELECT * FROM dt.NewEvents;

(no results)

SELECT * FROM dt.vHistory;

TaskHub InstanceID  ExecutionID SequenceNumber  EventType   TaskID  Timestamp   IsPlayed    Name    RuntimeStatus   VisibleTime Payload
dbo local:19a10861-1b1b-4c8e-9311-f14f084c740b  2ba4e9d7d71449f6b1522a9d01ddd8d1    0   OrchestratorStarted NULL    2023-08-31 21:17:50.5400594 0   NULL    NULL    NULL    NULL
dbo local:19a10861-1b1b-4c8e-9311-f14f084c740b  2ba4e9d7d71449f6b1522a9d01ddd8d1    1   ExecutionStarted    NULL    2023-08-31 21:17:50.4612890 1   CoreService.Workflow.Orchestrations.ParentOrchestration NULL    NULL    NULL
dbo local:19a10861-1b1b-4c8e-9311-f14f084c740b  2ba4e9d7d71449f6b1522a9d01ddd8d1    2   OrchestratorCompleted   NULL    2023-08-31 21:17:53.4113994 0   NULL    NULL    NULL    NULL

cgillum commented 1 year ago

Hi @mthelen-taslar, this is a known (and unfortunate) behavior of the Durable Task Framework and is the same whether you use this MSSQL backend or other backends.

I understand an orchestration can't wait for days when worker hosts are bounced regularly, but this also strikes me as surprising behavior for our use case: an idempotent orchestration to guarantee a handful of network requests succeed.

I apologize for not quite understanding the question. However, I can tell you this:

Orchestrations can wait for days if you await on a task defined on the context object. For example, the following is perfectly acceptable:
```
await context.CreateTimer(context.CurrentUtcDateTime.AddDays(3));
```
Any time you need to do async work that doesn't involve the context object, it must be done in a TaskActivity.

You're right that static analyzers can help here. In fact, we have one for the Durable Functions .NET SDK. Unfortunately, one hasn't been developed for the base Durable Task Framework yet.

Another improvement would be to add more runtime checks. I believe some exist if you try to call a context method after doing an "illegal await" but the logic doesn't current catch the case you've shown above. But even so, the damage will have already been done - i.e. the orchestration will still get stuck in the Running state. Static analysis is likely going to be the best way to prevent developers into running into this.

mthelen-taslar commented 1 year ago

Understood! Your response makes sense. We'll follow your guidance and keep an eye out for improvements in the future.

Feel free to close this issue. Thanks for the quick turnaround.

microsoft / durabletask-mssql

await operator on non-OrchestrationContext tasks leaves Instance in perpetual running state #186