Fix orchestration stuck using anyOf/allOf with retry policy

kaibocai commented 1 year ago

Issue describing the changes in this PR

resolves https://github.com/microsoft/durabletask-java/issues/149

Pull request checklist

[X] My changes do not require documentation changes
- [ ] Otherwise: Documentation issue linked to PR
[X] My changes are added to the CHANGELOG.md
[X] I have added all required tests (Unit tests, E2E tests)

Additional information

Additional PR information

kaibocai commented 1 year ago

shouldn't we also add a test for cases where at least one function call fails and is retried

@cgillum, it turns out the current approach failed on this case. The thing is when customers using anyOf/allOF for example

        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        tasks.add(ctx.callActivity("Append", "InputSad1", taskOptions, String.class));
        return ctx.allOf(tasks).await();

the allOf here returns a CompletableTask https://github.com/microsoft/durabletask-java/blob/b919929a0503f24ea5ac7a7e9d593a43a9dbcbdd/client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java#L188, which is not retriable and we are awaiting on it. So all the RetriableTask tasks we put in the list won't be retried as the task we await on is a CompletableTask.

So I am thinking let allOf return CompletableTask or RetriableTask depending on the tasks it has in the collection. However, there is another issue comes, as the collection of tasks can be mixed with CompletableTask and RetriableTask. In that case, I cannot run retry logic only on those RetriableTask in the collection. I can only run (or not run) retry logic on all of the tasks altogether. But maybe this is a corner case and it doesn't really matter to customers that few of their non-retriable tasks get retried?

To really solve this issue, I think we should have a new GRPC type dedicate for RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

cc: @davidmrdavid

cgillum commented 1 year ago

But maybe this is a corner case and it doesn't really matter to customers that few of their non-retriable tasks get retried?

I think this will be problematic in large fan-out cases. If 1/100 activities fails, the customer will expect that we only retry the 1 that failed and not the other 99. I don't see this as a corner case. I think we should take the time necessary to figure it out, though I'm okay with doing a partial fix since the current problem of getting stuck is worse than the problem of too much retrying.

To really solve this issue, I think we should have a new GRPC type dedicate for RetriableTask, so the SDK can decide whether it should retry based on the message sent from the sidecar.

The gRPC layer has no notion of tasks, and this is by design. Tasks are purely an SDK concept, so there isn't any way to model it in the gRPC layer even if we wanted to. I can make time to try and help identify the correct design, but it will probably have to wait until after I get back from my vacation.

kaibocai commented 1 year ago

Close this PR as a new one is created https://github.com/microsoft/durabletask-java/pull/157

microsoft / durabletask-java