microsoft / durabletask-java

Java SDK for Durable Functions and the Durable Task Framework
MIT License
13 stars 7 forks source link

Orchestrator function doesn't wait for external event if timer is more than 3days #136

Closed kanupriya15025 closed 10 months ago

kanupriya15025 commented 1 year ago

I am trying to send an event to my eternal orchestrator function while my function is sleeping with a timer. If the timer is more than 3days, the orchestrator function doesn't read the message and ignores it. Here's my code :

            Task<JsonNode> patchEvent = ctx.waitForExternalEvent("patchEvent", JsonNode.class);
            Task<?> sleepTask = ctx.createTimer(timeToStartDate);
            Task<?> winnerEvent = ctx.anyOf(patchEvent, sleepTask).await();
            if (winnerEvent == patchEvent) {
                jsonBody = patchEvent.await();
                if (!ctx.getIsReplaying()) {
                    LOGGER.info("Patch received before start of schedule. JSON {}", jsonBody);
                    ctx.continueAsNew(jsonBody);
                    return;
                }
            } else {
                LOGGER.info("Timer elapsed. No patch event received");
            }
        }

If the duration of the sleep task is less than 3days, it reads the patchEvent and executes the first "if" block. However, if the timer is more than 3days, the event is read and then discarded.

Here are the logs:

  1. Timer less than 3days :

[2023-05-23T08:51:09.400Z] 14:21:09.399 [pool-2-thread-8] INFO com.adobe.orchestration.EternalOrchestrator - kanu_java_test_1: startDate is: 2023-05-26T00:00:00.000-05:00 [2023-05-23T08:51:09.400Z] 14:21:09.399 [pool-2-thread-8] INFO com.adobe.orchestration.EternalOrchestrator - kanu_java_test_1: Workflow startDate is in future, Starting a timer for duration: PT68H9M2S seconds [2023-05-23T08:51:09.400Z] 14:21:09.399 [pool-2-thread-8] INFO com.adobe.orchestration.EternalOrchestrator - Patch received before start of schedule. JSON {JSON Body}


2. Timer greater than 3days :

[2023-05-23T08:51:18.054Z] Executing 'Functions.EternalOrchestrator' (Reason='(null)', Id=a429e287-8e6b-43a5-8b4e-53189c221bd6) [2023-05-23T08:51:18.060Z] 14:21:18.060 [pool-2-thread-8] INFO com.adobe.orchestration.EternalOrchestrator - kanu_java_test_1: startDate is: 2023-05-27T00:00:00.000-05:00 [2023-05-23T08:51:18.061Z] 14:21:18.060 [pool-2-thread-8] INFO com.adobe.orchestration.EternalOrchestrator - kanu_java_test_1: Workflow startDate is in future, Starting a timer for duration: PT92H8M50S seconds [2023-05-23T08:51:18.061Z] Function "EternalOrchestrator" (Id: a429e287-8e6b-43a5-8b4e-53189c221bd6) invoked by Java Worker [2023-05-23T08:51:18.064Z] Executed 'Functions.EternalOrchestrator' (Succeeded, Id=a429e287-8e6b-43a5-8b4e-53189c221bd6, Duration=12ms)

kanupriya15025 commented 1 year ago

I have a feeling that this is related to this change : https://github.com/kamperiadis/durabletask-java/blob/5b0c54de018c62f1fbb6746ef10c6731073c4148/client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcWorker.java#L26

kaibocai commented 1 year ago

Adding @kamperiadis for awareness.

kamperiadis commented 1 year ago

Thank you @kanupriya15025 for reporting this issue. Can you please clarify for me what you meant by "the event is read and then discarded." Just making sure that I understand the issue at hand - so you expected the "Patch received before start of schedule." to be shown since the patchEvent completed before the timer but you only saw the log when the timer was less than 3 days. Am I understanding it correctly?

kamperiadis commented 1 year ago

From my own testing, this seems to be happening because the orchestration does not finish starting before the external event is triggered. I was able to reproduce this issue when the timer had relatively large values since that's when I notice the orchestration taking longer to start. However, when I tested with a timer of 7 days, I still got the expected behavior, so I do not think this is related to the changes in #122 (i.e. 3 days did not consistently become the constraint for me). When testing with larger numbers (e.g. 10, 15), I was able to reproduce the issue more consistently.

kamperiadis commented 1 year ago

Hi @kanupriya15025! I was unable to reproduce this issue by following these steps:

  1. Make an API call to start the orchestration
  2. Wait for the orchestration to start
  3. Trigger the external event before the timer was up
  4. See the "winner is the external event" message in the logs

This is the sample code I am running

Endpoint that will trigger the orchestration:

    @FunctionName("StartOrchestration")
    public HttpResponseMessage startOrchestration(
        @HttpTrigger(name = "req",
                methods = {HttpMethod.GET, HttpMethod.POST},
                authLevel = AuthorizationLevel.ANONYMOUS) final HttpRequestMessage<Optional<String>> request,
        @DurableClientInput(name = "durableContext") final DurableClientContext durableContext,
        final ExecutionContext context) {
        context.getLogger().info("Java HTTP trigger processed a request");

        final DurableTaskClient client = durableContext.getClient();
        final String instanceId = client.scheduleNewOrchestrationInstance("WaitForExternalEvent");
        return durableContext.createCheckStatusResponse(request, instanceId);
    }

Orchestration logic similar to what you have described:

        Task<Something> event = ctx.waitForExternalEvent("Approval", Something.class);
        Task<?> timer = ctx.createTimer(Duration.ofDays(7)); 
        Task<?> winner = ctx.anyOf(event, timer).await(); 
        if (winner == event) {
            response = event.await();
            if (!ctx.getIsReplaying()) {
                context.getLogger().info("winner is the external event");
                ctx.continueAsNew(response);
                return;
            }
        }
        else {
            context.getLogger().info("winner is the timer");
        }

External event logic:

durableContext.getClient().raiseEvent(instanceId, "Approval", object);

Can you please let me know if these are not the right steps to reproduce the issue? Thank you.

kaibocai commented 1 year ago

I think the issue comes from here https://github.com/microsoft/durabletask-java/blob/293489f269972678ba0e1bc1ee4cfabd7b8ec5b5/client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java#L571

We explicitly await the internal timer to complete. So when customer have a timer larger than 3 days they will have an internal timer we set for them, and the code is waiting on this timer to complete first. In this case, all the external events are buffered in queue because the internal timer is not completed so no further action is taken on those events. This also explain why when timer smaller than 3 days we don't see this issue.

We may need to find another way to resolve issue rasied at https://github.com/microsoft/durabletask-java/issues/114

cc @cgillum @lilyjma

kaibocai commented 12 months ago

In this case, I don't think there is a workaround on SDK side, seems we need to update the DF extension accordingly. @cgillum , @davidmrdavid do you have some better ideas? Thanks.

cgillum commented 12 months ago

@kaibocai I'm just now looking at the code you linked to, and I think you're correct about what the problem is. However, I disagree about the solution. There's nothing the extension can do to fix this - it merely persists the actions provided to it by the SDK. Rather, the problem in my opinion is that the SDK implementation is incorrect - we should not be blocking the orchestration like we are in the line you pointed to. The createTimer method needs to be reimplemented in a way that is non-blocking.

kamperiadis commented 10 months ago

@kanupriya15025 We just released SDK v1.4.0, which has the fix for this issue. You can find it here Central Repository: com/microsoft/durabletask-azure-functions/1.4.0 (apache.org) Thank you

kanupriya15025 commented 8 months ago

@kamperiadis Now the event is read properly but in a code similar to this :

        Task<Something> event = ctx.waitForExternalEvent("Approval", Something.class);
        Task<?> timer = ctx.createTimer(Duration.ofDays(7)); 
        Task<?> winner = ctx.anyOf(event, timer).await(); 
        if (winner == event) {
            response = event.await();
            if (!ctx.getIsReplaying()) {
                context.getLogger().info("winner is the external event");
                ctx.continueAsNew(response);
                return;
            }
        }
        else {
            context.getLogger().info("winner is the timer");
        }

The log line "winner is the external event" keeps on printing infinitely. (Note that it's under a condition, so the log line is not replaying. The event is somehow being replayed again and again with each execution.