uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Activity start to close timeout includes the time spent in activity poller buffer #3475

Closed sokada1221 closed 3 years ago

sokada1221 commented 4 years ago

Describe the bug

  1. The activity poller picks up two tasks, A and B. One of them (say, task A) is a long-running activity that blocks with Thread.sleep.
  2. The task execution thread starts working on task A, which takes a long time (~6 minutes).
  3. Once it completes task A, it starts working on task B.
  4. After it completes task B, it finds that the workflow has already failed because task B exceeded its start-to-close timeout (1 minute).

To Reproduce

Steps to reproduce the behavior:

  1. Register an activity with two methods with different runtime requirements: one that takes a long time (~6 minutes) and another that takes a very short time (~20 ms). We also set the following worker option (a fuller setup sketch follows after these steps):
    WorkerOptions hostSpecificWorkerOptions =
        new WorkerOptions.Builder()
            .setMaxConcurrentActivityExecutionSize(8)
            .setMetricsScope(metricsScope)
            .build();
  2. Submit > 50 workflows that call the tasks sequentially (task B -> task A).
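
For reference, a minimal sketch of the reproduction setup, reusing the names that appear in the logs under Additional context (HostSpecificActivities, taskA/taskB, task list HostSpecificTaskList, domain test). The worker registration calls follow the cadence-client 2.x API, but constructor signatures may vary slightly between versions, and the metrics scope wiring is omitted:

    import com.uber.cadence.activity.ActivityMethod;
    import com.uber.cadence.worker.Worker;
    import com.uber.cadence.worker.WorkerOptions;

    public interface HostSpecificActivities {
      @ActivityMethod
      void taskA(); // long-running: blocks for ~6 minutes

      @ActivityMethod
      void taskB(); // short: ~20 ms
    }

    class HostSpecificActivitiesImpl implements HostSpecificActivities {
      @Override
      public void taskA() {
        try {
          Thread.sleep(6 * 60 * 1000); // simulate the long-running work
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      @Override
      public void taskB() {
        // ~20 ms of work
      }
    }

    class WorkerMain {
      public static void main(String[] args) {
        WorkerOptions hostSpecificWorkerOptions =
            new WorkerOptions.Builder()
                .setMaxConcurrentActivityExecutionSize(8)
                // .setMetricsScope(metricsScope) omitted here
                .build();
        Worker.Factory factory = new Worker.Factory("test"); // domain from the logs
        Worker worker = factory.newWorker("HostSpecificTaskList", hostSpecificWorkerOptions);
        worker.registerActivitiesImplementations(new HostSpecificActivitiesImpl());
        factory.start();
      }
    }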

Expected behavior

Activities should not time out while they are sitting in the poller buffer.


Additional context

2020-08-25T00:00:59     - Task A ActivityTaskStarted
2020-08-25T00:01:05,089 - Starts executing another task Z
2020-08-25T00:01:05,101 - Finishes executing task Z
2020-08-25T00:01:05,194 - Starts executing Task A
2020-08-25T00:01:08     - Task B ActivityTaskStarted
2020-08-25T00:07:05,357 - Finishes executing task A
2020-08-25T00:07:05,380 - Starts executing Task B
2020-08-25T00:07:05,394 - Finishes executing task B
2020-08-25T00:07:05,398 - java.lang.RuntimeException: Failure processing activity task
                          Caused by: com.uber.cadence.EntityNotExistsError: workflow execution already completed

2020-08-25T00:07:05,398 [Activity Executor taskList="HostSpecificTaskList", domain="test": 3] ERROR er.cadence.internal.worker.PollerOptions:120 - uncaught exception
java.lang.RuntimeException: Failure processing activity task. WorkflowID=3345, RunID=d585fcd6-5438-4c8b-8848-ad8b2439601b, ActivityType=HostSpecificActivities::taskB, ActivityID=6
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.wrapFailure(ActivityWorker.java:242) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.wrapFailure(ActivityWorker.java:143) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:76) ~[cadence-client-2.7.4.jar!/:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_252]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_252]
Caused by: com.uber.cadence.EntityNotExistsError: workflow execution already completed
        at com.uber.cadence.WorkflowService$RespondActivityTaskCompleted_result$RespondActivityTaskCompleted_resultStandardScheme.read(WorkflowService.java:25130) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.WorkflowService$RespondActivityTaskCompleted_result$RespondActivityTaskCompleted_resultStandardScheme.read(WorkflowService.java:25107) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.WorkflowService$RespondActivityTaskCompleted_result.read(WorkflowService.java:25017) ~[cadence-client-2.7.4.jar!/:?]
        at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:81) ~[libthrift-0.9.3.jar!/:0.9.3]
        at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:67) ~[libthrift-0.9.3.jar!/:0.9.3]
        at com.uber.tchannel.messages.ThriftSerializer.decodeBody(ThriftSerializer.java:72) ~[tchannel-core-0.8.5.jar!/:?]
        at com.uber.tchannel.messages.Serializer.decodeBody(Serializer.java:49) ~[tchannel-core-0.8.5.jar!/:?]
        at com.uber.tchannel.messages.EncodedResponse.getBody(EncodedResponse.java:85) ~[tchannel-core-0.8.5.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondActivityTaskCompleted(WorkflowServiceTChannel.java:1206) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondActivityTaskCompleted$16(WorkflowServiceTChannel.java:1193) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$measureRemoteProc$1(WorkflowServiceTChannel.java:595) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteProc(WorkflowServiceTChannel.java:592) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondActivityTaskCompleted(WorkflowServiceTChannel.java:1192) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.lambda$sendReply$1(ActivityWorker.java:260) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.sendReply(ActivityWorker.java:260) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:184) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:143) ~[cadence-client-2.7.4.jar!/:?]
        at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71) ~[cadence-client-2.7.4.jar!/:?]
        ... 3 more
meiliang86 commented 4 years ago

@sokada1221 Activity tasks are executed by different activity threads. Cadence does no synchronization between them, as they are totally independent. What matters here is setMaxConcurrentActivityExecutionSize, which determines the thread pool size. If you have more tasks (especially long-running ones) than your processing capacity, they will simply queue up. Remember that an activity is a black box to Cadence, so even if you sleep inside an activity we will not be able to evict your task from that thread. We will if you do Workflow.sleep inside your workflow.
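
To make the last point concrete, here is a minimal, hedged sketch (the workflow and activity types are hypothetical): Workflow.sleep is a durable, framework-managed timer, whereas Thread.sleep inside an activity pins one of the execution threads counted by setMaxConcurrentActivityExecutionSize for the whole duration.

    import com.uber.cadence.workflow.Workflow;
    import com.uber.cadence.workflow.WorkflowMethod;
    import java.time.Duration;

    public interface SleepyWorkflow {
      @WorkflowMethod
      void run();
    }

    class SleepyWorkflowImpl implements SleepyWorkflow {
      @Override
      public void run() {
        // Durable sleep: backed by a timer in the workflow history, so the
        // framework does not need to hold a blocked thread for 6 minutes.
        Workflow.sleep(Duration.ofMinutes(6));
      }
    }

    class SleepyActivitiesImpl {
      public void longTask() throws InterruptedException {
        // Opaque to Cadence: this blocks one activity execution thread for 6 minutes
        // and cannot be evicted, reducing the effective activity concurrency.
        Thread.sleep(Duration.ofMinutes(6).toMillis());
      }
    }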

sokada1221 commented 4 years ago

@meiliang86 Thanks for your insight! But I don't think we need synchronization between different activities here. What I'm wondering is: why do the poller threads pick up more tasks than the activity worker (activity execution threads) can handle? In the example above, the poller picks up 2 tasks for 1 execution thread, which leads to very unpredictable timeouts.

sokada1221 commented 4 years ago

Found the root cause of the problem. Because an activity is considered "started" as soon as the activity poller polls the task from Cadence, an activity with a shorter start-to-close timeout can time out while it sits in the buffer waiting for the task executor to finish long-running activities. In the timeline above, for example, task B is marked ActivityTaskStarted at 00:01:08 but only begins executing at 00:07:05, so its 1-minute start-to-close timeout expires long before it ever runs. The issue is amplified especially when:

  1. Deviation in activity method execution times is high; and
  2. Task execution thread pool is more likely to be blocked (e.g. due to max concurrency limit)

Renaming the ticket to better describe the problem.

sokada1221 commented 4 years ago

The workaround is to align the execution times of the activity methods and to set the same start-to-close timeout for all methods (a sketch of this configuration follows after the list below).

Note that this doesn't completely solve the problem, because the start-to-close timeout still includes the time spent in the poller buffer. Completely addressing it would require a new design to either:

  1. Remove the poller buffer (as an option, of course); or
  2. Revise the state management so that the time spent in the poller buffer isn't counted against the start-to-close timeout (e.g. introduce a new internal state for activities waiting in the poller buffer).
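
As referenced above, a hedged sketch of that workaround configuration (the HostSpecificWorkflow interface is hypothetical and the timeout values are illustrative; the builder methods follow the cadence-client ActivityOptions API):

    import com.uber.cadence.activity.ActivityOptions;
    import com.uber.cadence.workflow.Workflow;
    import com.uber.cadence.workflow.WorkflowMethod;
    import java.time.Duration;

    public interface HostSpecificWorkflow {
      @WorkflowMethod
      void run();
    }

    class HostSpecificWorkflowImpl implements HostSpecificWorkflow {

      // One shared ActivityOptions instance: every activity method gets the same
      // start-to-close timeout, sized for the slowest activity, so a short task
      // buffered behind a long one does not hit a tighter timeout first.
      private final ActivityOptions uniformOptions =
          new ActivityOptions.Builder()
              .setScheduleToCloseTimeout(Duration.ofMinutes(10)) // illustrative value
              .setStartToCloseTimeout(Duration.ofMinutes(7))     // >= longest expected runtime
              .build();

      // HostSpecificActivities is the activity interface sketched earlier.
      private final HostSpecificActivities activities =
          Workflow.newActivityStub(HostSpecificActivities.class, uniformOptions);

      @Override
      public void run() {
        activities.taskB(); // short task, same generous timeout
        activities.taskA(); // long task
      }
    }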
meiliang86 commented 3 years ago

@sokada1221 Yes, you are right. With the current design, the poller always pre-fetches tasks, up to pollThreadCount. The fix is to not pull tasks while all the processing threads are busy. BTW this is a Java client issue, so I will move it to the client repo.
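
For illustration only (this is not the cadence-client implementation), a conceptual sketch of that fix in plain Java: gate every poll on a free execution slot, so a task is never fetched from the server until a thread is available to run it, and start-to-close time cannot accumulate in a local buffer.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    class GatedActivityPoller {
      /** Hypothetical stand-in for whatever fetches activity tasks from the server. */
      interface TaskSource {
        Runnable pollTask() throws InterruptedException;
      }

      private final Semaphore executionSlots;
      private final ExecutorService executor;
      private final TaskSource source;

      GatedActivityPoller(int maxConcurrentActivityExecutionSize, TaskSource source) {
        this.executionSlots = new Semaphore(maxConcurrentActivityExecutionSize);
        this.executor = Executors.newFixedThreadPool(maxConcurrentActivityExecutionSize);
        this.source = source;
      }

      void pollLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
          executionSlots.acquire();          // block until an execution thread is free
          Runnable task = source.pollTask(); // only now fetch work from the server
          executor.execute(() -> {
            try {
              task.run();
            } finally {
              executionSlots.release();      // free the slot once execution finishes
            }
          });
        }
      }
    }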

meiliang86 commented 3 years ago

One mitigation is to use a smaller schedule-to-start timeout for the short activity, so it can get retried on a different worker node more quickly. This, of course, assumes that you have enough worker capacity; if all workers are busy processing long-running tasks, there isn't much we can do theoretically, and the activity can still time out.
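
A hedged sketch of that mitigation (the timeout and retry values are illustrative, HostSpecificActivities is the interface sketched earlier, and the builder methods follow the cadence-client ActivityOptions/RetryOptions API, whose validation rules may vary by version): give the short activity a tight schedule-to-start timeout plus a retry policy, so a task that cannot be started promptly is retried and can land on a less busy worker.

    import com.uber.cadence.activity.ActivityOptions;
    import com.uber.cadence.common.RetryOptions;
    import com.uber.cadence.workflow.Workflow;
    import java.time.Duration;

    class ShortActivityStubFactory {
      static HostSpecificActivities newShortTaskStub() {
        ActivityOptions shortTaskOptions =
            new ActivityOptions.Builder()
                .setScheduleToStartTimeout(Duration.ofSeconds(10)) // give up quickly if no worker starts it
                .setStartToCloseTimeout(Duration.ofSeconds(30))
                .setRetryOptions(
                    new RetryOptions.Builder()
                        .setInitialInterval(Duration.ofSeconds(1))
                        .setMaximumAttempts(5)
                        .build())
                .build();
        return Workflow.newActivityStub(HostSpecificActivities.class, shortTaskOptions);
      }
    }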