Closed: sokada1221 closed this issue 3 years ago.
@sokada1221 Activity tasks are executed by separate activity threads; Cadence does no synchronization between them because they are completely independent. What matters here is setMaxConcurrentActivityExecutionSize, which determines the thread pool size. If you have more tasks (especially long-running ones) than processing capacity, they will simply be queued up. Remember that an activity is a black box to Cadence, so even if you sleep inside an activity we cannot evict your task from its thread. We can if you call workflow.sleep inside your workflow.
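As a rough illustration of the queueing behavior described above, here is a JDK-only sketch (no Cadence dependency; the class name, latch choreography, and task counts are invented for this example). A fixed-size thread pool stands in for the activity execution threads controlled by setMaxConcurrentActivityExecutionSize: once the single thread is busy with a long-running task, any further submissions just sit in the queue.

```java
import java.util.concurrent.*;

public class ExecutionSizeDemo {
    // Returns how many tasks are queued (not executing) while one
    // long-running task occupies the only execution thread.
    static int queuedWhileLongTaskRuns() throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());    // 1 "activity execution thread"
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        pool.submit(() -> {                      // long-running activity
            started.countDown();
            try { release.await(); } catch (InterruptedException ignored) {}
        });
        started.await();                         // long task is now running
        pool.submit(() -> {});                   // short activities pile up
        pool.submit(() -> {});
        int queued = pool.getQueue().size();     // tasks waiting in the queue
        release.countDown();                     // let the long task finish
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return queued;                           // 2
    }

    public static void main(String[] args) throws Exception {
        System.out.println("queued: " + queuedWhileLongTaskRuns());
    }
}
```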
@meiliang86 Thanks for your insight! But I don't think we need synchronization between different activities here. What I'm wondering is: why are the poller threads picking up more tasks than the activity worker (activity execution threads) can handle? In the example above, the poller picks up 2 tasks for 1 execution thread, which leads to very unpredictable timeouts.
Found the root cause of the problem. Since an activity is considered "started" when the activity poller polls the task from Cadence, activities with shorter start-to-close timeouts will time out in the buffer while the task executor is busy with long-running activities. The issue is especially pronounced when:
Renaming the ticket to better describe the problem.
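The root-cause timing just described can be reduced to a simple inequality: the start-to-close clock begins when the poller fetches the task, not when an execution thread picks it up, so buffer wait counts against the timeout. A minimal sketch (class name and all numbers are illustrative, not Cadence internals):

```java
public class BufferTimeoutDemo {
    // True if the activity would exceed its start-to-close timeout,
    // given time spent waiting in the poller buffer plus execution time.
    static boolean timesOut(long bufferWaitMs, long execMs, long startToCloseMs) {
        return bufferWaitMs + execMs > startToCloseMs;
    }

    public static void main(String[] args) {
        // Short activity: 5s start-to-close, 100ms of real work, but it
        // waits 6 minutes in the buffer behind a long-running activity.
        System.out.println(timesOut(6 * 60_000, 100, 5_000));  // true: times out
        // The same activity with no buffer wait completes comfortably.
        System.out.println(timesOut(0, 100, 5_000));           // false
    }
}
```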
The workaround is to align the execution times of the activity methods and set the same start-to-close timeout for all of them.
Note that this doesn't completely solve the problem, since the start-to-close timeout still includes time spent in the poller buffer. To fully address it, we'd need a new design to either:
@sokada1221 Yes, you are right. With the current design, the poller always pre-fetches tasks, up to pollThreadCount. The fix is to not poll for tasks while all the processing threads are busy. BTW this is a Java client issue, so I will move it to the client repo.
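The fix described above can be sketched with a counting semaphore: acquire an execution slot before polling, so the poller never fetches a task it cannot immediately run and nothing sits in a local buffer. This is only an assumption of how such a fix could look, not the actual java-client implementation; the class and method names are made up.

```java
import java.util.concurrent.*;

public class GatedPoller {
    final Semaphore slots;          // free execution slots
    final ExecutorService pool;     // activity execution threads

    GatedPoller(int maxConcurrent) {
        slots = new Semaphore(maxConcurrent);
        pool = Executors.newFixedThreadPool(maxConcurrent);
    }

    // One iteration of the poll loop: block on a slot BEFORE fetching,
    // instead of pre-fetching and buffering the task locally.
    void pollLoopOnce(Runnable task) throws InterruptedException {
        slots.acquire();
        pool.submit(() -> {
            try { task.run(); }
            finally { slots.release(); } // free the slot when done
        });
    }

    int availableSlots() { return slots.availablePermits(); }

    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        GatedPoller poller = new GatedPoller(1);
        poller.pollLoopOnce(() -> System.out.println("task ran immediately"));
        poller.shutdown();
        System.out.println("slots free after completion: " + poller.availableSlots());
    }
}
```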
One mitigation is to set a smaller schedule-to-start timeout for the short activity, so it gets retried on a different worker node more quickly. This, of course, assumes that you have enough worker capacity; if all workers are busy processing long-running tasks, there's not much we can do theoretically, and the activity can still time out.
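The effect of that mitigation can be modeled with a one-line calculation (the model and all numbers are illustrative, not Cadence internals): if an idle worker exists, a buffered task moves to it once the schedule-to-start timeout fires, so its start delay is the smaller of the busy worker's remaining work and the timeout.

```java
public class ScheduleToStartDemo {
    // Start delay of a short task when worker A is busy for busyMs and an
    // idle worker B exists: it is re-dispatched to B at scheduleToStartMs,
    // or starts on A when A frees up, whichever comes first.
    static long startDelayMs(long busyMs, long scheduleToStartMs) {
        return Math.min(busyMs, scheduleToStartMs);
    }

    public static void main(String[] args) {
        System.out.println(startDelayMs(360_000, 5_000));   // 5000: retried on B
        System.out.println(startDelayMs(360_000, 600_000)); // 360000: waits for A
    }
}
```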
Describe the bug
To Reproduce
Is the issue reproducible?
Steps to reproduce the behavior:
Expected behavior
Activities should not time out while sitting in the poller buffer.
Screenshots
N/A
Additional context
2020-08-25T00:00:59 - Task A ActivityTaskStarted
2020-08-25T00:01:05,089 - Starts executing another task Z
2020-08-25T00:01:05,101 - Finishes executing task Z
2020-08-25T00:01:05,194 - Starts executing Task A
2020-08-25T00:01:08 - Task B ActivityTaskStarted
2020-08-25T00:07:05,357 - Finishes executing task A
2020-08-25T00:07:05,380 - Starts executing Task B
2020-08-25T00:07:05,394 - Finishes executing task B
2020-08-25T00:07:05,398 java.lang.RuntimeException: Failure processing activity task
Caused by: com.uber.cadence.EntityNotExistsError: workflow execution already completed