treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

Optimizing WorkflowExecutor.enqueueReadyTasks #1752

Open frsyuki opened 2 years ago

frsyuki commented 2 years ago

WorkflowExecutor.enqueueReadyTasks runs in the following steps:

  1. Take at most executor.enqueue_fetch_size number of tasks
  2. For each task:
     2-1. Lock the task if the task is still ready
     2-2. Enqueue the task

Step 2-1 checks task state again ("recheck") because another thread may enqueue the task during the operation (notice that step 1 doesn't lock the tasks).

"recheck" needs to run a SELECT statement on the database to read the task state and lock the task atomically. If "recheck" doesn't pass, the task is ignored (step 2-2 doesn't run).

"recheck" may fail very frequently when the following conditions are met:

Frequently failing "recheck" means that many SELECT operations waste database workload. It also wastes the digdag server's thread time.
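To make the cost concrete, here is a minimal in-memory sketch of the existing flow. It is not digdag's actual code: `RecheckSketch`, its methods, and the counters are hypothetical names invented for illustration, and a `synchronized` method stands in for the database row lock. It assumes two callers (threads or servers) fetch the same unlocked list of ready tasks before either one enqueues anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the task table; not digdag's storage API.
class RecheckSketch {
    enum State { READY, QUEUED }

    private final Map<Long, State> tasks = new LinkedHashMap<>();
    int recheckQueries = 0;  // simulated "SELECT + lock" round trips (step 2-1)
    int failedRechecks = 0;  // rechecks that found the task no longer ready

    void addReadyTask(long id) { tasks.put(id, State.READY); }

    // Step 1: fetch up to fetchSize ready task ids WITHOUT locking them.
    synchronized List<Long> findReadyTaskIds(int fetchSize) {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, State> e : tasks.entrySet()) {
            if (ids.size() >= fetchSize) break;
            if (e.getValue() == State.READY) ids.add(e.getKey());
        }
        return ids;
    }

    // Steps 2-1 and 2-2: recheck the state under the lock, enqueue only if still ready.
    synchronized boolean lockAndEnqueueIfReady(long id) {
        recheckQueries++;
        if (tasks.get(id) != State.READY) {
            failedRechecks++;  // another caller already enqueued this task
            return false;
        }
        tasks.put(id, State.QUEUED);
        return true;
    }
}
```

If two callers each run `findReadyTaskIds` over the same three ready tasks and then both loop over their lists, the sketch performs six rechecks, of which three fail: every recheck by the slower caller is wasted.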

This change optimizes steps 1 and 2 as follows:

  1. Find one ready task and lock it atomically
  2. Enqueue the task

On PostgreSQL, step 1 can be done using one SELECT statement. This solves the potential problem described above.
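The optimized flow can be sketched the same way. The SQL string below is only an assumption about what a single-statement find-and-lock could look like on PostgreSQL (`FOR UPDATE SKIP LOCKED` is available since PostgreSQL 9.5). The table and column names are hypothetical, and the query actually used by this change is not shown here. The in-memory method is an analogue of that statement, again with `synchronized` standing in for the row lock.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the optimized flow; not digdag's actual implementation.
class FindAndLockSketch {
    enum State { READY, QUEUED }

    // Assumed shape of a one-statement find-and-lock on PostgreSQL.
    // "tasks", "id" and "state" are illustrative names only.
    static final String FIND_AND_LOCK_SQL =
        "SELECT id FROM tasks WHERE state = ? ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED";

    private final Map<Long, State> tasks = new LinkedHashMap<>();
    int queries = 0;  // simulated round trips to the database

    void addReadyTask(long id) { tasks.put(id, State.READY); }

    // Step 1 (find one ready task and lock it atomically) followed by step 2 (enqueue).
    // Because the find and the lock are one atomic operation, no caller can be
    // handed a task that another caller has already claimed.
    synchronized Optional<Long> findLockAndEnqueueOne() {
        queries++;
        for (Map.Entry<Long, State> e : tasks.entrySet()) {
            if (e.getValue() == State.READY) {
                e.setValue(State.QUEUED);
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();  // nothing ready right now
    }
}
```

In this sketch every call either claims a distinct ready task or reports that none is left, so the only call that claims nothing is the final one that discovers the queue is empty.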

On H2 database, step 1 needs two SELECT statements, so this commit won't improve performance there. Note, however, that the problem described above won't happen on H2 because an H2 database isn't shared by multiple servers.