WorkflowExecutor.enqueueReadyTasks runs in the following steps:
1. Fetch at most executor.enqueue_fetch_size ready tasks
2. For each task,
   2-1. Lock the task if the task is still ready
   2-2. Enqueue the task
Step 2-1 checks the task state again ("recheck") because another thread
may have enqueued the task in the meantime (note that step 1 doesn't
lock the tasks).
"recheck" needs to run a SELECT statement on the database to read the
state and lock the task atomically. If "recheck" doesn't pass, the task
is skipped (step 2-2 doesn't run).
"recheck" may fail very frequently when the following conditions are met:
- executor.enqueue_fetch_size is large (the default, 100, is already large)
- Many threads run against the same database (e.g. many digdag servers exist)
- Step 2 takes a relatively long time (e.g. the latency of database
  operations is high, the database is temporarily overloaded, or the
  digdag server is temporarily overloaded)
A frequently failing "recheck" means that many SELECT operations are
wasted database work. It also wastes the digdag server's thread time.
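
To make the wasted work concrete, here is a minimal in-memory sketch of the batch-then-recheck flow. It is not digdag's actual code: the class `BatchRecheckSketch`, its methods, and the `ConcurrentHashMap` standing in for the tasks table are all hypothetical. Two executors fetch the same batch without locking, so every recheck by the second one fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the old flow; names are illustrative, not digdag's API.
public class BatchRecheckSketch {
    enum State { READY, LOCKED }

    private final Map<Long, State> tasks;  // stands in for the shared tasks table
    int wastedRechecks = 0;                // rechecks that found the task already taken

    BatchRecheckSketch(Map<Long, State> tasks) { this.tasks = tasks; }

    // Step 1: read up to fetchSize ready task ids WITHOUT locking them.
    List<Long> fetchReadyTaskIds(int fetchSize) {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, State> e : tasks.entrySet()) {
            if (ids.size() >= fetchSize) break;
            if (e.getValue() == State.READY) ids.add(e.getKey());
        }
        return ids;
    }

    // Step 2-1 ("recheck"): lock the task only if it is still READY.
    boolean lockIfStillReady(long id) {
        return tasks.replace(id, State.READY, State.LOCKED);
    }

    // Step 2 for a fetched batch; returns the ids that were actually enqueued.
    List<Long> enqueue(List<Long> fetched) {
        List<Long> enqueued = new ArrayList<>();
        for (long id : fetched) {
            if (lockIfStillReady(id)) {
                enqueued.add(id);      // step 2-2
            } else {
                wastedRechecks++;      // one round trip spent for nothing
            }
        }
        return enqueued;
    }

    // Two executors fetch the same 3 tasks, then process them one after the other.
    static int demo() {
        Map<Long, State> table = new ConcurrentHashMap<>();
        for (long id = 1; id <= 3; id++) table.put(id, State.READY);
        BatchRecheckSketch a = new BatchRecheckSketch(table);
        BatchRecheckSketch b = new BatchRecheckSketch(table);
        List<Long> fetchedByA = a.fetchReadyTaskIds(100);
        List<Long> fetchedByB = b.fetchReadyTaskIds(100);
        a.enqueue(fetchedByA);
        b.enqueue(fetchedByB);
        return b.wastedRechecks;  // every recheck by b failed
    }
}
```

In this model the number of wasted rechecks grows with the batch size and with the number of executors that fetched overlapping batches, which matches the conditions listed above.
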
This change optimizes steps 1 and 2 as follows:
1. Find one ready task and lock it atomically
2. Enqueue the task
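
The two steps above can be sketched in memory as follows. This is an illustration, not digdag's implementation: `LockOneSketch` and its methods are hypothetical, the map stands in for the tasks table, and the sketch assumes the two steps repeat until no ready task is left. Each claim is an atomic compare-and-set, so a task is never handed to two callers and no separate recheck is needed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the new flow; names are illustrative, not digdag's API.
public class LockOneSketch {
    enum State { READY, LOCKED }

    private final Map<Long, State> tasks;  // stands in for the shared tasks table

    LockOneSketch(Map<Long, State> tasks) { this.tasks = tasks; }

    // Step 1: find a ready task and claim it with an atomic compare-and-set.
    Optional<Long> lockOneReadyTask() {
        for (Long id : tasks.keySet()) {
            if (tasks.replace(id, State.READY, State.LOCKED)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    // Repeat steps 1 and 2 until no ready task remains; returns how many were enqueued.
    int enqueueReadyTasks() {
        int enqueued = 0;
        while (lockOneReadyTask().isPresent()) {
            enqueued++;  // step 2: enqueue the task that was just locked
        }
        return enqueued;
    }

    // Two executors share 3 tasks; each task is claimed exactly once.
    static int demo() {
        Map<Long, State> table = new ConcurrentHashMap<>();
        for (long id = 1; id <= 3; id++) table.put(id, State.READY);
        LockOneSketch a = new LockOneSketch(table);
        LockOneSketch b = new LockOneSketch(table);
        long first = a.lockOneReadyTask().get();
        long second = b.lockOneReadyTask().get();
        if (first == second) throw new IllegalStateException("task claimed twice");
        return 2 + a.enqueueReadyTasks();  // total number of tasks claimed
    }
}
```
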
On PostgreSQL, step 1 can be done with a single SELECT statement. This
solves the potential problem described above.
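
One plausible shape for such a single statement is a row-locking SELECT. The sketch below is an assumption, not the query this change actually uses: the `tasks` table, the `id` and `state` columns, the integer state code, and the choice of `FOR UPDATE SKIP LOCKED` (available in PostgreSQL 9.5 and later) are all hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.OptionalLong;

// Hypothetical JDBC sketch; table, column, and class names are assumptions.
public class PgLockOneSketch {
    // Picks one ready row and locks it, skipping rows locked by other transactions.
    static final String LOCK_ONE_READY_TASK =
        "SELECT id FROM tasks WHERE state = ? ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED";

    // Must run inside a transaction (auto-commit off): the row lock is held until
    // the caller commits or rolls back.
    static OptionalLong lockOneReadyTask(Connection conn, int readyState) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(LOCK_ONE_READY_TASK)) {
            stmt.setInt(1, readyState);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? OptionalLong.of(rs.getLong(1)) : OptionalLong.empty();
            }
        }
    }
}
```

With a statement of this kind, a caller either gets a task it already holds the lock on or gets nothing, so there is no failed recheck to pay for.
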
On H2 database, step 1 needs two SELECT statements, so this commit
won't improve performance there. Note, however, that the problem above
won't happen on H2 because an H2 database isn't shared by multiple servers.