Optimizing WorkflowExecutor.enqueueReadyTasks

WorkflowExecutor.enqueueReadyTasks runs in the following steps:

1. Take at most executor.enqueue_fetch_size number of tasks
2. For each task,
   2-1. Lock the task if the task is still ready
   2-2. Enqueue the task

Step 2-1 checks the task state again ("recheck") because another thread may enqueue the task in the meantime (note that step 1 doesn't lock the tasks).

The "recheck" needs to run a SELECT statement on the database to get the state and lock the task atomically. If the "recheck" doesn't pass, the task is ignored (step 2-2 doesn't run).
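The fetch-then-recheck flow described above can be sketched with an in-memory map standing in for the task table. This is only an illustration under stated assumptions: `RecheckSketch`, its `State` enum, and all method names are hypothetical and are not digdag's actual classes, and the atomic `replace` call stands in for the locking SELECT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the task table; not digdag's real code.
final class RecheckSketch {
    enum State { READY, LOCKED }

    final Map<Long, State> tasks = new ConcurrentHashMap<>();
    int wastedRechecks = 0; // rechecks that found the task already taken

    // Step 1: read up to fetchSize ready task ids WITHOUT locking them.
    List<Long> findReadyTaskIds(int fetchSize) {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, State> e : tasks.entrySet()) {
            if (ids.size() >= fetchSize) {
                break;
            }
            if (e.getValue() == State.READY) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Step 2-1 ("recheck"): lock the task only if it is still READY.
    boolean lockIfStillReady(long id) {
        return tasks.replace(id, State.READY, State.LOCKED);
    }

    // Step 2: recheck each fetched id; returns how many tasks were enqueued.
    int enqueueFetched(List<Long> fetchedIds) {
        int enqueued = 0;
        for (long id : fetchedIds) {
            if (lockIfStillReady(id)) {
                enqueued++; // step 2-2 would push the task to the queue here
            } else {
                wastedRechecks++; // one round trip that produced nothing
            }
        }
        return enqueued;
    }
}
```

If two executors both call `findReadyTaskIds(100)` before either calls `enqueueFetched`, they hold the same ids. The first executor enqueues every task, and every recheck made by the second one fails.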

The "recheck" may fail very frequently when the following conditions are met:

- executor.enqueue_fetch_size is large (the default, 100, is already large)
- Many threads run on the same database (e.g. many digdag servers exist)
- Step 2 takes a relatively long time (e.g. database operations have high latency, the database is temporarily overloaded, or the digdag server is temporarily overloaded)

Frequent "recheck" failures mean that many SELECT operations consume database capacity for nothing. They also waste the digdag server's thread time.

This change optimizes steps 1 and 2 as follows:

1. Find one ready task and lock it atomically
2. Enqueue the task

On PostgreSQL, step 1 can be done with a single SELECT statement. This solves the potential problem described above.
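A minimal sketch of the find-and-lock flow, again using an in-memory map in place of the database. The class and method names are hypothetical, and repeating the two steps in a loop is an assumption about how they would be driven. The SQL constant shows one PostgreSQL (9.5+) clause that fits the description, `FOR UPDATE SKIP LOCKED`. Whether this change uses exactly that clause is an assumption, and the table and column names are made up.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

// Hypothetical in-memory stand-in for the task table; not digdag's real code.
final class LockOneSketch {
    enum State { READY, LOCKED }

    // Assumed PostgreSQL 9.5+ query for step 1; table and column names are hypothetical.
    static final String LOCK_ONE_READY_TASK_SQL =
            "SELECT id FROM tasks WHERE state = ? LIMIT 1 FOR UPDATE SKIP LOCKED";

    final Map<Long, State> tasks = new ConcurrentHashMap<>();

    // Step 1: find one ready task and lock it in one atomic operation.
    // A task that another thread already locked is simply skipped.
    Optional<Long> lockOneReadyTask() {
        for (Long id : tasks.keySet()) {
            if (tasks.replace(id, State.READY, State.LOCKED)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    // Repeats step 1 and step 2 until no ready task is left.
    // Returns the number of tasks handed to the enqueue callback.
    int enqueueReadyTasks(LongConsumer enqueue) {
        int enqueued = 0;
        Optional<Long> locked;
        while ((locked = lockOneReadyTask()).isPresent()) {
            enqueue.accept(locked.get()); // step 2: enqueue the locked task
            enqueued++;
        }
        return enqueued;
    }
}
```

Because a task id is only returned once it is already locked, there is no separate recheck that can fail after a round trip.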

On H2, step 1 needs two SELECT statements, so this commit won't improve performance there. Note, however, that the problem above won't happen on H2 because an H2 database isn't shared by multiple servers.
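For illustration, a two-statement find-and-lock on H2 might look like the JDBC sketch below: one SELECT picks a candidate, and a second `SELECT ... FOR UPDATE` locks it while confirming it is still ready. The SQL text, the `tasks` table, its columns, and the `readyState` parameter are all assumptions made for this example, not the statements used by the commit.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.OptionalLong;

// Hypothetical two-statement variant; table and column names are made up.
final class TwoStepLockSketch {
    static final String FIND_SQL =
            "SELECT id FROM tasks WHERE state = ? ORDER BY id LIMIT 1";
    static final String LOCK_SQL =
            "SELECT id FROM tasks WHERE id = ? AND state = ? FOR UPDATE";

    // Must run inside a transaction (auto-commit off) so the row lock is held
    // until the caller commits or rolls back.
    static OptionalLong lockOneReadyTask(Connection conn, int readyState) throws SQLException {
        long candidate;
        // First SELECT: pick one candidate without locking it.
        try (PreparedStatement find = conn.prepareStatement(FIND_SQL)) {
            find.setInt(1, readyState);
            try (ResultSet rs = find.executeQuery()) {
                if (!rs.next()) {
                    return OptionalLong.empty(); // nothing is ready
                }
                candidate = rs.getLong(1);
            }
        }
        // Second SELECT: lock the candidate and confirm it is still ready.
        try (PreparedStatement lock = conn.prepareStatement(LOCK_SQL)) {
            lock.setLong(1, candidate);
            lock.setInt(2, readyState);
            try (ResultSet rs = lock.executeQuery()) {
                return rs.next() ? OptionalLong.of(rs.getLong(1)) : OptionalLong.empty();
            }
        }
    }
}
```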

treasure-data / digdag

Optimizing WorkflowExecutor.enqueueReadyTasks #1752