Open daniel-goldstein opened 5 months ago
This diff uses the index on instances.removed
diff --git a/batch/batch/driver/canceller.py b/batch/batch/driver/canceller.py
index d438a8519b..594b180221 100644
--- a/batch/batch/driver/canceller.py
+++ b/batch/batch/driver/canceller.py
@@ -371,10 +371,9 @@ SELECT attempts.*
FROM attempts
INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
LEFT JOIN instances ON attempts.instance_name = instances.name
-WHERE attempts.start_time IS NOT NULL
- AND attempts.end_time IS NULL
+WHERE attempts.end_time IS NULL
AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
- AND instances.`state` = 'active'
+ AND instances.removed = 0
ORDER BY attempts.start_time ASC
LIMIT 300;
""",
What happened?
Batch does not guarantee that a job has at most one running attempt at any given time. Double scheduling is rare but does happen, so a background task checks the database for "orphaned" attempts (attempts that are running but are not recorded as the current attempt for their job) and stops them to reduce wasted spend. The query that polls the database for these attempts does a needless full scan of the instances table. Below is how I discovered the inefficiency:
https://github.com/hail-is/hail/blob/091e6612752010880a130cf4010897e87ea2a864/batch/batch/driver/canceller.py#L373-L382
The query plan, as shown in Query Insights:
1. row
id: 1
select_type: SIMPLE
table: instances
partitions: NULL
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 1150201
filtered: 10.00
Extra: Using where; Using temporary; Using filesort

2. row
id: 1
select_type: SIMPLE
table: attempts
partitions: NULL
type: ref
possible_keys: PRIMARY,attempts_instance_name
key: attempts_instance_name
key_len: 303
ref: batch.instances.name
rows: 91
filtered: 9.00
Extra: Using where

3. row
id: 1
select_type: SIMPLE
table: jobs
partitions: NULL
type: eq_ref
possible_keys: PRIMARY,jobs_batch_id_state_always_run_cancelled,jobs_batch_id_state_always_run_inst_coll_cancelled,jobs_batch_id_update_id,jobs_batch_id_always_run_n_regions_regions_bits_rep_job_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_id,jobs_batch_id_job_group_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_group_id
key: PRIMARY
key_len: 12
ref: batch.attempts.batch_id,batch.attempts.job_id
rows: 1
filtered: 98.10
Extra: Using where

3 rows in set, 1 warning (0.00 sec)
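To make the orphaned-attempt predicate concrete, here is a minimal sketch that runs the rewritten query from the diff against a toy in-memory database. It is an illustration only: it uses SQLite rather than the production MySQL database, the schema is a hypothetical reduction to the columns the query touches, and the index name `instances_removed` and the helpers `find_orphaned_attempts` and `demo` are made up for the example. It demonstrates which rows the query selects, not how the MySQL planner behaves.

```python
import sqlite3

# Hypothetical, heavily simplified schema: only the columns the query reads.
# The index name is an assumption; the real schema may name it differently.
SCHEMA = """
CREATE TABLE instances (
    name TEXT PRIMARY KEY,
    removed INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX instances_removed ON instances (removed);
CREATE TABLE jobs (
    batch_id INTEGER, job_id INTEGER, state TEXT, attempt_id TEXT,
    PRIMARY KEY (batch_id, job_id)
);
CREATE TABLE attempts (
    batch_id INTEGER, job_id INTEGER, attempt_id TEXT, instance_name TEXT,
    start_time INTEGER, end_time INTEGER,
    PRIMARY KEY (batch_id, job_id, attempt_id)
);
"""

# The query as it reads after the diff is applied.
ORPHANED_ATTEMPTS = """
SELECT attempts.*
FROM attempts
INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
LEFT JOIN instances ON attempts.instance_name = instances.name
WHERE attempts.end_time IS NULL
  AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
  AND instances.removed = 0
ORDER BY attempts.start_time ASC
LIMIT 300;
"""


def find_orphaned_attempts(conn):
    """Return (batch_id, job_id, attempt_id) for each orphaned attempt."""
    return [(row[0], row[1], row[2]) for row in conn.execute(ORPHANED_ATTEMPTS)]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?)",
        [("live-1", 0), ("live-2", 0), ("gone-1", 1)],
    )
    # The job considers attempt "a2" to be its current attempt.
    conn.execute("INSERT INTO jobs VALUES (1, 1, 'Running', 'a2')")
    conn.executemany(
        "INSERT INTO attempts VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "a1", "live-1", 100, None),  # unfinished, not current: orphaned
            (1, 1, "a2", "live-2", 200, None),  # current attempt: not orphaned
            (1, 1, "a0", "gone-1", 50, None),   # instance already removed: skipped
        ],
    )
    return find_orphaned_attempts(conn)


if __name__ == "__main__":
    print(demo())
```

Only attempt `a1` is returned: it has no end time, its instance has not been removed, and the job records a different attempt as current.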