hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License
972 stars 243 forks source link

[batch] Orphaned attempts loop performs full scan of instances table #14460

Open daniel-goldstein opened 5 months ago

daniel-goldstein commented 5 months ago

What happened?

Batch does not guarantee that there is always at most 1 running attempt for a job at any given time. While rare, this double scheduling can sometimes happen so there is a background task that checks the database for "orphaned" attempts -- attempts that are running but are not noted as the current attempt for the relevant job -- and stops them to reduce wasted spend. This query that polls the database for attempts to remove does a needless scan of the instances table. I'll describe below the process by which I discovered the inefficiency:

  1. GCP Cloud SQL has a nice feature Query Insights, in which reports latencies and rows scanned by popular queries. For singular queries it can show a graph of the query and indicate bottlenecks. The below query is currently scanning over a million rows of the instances table is taking on average 642.37 ms:

https://github.com/hail-is/hail/blob/091e6612752010880a130cf4010897e87ea2a864/batch/batch/driver/canceller.py#L373-L382

as shown here from Query Insights: Screenshot 2024-04-11 at 10 31 17 AM

  1. The thick edge on the instances scan indicates that the where condition for instances is not using an index. We can verify this by explaining the query against the DB:
    
    > kssh admin-pod
    > mysql
    mysql> use batch;
    mysql> EXPLAIN SELECT attempts.*
    -> FROM attempts
    -> INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
    -> LEFT JOIN instances ON attempts.instance_name = instances.name
    -> WHERE attempts.start_time IS NOT NULL
    ->   AND attempts.end_time IS NULL
    ->   AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
    ->   AND instances.`state` = 'active'
    -> ORDER BY attempts.start_time ASC
    -> LIMIT 300\G;

1. row id: 1 select_type: SIMPLE table: instances partitions: NULL type: ALL possible_keys: PRIMARY key: NULL key_len: NULL ref: NULL rows: 1150201 filtered: 10.00 Extra: Using where; Using temporary; Using filesort 2. row id: 1 select_type: SIMPLE table: attempts partitions: NULL type: ref possible_keys: PRIMARY,attempts_instance_name key: attempts_instance_name key_len: 303 ref: batch.instances.name rows: 91 filtered: 9.00 Extra: Using where 3. row id: 1 select_type: SIMPLE table: jobs partitions: NULL type: eq_ref possible_keys: PRIMARY,jobs_batch_id_state_always_run_cancelled,jobs_batch_id_state_always_run_inst_coll_cancelled,jobs_batch_id_update_id,jobs_batch_id_always_run_n_regions_regions_bits_rep_job_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_id,jobs_batch_id_job_group_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_group_id key: PRIMARY key_len: 12 ref: batch.attempts.batch_id,batch.attempts.job_id rows: 1 filtered: 98.10 Extra: Using where 3 rows in set, 1 warning (0.00 sec)


This is not great:
     rows: 1150201
 filtered: 10.00
    Extra: Using where; Using temporary; Using filesort

what we want to see is a low number of rows, a high percent filtered, and something like `Extra: Using where; Using index;`

3. We can then verify that this finding aligns with our current understanding of the database schema where there is no index on `instances.state`: 
https://github.com/hail-is/hail/blob/1f3a0503926b65f479dce6d5eb105236632f8d07/batch/sql/estimated-current.sql#L113-L138

### Remaining questions
1. Why do we need to join against the instances table to find orphaned attempts? Can we not?
2. If we do need to join against the instances table, should we create an index on `instances.state`? How does `instances.removed` relate to this use case since it seems relevant and already has an index?

### Deliverable
This query should perform indexed lookups of involved tables. The PR should compare the present and proposed EXPLAINs
and provide some manual timing comparisons.

### Version

0.2.130

### Relevant log output

_No response_
daniel-goldstein commented 2 months ago

This diff uses the index on instances.removed

diff --git a/batch/batch/driver/canceller.py b/batch/batch/driver/canceller.py
index d438a8519b..594b180221 100644
--- a/batch/batch/driver/canceller.py
+++ b/batch/batch/driver/canceller.py
@@ -371,10 +371,9 @@ SELECT attempts.*
 FROM attempts
 INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
 LEFT JOIN instances ON attempts.instance_name = instances.name
-WHERE attempts.start_time IS NOT NULL
-  AND attempts.end_time IS NULL
+WHERE attempts.end_time IS NULL
   AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
-  AND instances.`state` = 'active'
+  AND instances.removed = 0
 ORDER BY attempts.start_time ASC
 LIMIT 300;
 """,