If a runner goes down while it still has crates assigned to it, those crates are not rescheduled onto another runner, which stalls the experiment indefinitely.
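To release the crates held by a worker so they can be picked up again, reset them to queued (replace 'agent:gcp-1' with the name of the affected worker):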
update experiment_crates set status = 'queued', assigned_to = null where status = 'running' and assigned_to = 'agent:gcp-1';
To see how many crates each worker is currently holding:
select assigned_to, count(*) from experiment_crates where status = 'running' group by assigned_to;
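
As a rough illustration of how the two queries fit together, here is a minimal sketch that runs them against a throwaway in-memory SQLite database. It assumes an SQLite-compatible database; the `crate` column, the sample rows, and the helper names `crates_per_worker` and `release_worker` are hypothetical and only exist to make the example self-contained. Only the `status` and `assigned_to` columns come from the queries above.

```python
import sqlite3

# Toy stand-in for the real database. Only `status` and `assigned_to`
# come from the queries above; the `crate` column is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table experiment_crates (crate text, status text, assigned_to text)"
)
conn.executemany(
    "insert into experiment_crates values (?, ?, ?)",
    [
        ("a", "running", "agent:gcp-1"),
        ("b", "running", "agent:gcp-1"),
        ("c", "running", "agent:gcp-2"),
        ("d", "queued", None),
    ],
)


def crates_per_worker(conn):
    """Return {worker: number of crates in 'running' state}."""
    rows = conn.execute(
        "select assigned_to, count(*) from experiment_crates "
        "where status = 'running' group by assigned_to"
    )
    return dict(rows.fetchall())


def release_worker(conn, worker):
    """Put the crates held by `worker` back in the queue; return how many."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "update experiment_crates set status = 'queued', assigned_to = null "
            "where status = 'running' and assigned_to = ?",
            (worker,),
        )
    return cur.rowcount


before = crates_per_worker(conn)               # inspect first
released = release_worker(conn, "agent:gcp-1")  # then release the dead worker
after = crates_per_worker(conn)                # confirm it no longer holds crates
```

Checking the per-worker counts before and after the update makes it easy to confirm that only the intended worker was touched.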