Closed Mark-Simulacrum closed 2 years ago
@bors r+
This'll have no effect on the agents, and the server currently isn't updated until I manually redeploy, so this mostly just prepares for doing that tomorrow, when I have time to monitor.
:pushpin: Commit 5406236f326cbe50e6a1afc611938e8e3fdf4b70 has been approved by Mark-Simulacrum
:hourglass: Testing commit 5406236f326cbe50e6a1afc611938e8e3fdf4b70 with merge d89aa84657675ce3d7d9d41a9b818a2485f40ca0...
:sunny: Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing d89aa84657675ce3d7d9d41a9b818a2485f40ca0 to master...
Previously, a given crate was assigned to a particular agent, and that agent had to either complete it or fail (e.g., panic) before the crate was 'released' for another agent to attempt. This is error-prone and makes it harder for agents to be ephemeral (including their names): if an agent just vanishes, manual intervention is required to put its assigned crates back into the general queue.
We switch to a different assignment scheme. Now, when an agent requests crates from a given experiment, we dequeue crates that are queued (i.e., incomplete) and that either have not started yet or started more than 20 minutes ago. The current agent timeout is set to a blanket 15 minutes, and even if it were longer, we expect most crates to build quite quickly. Duplicate builds are also fine for our case: we'll just get a couple of extra results, which are already deduplicated when inserting into the results table.
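To make the new scheme concrete, here is a minimal in-memory sketch of the selection rule described above. It is not the code from this PR: `QueuedCrate`, `dequeue`, `record_result`, and `REASSIGN_AFTER_SECS` are hypothetical names, timestamps are plain seconds passed in by the caller, and a `HashSet` stands in for the deduplicating insert into the results table.

```rust
use std::collections::HashSet;

/// Crates started longer ago than this may be handed out again
/// (20 minutes, per the description above).
const REASSIGN_AFTER_SECS: u64 = 20 * 60;

/// Hypothetical in-memory stand-in for one queued crate.
#[derive(Debug)]
struct QueuedCrate {
    name: String,
    /// Seconds (arbitrary epoch) when an agent last picked this crate up;
    /// `None` if it has never been started.
    started_at: Option<u64>,
    completed: bool,
}

/// Hand out up to `limit` crates that are incomplete and either never
/// started or started more than `REASSIGN_AFTER_SECS` ago. Each crate
/// handed out is marked as started at `now`.
fn dequeue(queue: &mut [QueuedCrate], now: u64, limit: usize) -> Vec<String> {
    let mut picked = Vec::new();
    for krate in queue.iter_mut() {
        if picked.len() == limit {
            break;
        }
        if krate.completed {
            continue;
        }
        let eligible = match krate.started_at {
            None => true,
            Some(started) => now.saturating_sub(started) > REASSIGN_AFTER_SECS,
        };
        if eligible {
            krate.started_at = Some(now);
            picked.push(krate.name.clone());
        }
    }
    picked
}

/// Record a result and mark the crate complete. Returns `false` when a
/// result for this crate was already present, i.e. the duplicate is dropped.
fn record_result(
    results: &mut HashSet<String>,
    queue: &mut [QueuedCrate],
    name: &str,
) -> bool {
    for krate in queue.iter_mut() {
        if krate.name == name {
            krate.completed = true;
        }
    }
    results.insert(name.to_string())
}
```

Note that no agent identity appears anywhere in the sketch: a crate whose agent vanished simply becomes eligible again once the 20-minute window passes, and a late result from a slow agent is harmless because the second insert is ignored.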