opensafely-core / job-server

A server for mediating jobs that can be run in an OpenSAFELY secure environment. q.v. job-runner
https://jobs.opensafely.org
Other
5 stars 11 forks source link

Out-of-order deploys #4607

Closed iaindillingham closed 1 month ago

iaindillingham commented 2 months ago

@benbc has noticed what look like out-of-order deploys. From Slack:^1

This CI run starts at 10:06 and restarts Job Server at 10:16. https://github.com/opensafely-core/job-server/actions/runs/10699532675

This CI run starts at 10:07 and restarts Job Server at 10:13. https://github.com/opensafely-core/job-server/actions/runs/10699535664

They were triggered by two PRs merged shortly one after the other. The merge order corresponds to the CI run start times, but the deploy times are the other way round because the first workflow took longer.

*   c1a8c3ef - Merge pull request #4580 from opensafely-core/dependabot/pip/ruff-0.6.3 (9 days ago) <Jon Massey>
|\  
| * 6e892cf0 - Bump ruff from 0.6.2 to 0.6.3 (11 days ago) <dependabot[bot]>
* |   01fa5232 - Merge pull request #4579 from opensafely-core/dependabot/pip/stamina-24.3.0 (9 days ago) <Jon Massey>
|\ \  
| |/  
|/|   
| * 0ca98853 - Bump stamina from 24.2.0 to 24.3.0 (11 days ago) <dependabot[bot]>
|/  
*
StevenMaude commented 2 months ago

We do have concurrency set on the deploy job in the main workflow, which prevents more than one of those jobs running at once.

https://github.com/opensafely-core/job-server/blob/6498030572305df8f5a6e32d107922653e819667/.github/workflows/main.yml#L148-L159

However, that will only prevent more than deploy job happening at once.

There is no concurrency set on the entire main workflow. So something like the following has happened:

  1. main run 1 starts first
  2. main run 2 starts second, a short while after run 1
  3. Tests in run 2 actually complete first, due to randomness in GitHub Actions (say, runner resource contention, CPU, disk or network…)
  4. deploy steps in main run 2 start
  5. Tests in run 1 complete second
  6. deploy steps in main run 1 wait until existing deploy finishes
  7. deploy steps in main run 2 complete, completing main run 2
  8. deploy steps in main run 1 start
  9. deploy steps in main run 1 complete, completing main run 1, out of order

I guess that the concurrency only actually applies when the job is actually ready to run, not blocking from the outset when the needs dependencies are being waited on.

StevenMaude commented 2 months ago

We're using the main workflow for our general CI. As it is, if we set concurrency on the entire main workflow as is, we'd only have one allowed workflow run for any merges or pull requests to this repository.

Maybe we could separate out test and deployment workflows, then the deployment workflow could:

That way, only one deployment (including tests) will run at once.

A variation could be: use cancel-in-progress for concurrency on the entirety of this separated out deployment workflow, which might speed things up if you launch a second deployment just after a first. Instead of waiting for the first to complete, we cancel it and run the second.

StevenMaude commented 1 month ago

I'm going to follow up on this and check what happens when merging several Dependabot PRs in quick succession, week commencing 2024-10-07.