jazzband / django-oauth-toolkit

OAuth2 goodies for the Djangonauts!
https://django-oauth-toolkit.readthedocs.io

GH actions long delay between finishing build job and starting success job #1376

Open n2ygk opened 9 months ago

n2ygk commented 9 months ago

Describe the bug

While watching multiple PRs after approving them, it appears to take a long time for the success job to start after the last step of the build job has finished. See #1219, where the separate success job was added to make the matrix easier to update and to have branch protection only ever depend on build finishing for tests to succeed.
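For context, the pattern under discussion looks roughly like this. This is a hedged sketch of the build + success arrangement, not the actual contents of `.github/workflows/test.yml` (job names and matrix entries here are illustrative):

```yaml
# Sketch of the pattern from #1219: a matrix "build" job plus a single
# "success" job that branch protection can require, so the matrix can
# change without touching the required-checks settings.
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.12"]   # illustrative versions only
    steps:
      - uses: actions/checkout@v4
      - run: tox
  success:
    # Does not start until every matrix leg of "build" has finished;
    # a fresh runner must then be allocated for this job.
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "all build matrix jobs passed"
```

Because `success` is a separate job, it queues for its own runner only after the whole `build` matrix completes, which is where the delay described below appears.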

To Reproduce

Cause a PR to run tests.

Expected behavior

I had no firm expectation, but I was hoping there wouldn't be a long wait before the success job starts.

Version

current master branch

Additional context

@dopry I'm guessing that GH allocates a runner for each job, so after the build job finishes, we wait for another runner to become available for the success job. This takes a while; see the timestamps below. So I'm guessing that a second job that depends on the first has to wait for a new runner to become available. Sometimes correlation is indicative of causation.

- Mon, 18 Dec 2023 17:59:18 GMT: last matrix step of the build job finished
- Mon, 18 Dec 2023 18:31:45 GMT: success job starts

While watching the PR, the success job status is "waiting on a runner". Here's a raw log excerpt showing the roughly 30-minute wait for a runner:

2023-12-18T17:59:50.5029369Z Requested labels: ubuntu-latest
2023-12-18T17:59:50.5029714Z Job defined at: jazzband/django-oauth-toolkit/.github/workflows/test.yml@refs/heads/pre-commit-ci-update-config
2023-12-18T17:59:50.5029846Z Waiting for a runner to pick up this job...
2023-12-18T18:31:40.3399714Z Job is waiting for a hosted runner to come online.
2023-12-18T18:31:42.6767234Z Job is about to start running on the hosted runner: GitHub Actions 7 (hosted)
...
dopry commented 9 months ago

You are correct in how you describe the behavior. We are probably also throttled a bit since we have such an intense job run. A runner is allocated for every build in the matrix. Maybe explicitly selecting a different runner class for the success job would get it allocated more quickly.

n2ygk commented 9 months ago

Yeah presumably these runners are all counted against the Jazzband org. Can we try this without having to bug @jezdez?

dopry commented 9 months ago

The reason we added the success job to the build process was so we wouldn't need @jezdez to intercede to change the success criteria of our builds, since we don't have settings access. We should be able to select the machine class by changing runs-on for the success job. Maybe we can get away without specifying it? I'm not sure what the default is...
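For what it's worth, as far as I know `runs-on` is required for a job on GitHub-hosted runners and has no default, so it can't simply be omitted. A minimal sketch of picking the runner class for the success job (label assumed, not taken from the actual workflow):

```yaml
  success:
    needs: build
    # runs-on is required; ubuntu-latest is the smallest standard
    # GitHub-hosted runner label. Swapping in a different label here
    # only changes the runner class, not the queueing behind the
    # org-wide concurrent job limit.
    runs-on: ubuntu-latest
    steps:
      - run: echo "build matrix succeeded"
```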

dopry commented 9 months ago

I think this is something we could open with GitHub support?

dopry commented 9 months ago

I assume we're waiting on the backlog of jazzband jobs and being slowed down by the concurrent job limit: https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration

dopry commented 9 months ago

Another option may be to go ahead and reduce our matrix by dropping Django 4.0 and 4.1, since they're no longer supported upstream. That should reduce our matrix by 10 jobs. Success still won't be enqueued until they're complete...
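A trimmed matrix might look something like this. The version lists and exclusions below are hypothetical placeholders, not the project's actual support matrix:

```yaml
    strategy:
      matrix:
        # Hypothetical trimmed matrix with Django 4.0/4.1 removed,
        # since they're out of upstream support.
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
        django-version: ["3.2", "4.2", "5.0"]
        exclude:
          # Illustrative: drop combinations a Django release doesn't
          # support, rather than enumerating every valid pair.
          - python-version: "3.8"
            django-version: "5.0"
          - python-version: "3.9"
            django-version: "5.0"
```

Fewer matrix legs means fewer runners consumed per PR, which should shorten the org-wide queue even though the success job still waits for the whole matrix.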

dopry commented 9 months ago

Alternatively, if @jezdez would give you, @n2ygk, or someone else on the team settings access to this repo, we could manage the branch protections ourselves and wouldn't need the success job, since we could update the required checks when needed.

dopry commented 9 months ago

@jezdez @n2ygk I fired off a request to GH support to increase the concurrent build limit for the jazzband organization.