Open marten-seemann opened 1 year ago
Sorry for the delay. I was finally able to track down what's going on here. It turns out this happens when the machine boots correctly and is added to GitHub, but GitHub then doesn't respond to us :( I don't know yet how to fix it; we might have to add some sort of retry mechanism. For now, I created an alert for myself so that I'm at least aware when it happens: https://github.com/pl-strflt/tf-aws-gh-runner/pull/31. I'll also reach out to GitHub support for guidance. Maybe they'll be able to tell me more about what exactly is going on there.
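To make the "retry mechanism" idea a bit more concrete, here is a minimal sketch of retrying with capped exponential backoff and jitter. It is not what the runner setup currently does; `retry_with_backoff`, its parameters, and the default values are all hypothetical, and the operation being retried (whatever call GitHub fails to answer) is left as a plain callable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,      # hypothetical default
    base_delay: float = 1.0,    # seconds, hypothetical default
    max_delay: float = 60.0,    # seconds, hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # that many machines retrying at once don't line up.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. In real use, the caught exception type should probably be narrowed to timeouts and connection errors rather than every `Exception`.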
I suspect that we sometimes see download errors in the Set up job step because of the same underlying issue:
Current runner version: '2.305.0'
Runner name: ''
Runner group name: 'Default'
Machine name: ''
##[group]Operating System
Distribution: Ubuntu 22.04.2 LTS
Architecture: x86_64
##[endgroup]
##[group]Runner Image
AMI id: ''
##[endgroup]
##[group]GITHUB_TOKEN Permissions
Actions: read
Checks: read
Contents: read
Deployments: read
Discussions: read
Issues: read
Metadata: read
Packages: read
Pages: read
PullRequests: read
RepositoryProjects: read
SecurityEvents: read
Statuses: read
##[endgroup]
Secret source: None
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@v3' (SHA:c85c95e3d7251135ab7dc9ce3241c5835cc595a9)
Download action repository 'r7kamura/rust-problem-matchers@d58b70c4a13c4866d96436315da451d8106f8f08' (SHA:d58b70c4a13c4866d96436315da451d8106f8f08)
##[warning]Failed to download action 'https://api.github.com/repos/r7kamura/rust-problem-matchers/tarball/d58b70c4a13c4866d96436315da451d8106f8f08'. Error: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
##[warning]Back off 25.067 seconds before retry.
##[warning]Failed to download action 'https://api.github.com/repos/r7kamura/rust-problem-matchers/tarball/d58b70c4a13c4866d96436315da451d8106f8f08'. Error: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
##[warning]Back off 23.198 seconds before retry.
Download action repository 'dtolnay/rust-toolchain@stable' (SHA:4f366e621dc8fa63f557ca04b8f4361824a35a45)
Download action repository 'Swatinem/rust-cache@dd05243424bd5c0e585e4b55eb2d7615cdd32f1f' (SHA:dd05243424bd5c0e585e4b55eb2d7615cdd32f1f)
Download action repository 'taiki-e/cache-cargo-install-action@924d49e0af41f449f0ad549559bc608ee4653562' (SHA:924d49e0af41f449f0ad549559bc608ee4653562)
Getting action download info
Download action repository 'actions/cache@v3' (SHA:88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8)
Complete job name: Test
BTW, I know it's not ideal, but as a quick way out of the infinite wait, cancelling the workflow run and then re-running the failed jobs should work.
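If clicking through the UI gets tedious, the same workaround can be scripted against the GitHub REST API, which has `cancel` and `rerun-failed-jobs` endpoints for workflow runs. A rough sketch follows; the helper names `build_request` and `cancel_and_rerun` are made up for illustration, and it assumes a token that is allowed to write to Actions on the repository.

```python
import urllib.request

API_ROOT = "https://api.github.com"
ALLOWED_ACTIONS = ("cancel", "rerun-failed-jobs")


def build_request(repo: str, run_id: int, action: str, token: str) -> urllib.request.Request:
    """Build the POST request for one workflow-run action (hypothetical helper)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    url = f"{API_ROOT}/repos/{repo}/actions/runs/{run_id}/{action}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def cancel_and_rerun(repo: str, run_id: int, token: str) -> None:
    """Cancel the run, then ask GitHub to re-run its failed jobs."""
    for action in ALLOWED_ACTIONS:
        request = build_request(repo, run_id, action, token)
        with urllib.request.urlopen(request, timeout=30) as response:
            print(action, response.status)


# Example call (needs network access and a real token):
# cancel_and_rerun("quic-go/quic-go", 5331576697, token="<your token>")
```

One caveat: cancellation is asynchronous, so the re-run request may be rejected until the run has actually finished cancelling. A short wait (or polling the run's status) between the two calls may be needed.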
This doesn't happen very often, but this is the 2nd or 3rd time I've run into this issue: a job is not picked up by a runner, even after waiting for 15 minutes or more. I'm not sure whether that's because no machine is getting booted, because booting fails, or something else.
Here's the run that didn't get picked up: https://github.com/quic-go/quic-go/actions/runs/5331576697/jobs/9660559692?pr=3908 (Not sure if this link still works after I restart the job).
Here's a screenshot: