cisco-sbg-mgiassa-ai opened 2 weeks ago
The control-plane component that scales the runners differentiates between scaling errors and other errors. Scaling errors trigger a retry by pushing the message back to SQS. The list of errors that trigger a retry is hard-coded (see https://github.com/philips-labs/terraform-aws-github-runner/blob/e59885a2b66f7afa7a36c3583f663c4d52973459/lambdas/functions/control-plane/src/aws/runners.ts#L170-L181).
AddressLimitExceeded seems to be the error occurring in your case. A small code update could fix this, but it would be better to make this list configuration-driven.
Feel free to raise a PR to update the list, or even improve the code further. Also, no explicit back-off is implemented. See also https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-errorhandling.html
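A configuration-driven version of that check could look roughly like the sketch below. This is only an illustration: the function names, the `configured` parameter, and the default error codes shown here are assumptions for this sketch, not the actual contents of `runners.ts` (the real hard-coded list is at the link above).

```typescript
// Illustrative default list of retryable scaling error codes.
// 'AddressLimitExceeded' is the error reported in this issue; the
// other entries are examples, not the module's actual list.
const DEFAULT_RETRYABLE_ERRORS: string[] = [
  'InsufficientInstanceCapacity',
  'MaxSpotInstanceCountExceeded',
  'AddressLimitExceeded',
];

// Parse an optional comma-separated override (e.g. sourced from a
// Terraform variable or environment variable); fall back to defaults.
function getRetryableErrors(configured?: string): string[] {
  return configured
    ? configured
        .split(',')
        .map((e) => e.trim())
        .filter((e) => e.length > 0)
    : DEFAULT_RETRYABLE_ERRORS;
}

// Decide whether a failed scale-up should be pushed back onto SQS.
function isRetryableScalingError(code: string, configured?: string): boolean {
  return getRetryableErrors(configured).includes(code);
}
```

With this shape, operators hitting a new capacity-style error could opt in without a code change by setting the override list instead of waiting for a release.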
Good day,
I have a multi-tenanted CI/CD system that uses multi-runner to handle runner management, standby/warm-up pools, etc. (and it works very well, by the way 😄 ). I'm also using fairly up-to-date code (i.e. v5.15.2 of this project), along with up-to-date GHA actions/runner agent/tools.

I have a set of runners that use a shared/multi-team subnet in AWS. There are occasions where a tenant over-commits runners and exhausts the IPv4 address space supplied by the subnet. For most operations, GHA queues up/serializes jobs nicely. For example, if some runner type has an upper limit of 30 instances and 60 jobs are queued, all of the jobs eventually run to successful completion. One semi-related corner case where this doesn't happen, however, is when a runner fails to start due to insufficient space in the subnet used to launch it (i.e. "insufficient IP space available").
Is there some mechanism/feature-flag/etc. that exists (or that could reasonably be implemented) to provide a "retry this scale-up operation with a back-off timer to prevent API spam/overload" behavior? It would be preferable to have the job eventually get queued up, even if it means waiting for a progressively lengthier duration, versus having a job stuck in the "wait for a runner" state for 18 hours (as a specific example). In this contrived example, it'd be desirable if the job eventually were queued up (say, during low-usage periods overnight when there's ample capacity) instead of requiring user interaction (or a CI bot to auto-cancel "stuck" jobs).
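For what it's worth, the "progressively lengthier duration" part could be approximated with SQS's per-message delay, which caps at 900 seconds (15 minutes). A minimal sketch, assuming a retry counter is carried on the re-queued message as an attribute (that attribute is an assumption here, not something the project currently does):

```typescript
// SQS SendMessage DelaySeconds is capped at 900 seconds (15 minutes).
const SQS_MAX_DELAY_SECONDS = 900;

// Exponential back-off for a re-queued scale-up message:
// 30s, 60s, 120s, ... capped at the SQS maximum.
// retryCount would come from a message attribute incremented on
// each re-queue (hypothetical; named here for illustration only).
function backoffDelaySeconds(retryCount: number, baseSeconds = 30): number {
  return Math.min(baseSeconds * 2 ** retryCount, SQS_MAX_DELAY_SECONDS);
}
```

Longer waits than 15 minutes (e.g. "retry overnight") would need something beyond plain `DelaySeconds`, such as repeated re-queues or a step-function-style scheduler, but even the capped delay would avoid hammering the EC2 API when a subnet is exhausted.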
Besides this quirk, this is an awesome project/tool that has been extremely helpful/useful. Cheers!