philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License

multi-runner: re-try scale-up operation for runner-type if it fails due to insufficient IPv4 addresses in subnet #4105

Open · cisco-sbg-mgiassa-ai opened this issue 2 weeks ago

cisco-sbg-mgiassa-ai commented 2 weeks ago

Good day,

I have a multi-tenanted CI/CD system that uses multi-runner to handle runner management, standby/warm-up pools, and so on (and it works very well, by the way 😄). I'm also using fairly up-to-date code (v5.15.2 of this project), along with up-to-date GHA actions/runner agent/tools.

I have a set of runners that use a shared/multi-team subnet in AWS. There are occasions where a tenant over-commits runners and exhausts the IPv4 address space supplied by the subnet. For most operations, GHA queues up/serializes jobs nicely. For example, if some runner type has an upper limit of 30 instances and 60 jobs are queued up, all of the jobs eventually run to successful completion. One semi-related corner case where this doesn't happen, however, is when a runner fails to start due to insufficient IPv4 addresses in the subnet being used to launch it (i.e. "insufficient IP space available").

Is there some mechanism/feature flag (existing, or that could reasonably be implemented) to "re-try this scale-up operation with a back-off timer to prevent API spam/overload"? It would be preferable for the job to eventually get a runner, even if that means waiting a progressively longer time, versus having a job stuck in the "waiting for a runner" state for 18 hours (as a specific example). In that contrived case, it'd be desirable for the job to eventually be picked up (say, during low-usage periods overnight when there's ample capacity) instead of requiring user interaction (or a CI bot to auto-cancel "stuck" jobs).

Besides this quirk, this is an awesome project/tool that has been extremely helpful/useful. Cheers!

npalm commented 1 week ago

The control plane part that scales the runners differentiates between scaling errors and other errors. Scaling errors trigger a retry by pushing the message back to SQS. The list of errors that trigger the retry is hard-coded (see https://github.com/philips-labs/terraform-aws-github-runner/blob/e59885a2b66f7afa7a36c3583f663c4d52973459/lambdas/functions/control-plane/src/aws/runners.ts#L170-L181).

AddressLimitExceeded seems to be the error that is happening in your case. A small code update could fix this, but it would be better if we made this list configuration driven.
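
As a rough illustration of the pattern (not the actual contents of runners.ts, and the error-code strings below are placeholders rather than the module's real list), the change would amount to appending the relevant EC2 error code to the hard-coded list that the create-runner error handler checks:

```typescript
// Illustrative sketch only -- the real list lives at the linked lines in
// lambdas/functions/control-plane/src/aws/runners.ts. The specific codes
// shown here are examples; the code seen in CloudWatch for the subnet
// exhaustion case would need to be confirmed before adding it.
const scaleErrors = [
  'UnfulfillableCapacity',
  'MaxSpotInstanceCountExceeded',
  'RequestLimitExceeded',
  'InsufficientInstanceCapacity',
  // Candidate addition for IPv4 exhaustion in the subnet:
  'InsufficientFreeAddressesInSubnet',
];

function isScaleError(error: unknown): boolean {
  // AWS SDK v3 errors expose the service error code as `name`. Errors whose
  // code is in the list are treated as scaling errors and the message is
  // pushed back to SQS for a retry instead of being dropped.
  const code = (error as { name?: string })?.name ?? '';
  return scaleErrors.includes(code);
}
```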

Feel free to raise a PR to update the list, or even improve the code further. Also, no explicit back-off is implemented. See also https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-errorhandling.html
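
If a back-off were added on top of the existing "push the message back to SQS" retry, one possible shape is to re-queue with a growing DelaySeconds and carry an attempt counter in a message attribute. This is only a sketch of the idea, not code that exists in the module today; the function and attribute names are hypothetical:

```typescript
// Hedged sketch: exponential back-off layered on the SQS re-queue approach.
// `requeueWithBackoff` and the `attempt` attribute are hypothetical names.
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

async function requeueWithBackoff(queueUrl: string, body: string, attempt: number): Promise<void> {
  // Exponential back-off, capped at the SQS maximum DelaySeconds of 900 (15 minutes).
  const delaySeconds = Math.min(30 * 2 ** attempt, 900);
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: body,
      DelaySeconds: delaySeconds,
      // Carry the attempt counter so the next failure can back off further.
      MessageAttributes: {
        attempt: { DataType: 'Number', StringValue: String(attempt + 1) },
      },
    }),
  );
}
```

Alternatively, much of this can be achieved without code changes by tuning the queue itself (visibility timeout, maxReceiveCount, and a dead-letter queue), as described in the Lambda/SQS error-handling documentation linked above.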