philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License

feat: scale up for long waiting jobs (job retry) #4064

Closed npalm closed 3 months ago

npalm commented 3 months ago

Description

This feature adds the capability to retry scaling up a runner when a job is still queued after a defined delay. It is intended to avoid the need for a pool when using ephemeral runners.

Implementation

The module is extended with configuration to optionally enable one or more retries. Once enabled, the scale-up lambda publishes the same message it receives, extended with a counter, on a retry job queue with a delay. A new lambda picks the message up from this queue and checks whether the job is still queued (via the GitHub API). If it is still queued, the message is published again on the job queue, which is the incoming queue of the scale-up lambda.
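The flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the module's actual lambda code: the names `JobMessage`, `retry_counter`, `scale_up`, `retry_check`, `max_retries`, and `is_job_queued` are all hypothetical, plain deques stand in for the two queues, and the delivery delay on the retry queue is not modelled. It also assumes that retries stop once the counter reaches the configured maximum.

```python
from collections import deque
from dataclasses import dataclass, replace
from typing import Callable, Deque


@dataclass(frozen=True)
class JobMessage:
    job_id: int
    retry_counter: int = 0  # hypothetical field: number of retries so far


def scale_up(msg: JobMessage, retry_queue: Deque[JobMessage], max_retries: int) -> None:
    """Stand-in for the scale-up lambda (runner creation itself is omitted)."""
    # Assumption: a retry check is only scheduled while retries remain.
    if msg.retry_counter < max_retries:
        # Publish the same message, extended with an incremented counter.
        retry_queue.append(replace(msg, retry_counter=msg.retry_counter + 1))


def retry_check(
    retry_queue: Deque[JobMessage],
    job_queue: Deque[JobMessage],
    is_job_queued: Callable[[int], bool],
) -> None:
    """Stand-in for the new lambda that re-checks one delayed message."""
    if not retry_queue:
        return
    msg = retry_queue.popleft()
    # is_job_queued represents the GitHub API lookup of the job status.
    if is_job_queued(msg.job_id):
        # Still waiting: hand the message back to the scale-up lambda's queue.
        job_queue.append(msg)
```

Under these assumptions, a job that stays queued triggers the initial scale-up plus at most `max_retries` further attempts, while a job that has already been picked up is dropped at the first check.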

Consequences

Testing

Testing can be done as follows

Tasks

mariusfilipowski commented 2 months ago

Thanks for this interesting option. 1) What do you mean by small fleets? 2) I assume these rate limits won't apply to GHES, will they? 3) Can this also be used with idle/pooled runners, or do they have to be turned off? We have certain problems at night, when we set the pool to 0, and we would expect this approach to give us more stability.