feat: scale up for long waiting jobs (job retry)

Description

This feature adds the capability to retry scaling up a runner when a job is still queued after a defined delay. It is added to avoid the need for a pool when using ephemeral runners.

Implementation

The module is extended with configuration to optionally enable one or more retries. Once enabled, the scale-up lambda publishes the same message it received, extended with a retry counter, on a retry job queue with a delay. A new lambda picks the message from this queue and checks whether the job is still queued (via the GitHub API). If the job is still queued, the message is published again on the job queue, which is the incoming queue of the scale-up lambda.
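The retry flow described above can be sketched roughly as follows. This is an illustration in Python, not the lambda code from this PR: the names `JobMessage`, `RetryConfig`, `retry_counter`, `max_attempts`, `delay_seconds` and `is_job_queued` are assumptions made for the example, and the queues and the GitHub API call are replaced by plain return values and a callback.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class JobMessage:
    """Simplified stand-in for the message the scale-up lambda receives."""
    job_id: int
    repository: str
    retry_counter: int = 0  # hypothetical name for the counter added on retry


@dataclass
class RetryConfig:
    """Hypothetical retry settings; the real variable names may differ."""
    enabled: bool = True
    max_attempts: int = 1
    delay_seconds: int = 300


def schedule_retry(
    message: JobMessage, config: RetryConfig
) -> Optional[Tuple[JobMessage, int]]:
    """Scale-up side: build the message for the retry job queue.

    Returns the message with an incremented counter plus the delay to apply,
    or None when retries are disabled or the attempts are used up.
    """
    if not config.enabled or message.retry_counter >= config.max_attempts:
        return None
    retry_message = replace(message, retry_counter=message.retry_counter + 1)
    return retry_message, config.delay_seconds


def handle_retry(
    message: JobMessage, is_job_queued: Callable[[JobMessage], bool]
) -> Optional[JobMessage]:
    """Retry lambda side: republish only when the job is still queued.

    `is_job_queued` stands in for the GitHub API check. The returned message
    is what would go back onto the job queue of the scale-up lambda.
    """
    return message if is_job_queued(message) else None
```

Carrying the counter in the message itself keeps both lambdas stateless: the scale-up side only needs the incoming message and the configuration to decide whether another retry is allowed.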

Consequences

This feature is meant for small fleets with ephemeral runners. Each retry check causes a GitHub API call, which can trigger a rate limit for the app.
This feature should make ephemeral runners more responsive without needing a pool to pick up missed jobs.
The module allows you to force a job check before scaling. This check should be disabled.
The delay should be set to a value higher than the normal boot time of a runner.
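As a rough aid for choosing the delay, the two constraints mentioned in this PR (longer than the runner boot time, and no longer than the maximum delay of a queue) can be checked as below. This is a sketch with assumed names (`check_retry_delay`, `expected_boot_seconds`); the 900 second limit is the SQS maximum delay of 15 minutes, and the module's own validation may be implemented differently.

```python
from typing import List

# SQS allows a delivery delay of at most 900 seconds (15 minutes).
SQS_MAX_DELAY_SECONDS = 900


def check_retry_delay(delay_seconds: int, expected_boot_seconds: int) -> List[str]:
    """Return a list of problems with the chosen retry delay (empty if fine)."""
    problems: List[str] = []
    if not 0 <= delay_seconds <= SQS_MAX_DELAY_SECONDS:
        problems.append(
            f"delay must be between 0 and {SQS_MAX_DELAY_SECONDS} seconds"
        )
    if delay_seconds <= expected_boot_seconds:
        # A retry check that fires before a runner can boot would see the job
        # as still queued and trigger an unnecessary extra scale-up.
        problems.append("delay should be higher than the normal runner boot time")
    return problems
```

For example, `check_retry_delay(300, 120)` returns an empty list, while a delay of 60 seconds with a 120 second boot time is flagged.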

Testing

Testing can be done as follows:

Trigger a workflow
Terminate the created instance before the job starts
Wait. After the delay, the retry job should publish the message again, which triggers the creation of a new instance.
[x] Multi runners.
[x] Default runners (not enabled by default, requires a configuration update)

Tasks

[x] Update docs
[x] Update multi-runner
[x] Check CMK keys for SQS
[x] Limit delay to max delay of a queue.
[x] Add optional metric for retry
[x] Update issue with more details

philips-labs / terraform-aws-github-runner