philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/

[question] Scale-down via webhooks, Webhook durability #2689

Closed · foragerr closed this 1 year ago

foragerr commented 1 year ago

This project looks great - especially the upcoming 2.0 changes! Kudos maintainers!

I'm currently running self-hosted runners through an ASG that scales up and down via cron triggers. It isn't exactly "auto-scaled" but meets our needs at the moment. I'm looking to move to a purely ephemeral setup, where runners are created on demand for each job and terminated once the job completes.

I have a couple of questions:

  1. The docs (on both main and the v2.0.0 tag) say:

    • Scaling down the runners is at the moment brute-forced, every configurable amount of minutes a lambda will check every runner (instance) if it is busy.

    • Is this still true in 2.0? Is there any desire to implement termination based on the workflow_job completed event?

  2. Looking at the architecture diagram, it looks like once a webhook event makes it into the SQS queue, it is guaranteed to be processed at least once. But is it possible to lose a webhook event before it is put into the queue (transient network blips, or a webhook lambda failure)? Is this a valid reading? Has anybody had issues with this in practice?

I really appreciate your attention!

npalm commented 1 year ago

Quick answers

  1. Scale-down does indeed terminate instances brute force, but only instances that are not busy are terminated. When running ephemeral runners this is in general not needed, but it also does not cause problems: ephemeral runners terminate themselves once the job completes. Hooking into other events is complex, since we do not control which runner picks up which job. (A minimal configuration sketch follows after this list.)

  2. Since HTTP delivery is not reliable, there is no guarantee. But every message that arrives and is accepted is put on a queue. The scale-up lambda tries to process the messages; if an error occurs that falls into a category it can auto-recover from, such as an API rate limit, the message is sent back to the queue. This all happens within certain maximum time windows, and most of it is configurable; the sketch below shows the relevant knobs.
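
For reference, a minimal sketch of how these pieces are typically wired up. The variable names (`enable_ephemeral_runners`, `scale_down_schedule_expression`, `delay_webhook_event`, `redrive_build_queue`, etc.) are from memory, so treat them as assumptions and double-check them against the module's variables documentation for your version:

```hcl
module "runners" {
  source = "philips-labs/github-runner/aws"

  # ... GitHub App, VPC, and provider settings omitted ...

  # Each runner deregisters and terminates itself after a single job.
  enable_ephemeral_runners = true

  # The scale-down lambda still runs on a schedule, but it only terminates
  # instances that are not busy (e.g. orphaned or never-picked-up runners).
  scale_down_schedule_expression  = "cron(*/5 * * * ? *)"
  minimum_running_time_in_minutes = 5

  # Delay (in seconds) before the scale-up lambda processes a queued event.
  delay_webhook_event = 30

  # How long unprocessed events stay on the queue before they are dropped.
  job_queue_retention_in_seconds = 86400

  # Send events that keep failing to a dead-letter queue instead of losing
  # them once retries are exhausted.
  redrive_build_queue = {
    enabled         = true
    maxReceiveCount = 5
  }
}
```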

Besides that, you can set up a pool; in that case ephemeral instances are created based on a cron expression. We create a few every now and then via cron to ensure all jobs get processed; see the sketch below.
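
A rough sketch of the pool configuration (again, names like `pool_runner_owner` and `pool_config` are as I recall them, and the org name is hypothetical; verify against the docs):

```hcl
  # Inside the same module "runners" block as above.

  # Keep a small warm pool of ephemeral runners, topped up by cron, so jobs
  # whose webhook delivery was lost still get picked up.
  pool_runner_owner = "my-github-org" # hypothetical org name
  pool_config = [{
    schedule_expression = "cron(*/15 * * * ? *)" # top up every 15 minutes
    size                = 2
  }]
```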

foragerr commented 1 year ago

Thank you for the detailed response!