philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License

Feature request: stop instead of terminate when scaling down #4033

Open nap opened 3 months ago

nap commented 3 months ago

First off, thanks to everybody who contributes to and maintains this project. I really appreciate how reliable it has been. Great work is being done here. It's super helpful to us and it has saved us many hours of development.

I was wondering if any discovery has been done around stopping runners instead of terminating them?

In our use case, we build a custom AMI and keep up to 50 runners idle during working hours. We configured the module so we can scale up to 250 instances, but scaling up to 250 can be slow. Starting an instance from scratch is slow because the EBS volume needs to be provisioned from the AMI, along with other AWS-specific setup such as networking.

I recently ran some tests and saw that stopping and starting instances, instead of terminating them, yields interesting results: it cuts start-up time by more than half.

I am wondering if it's a reasonable feature request to:

  1. if inside idle_config.cron
  2. scale up to runners_maximum_count
  3. then stop runners_maximum_count - idleCount instances
  4. then scale up or down using start/stop instead of terminate
  5. if outside idle_config.cron terminate instances
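
The steps above could be sketched roughly as a pure planning function. This is only an illustration of the proposal, not code from the module: `PoolPlan`, `plan_pool`, and `wanted_running` (busy runners plus the idle count) are hypothetical names, and it assumes newly launched instances come up in the running state and that, outside the idle window, stopped instances and surplus running ones are terminated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPlan:
    launch: int     # brand-new instances to create
    start: int      # stopped instances to start again
    stop: int       # running instances to stop (EBS volume is kept)
    terminate: int  # instances to terminate for good


def plan_pool(in_idle_window: bool, running: int, stopped: int,
              wanted_running: int, runners_maximum_count: int) -> PoolPlan:
    """Hypothetical planner for the stop/start proposal (steps 1 to 5)."""
    target = min(wanted_running, runners_maximum_count)

    if not in_idle_window:
        # Step 5: outside idle_config.cron, drop the warm pool and scale
        # with launch/terminate only (assumed to match today's behaviour).
        surplus = max(running - target, 0)
        shortfall = max(target - running, 0)
        return PoolPlan(launch=shortfall, start=0, stop=0,
                        terminate=stopped + surplus)

    # Step 2: make sure the pool holds runners_maximum_count instances.
    launch = max(runners_maximum_count - (running + stopped), 0)
    running_after_launch = running + launch

    # Steps 3 and 4: reach the target with stop/start, never terminate.
    stop = max(running_after_launch - target, 0)
    start = min(max(target - running_after_launch, 0), stopped)
    return PoolPlan(launch=launch, start=start, stop=stop, terminate=0)
```

With an empty pool at the start of the idle window, `plan_pool(True, 0, 0, 50, 250)` asks for 250 launches followed by 200 stops, which leaves 50 running and 200 stopped instances ready to be started.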

This would mean an additional cost for EBS volumes, but no added cost for compute, since stopped instances are not billed for instance usage.

Here's an article that also talks about speeding up the start-up time of instances. Thanks!

npalm commented 3 months ago

In our deployment we use only ephemeral runners, which means we do not reuse any instance, for security reasons. The penalty is indeed the boot time. We are able to keep the boot time under a minute by using pre-built AMIs. Our scale is up to about 400 instances at most, with a small pool (about 10) to ensure most jobs start fast.

When the module was created (first version), ephemeral runners were not supported. Initially the module was developed around short-lived VMs to avoid issues like full disks and memory exhaustion, and to limit the security risk for org-level runners (a shared environment).

A PR to allow stopping / starting instances, including a safeguard that state is wiped, could be interesting. For instance store volumes this is safeguarded by AWS. For EBS the state is persisted.

nap commented 2 months ago

Yes, reusing an instance's EBS volume definitely conflicts with your mode of operation. This can be alleviated by running jobs in a container. For us, since we do not have the same security concerns, it's not a challenge.

I could see a maximum_running_time_in_minutes setting being useful to limit instance reuse. When an instance becomes idle and maximum_running_time_in_minutes has been reached, it gets terminated. A new instance then gets started to pick up a job, or gets stopped so it can be started later when needed.
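
As a rough sketch of that rule: the function below decides what to do with an instance that has just become idle. `action_for_idle_instance` is a hypothetical helper, `maximum_running_time_in_minutes` is the setting proposed in this comment rather than something the thread confirms the module has, and the instance age is assumed to be measured from its original launch time.

```python
from datetime import datetime, timedelta, timezone


def action_for_idle_instance(launched_at: datetime, now: datetime,
                             maximum_running_time_in_minutes: int) -> str:
    """Return "terminate" once the reuse limit is reached, else "stop"."""
    age = now - launched_at
    if age >= timedelta(minutes=maximum_running_time_in_minutes):
        # Too old to reuse: terminate it; a fresh instance replaces it.
        return "terminate"
    # Still within the limit: stop it so it can be started again later.
    return "stop"


launched = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
checked = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # 240 minutes later
print(action_for_idle_instance(launched, checked, 480))  # still reusable
print(action_for_idle_instance(launched, checked, 240))  # limit reached
```

Only evaluating the limit when an instance goes idle, as described above, means a running job is never interrupted by it.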

Thanks for your interest in this proposal @npalm