machulav / ec2-github-runner

On-demand self-hosted AWS EC2 runner for GitHub Actions
MIT License

Make it possible to re-use active runners for a few workflow runs #4

Open machulav opened 3 years ago

machulav commented 3 years ago

Notes

vroad commented 3 years ago

How could this be implemented? Something like Cluster Autoscaler?

https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html

To make this work the way Cluster Autoscaler does, you need to set up Auto Scaling groups and a serverless app that terminates idle nodes.

Or you could create a CloudWatch alarm that scales out the ASG when an SQS queue has pending messages, and scales in once the queue has been empty for a while.

A non-FIFO (standard) queue could deliver the same message twice, so a FIFO queue would work better.
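
As a rough illustration of the scale-out/scale-in rule described above, here is a minimal sketch in plain Python. It only models the decision, not the AWS calls: `QueueSample`, `desired_capacity`, and the `scale_in_after` threshold are hypothetical names chosen for this example, and the "one runner per pending job" policy is an assumption, not something this action implements.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One periodic reading of the job queue (hypothetical shape)."""
    timestamp: float       # seconds since epoch
    pending_messages: int  # number of jobs waiting in the queue


def desired_capacity(samples, current, max_size, scale_in_after=300.0):
    """Return the runner count the pool should aim for.

    Scale out as soon as jobs are pending; scale in to zero only after
    the queue has stayed empty for at least `scale_in_after` seconds.
    """
    if not samples:
        return current
    latest = samples[-1]
    if latest.pending_messages > 0:
        # Assumed policy: one runner per pending job, capped at max_size.
        return min(max_size, max(current, latest.pending_messages))
    # Queue is empty now: find the start of the current empty streak.
    empty_since = latest.timestamp
    for sample in reversed(samples):
        if sample.pending_messages > 0:
            break
        empty_since = sample.timestamp
    if latest.timestamp - empty_since >= scale_in_after:
        return 0
    return current
```

The delay before scaling in is what keeps a runner alive between closely spaced workflow runs, which is the point of this issue.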

vroad commented 3 years ago

Is calling this action in stop mode no longer required if we use those methods?

If we create a Lambda function that periodically watches the runners, the stop action becomes unnecessary. An SQS message's retention period can be as short as 60 seconds, so we could use that to empty the queue, but setting it too short might terminate the instance too early. Or we could consume the message manually and use the retention period as a fallback in case stop fails?

machulav commented 3 years ago

@vroad thank you for your ideas!

I thought about a slightly different solution:

This way, you should gain the following benefits:

Does it make sense?

vroad commented 3 years ago
  • When the action starts a new EC2 instance, some special code can be run, which monitors the active processes on the EC2 instance. If the EC2 instance is idle for longer than some specified time, it can terminate itself and deregister the self-hosted runner on GitHub. However, this is the most unclear thing in this solution and should be verified properly.

To reliably stop idle instances, the monitoring program should run outside the instance. Otherwise, if the instance becomes unresponsive for some reason, it won't terminate.
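
A minimal sketch of what such an external monitor might decide on each pass, assuming it can obtain each runner's busy flag and the time it last became idle from somewhere outside the instance. `RunnerState`, `idle_runners`, and `max_idle_seconds` are hypothetical names for illustration only; the actual termination and deregistration calls are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunnerState:
    """What the external monitor knows about one runner (assumed shape)."""
    instance_id: str
    busy: bool
    idle_since: Optional[float]  # epoch seconds; None if never observed idle


def idle_runners(runners, now, max_idle_seconds):
    """Return IDs of runners that have been idle longer than allowed."""
    expired = []
    for runner in runners:
        if runner.busy or runner.idle_since is None:
            # Busy runners, and runners with no idle timestamp, are kept.
            continue
        if now - runner.idle_since > max_idle_seconds:
            expired.append(runner.instance_id)
    return expired
```

Because the check runs outside the instance, it still works when the instance itself cannot run any code.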

If the instance is in an ASG, unhealthy instances get terminated and new instances come up, as long as the desired capacity is greater than 0.

AWS doesn't always mark an unresponsive instance as unhealthy, though. To stop such instances you'll need a custom health check. To save cost we can't keep an ALB running, perhaps? So the only option left is a custom Lambda-based health check: instances that do not report their status correctly should be terminated. This could be done without an ASG, but then no replacement instances come up.
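
The core of such a health check could look roughly like this, assuming each instance periodically writes a heartbeat timestamp somewhere the check can read. `unhealthy_instances`, `last_heartbeat`, and `timeout_seconds` are hypothetical names, and the storage and termination steps are deliberately omitted.

```python
def unhealthy_instances(last_heartbeat, known_instances, now, timeout_seconds):
    """Return sorted IDs of instances whose heartbeat is missing or stale.

    last_heartbeat: dict mapping instance ID to the epoch seconds of its
    most recent report. known_instances: IDs the pool expects to be running.
    """
    stale = []
    for instance_id in known_instances:
        seen = last_heartbeat.get(instance_id)
        # An instance that never reported is treated like a stale one.
        if seen is None or now - seen > timeout_seconds:
            stale.append(instance_id)
    return sorted(stale)
```

A real check would probably also need a grace period for freshly launched instances that have not had time to report yet.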

jpalomaki commented 3 years ago

Just a random thought: would it be possible to use a mix of scheduled and workflow-run event-triggered GitHub workflows to manage the pool of self-hosted runners (using ec2-github-runner action to start/stop them)?