options to avoid terminating/spinning up new instances

Srikanth1992 commented 3 years ago

Hello @npalm @gertjanmaas @mcaulifn

I have changed the instance_initiated_shutdown_behavior to stop( https://github.com/philips-labs/terraform-aws-github-runner/blob/master/modules/runners/main.tf#L64) and created all the resources and configured the idle config for our runner instance.

So my github runner configuration has the following values: idle_config = [{ cron = " 9-17 *" timeZone = "America/New_York" idleCount = 2 }]

minimum_running_time_in_minutes = "35"
Maximum runner instance is set to 3

I have triggered like 30 github actions and scale up lambda created 3 ec2 instances and completed all 30 github actions.

since I had IdleCount as 2 After 35 minutes the 3rd ec2 instance should have stopped based on the instance_initiated_shutdown_behavior values, but instead that 3rd ec2 instance got terminated.

Whenever so many github actions are created, looking for options to avoid terminating/spinning up new instances.
Do we have any way to stop/start the runner instance instead of terminating them whenever more github actions are triggered?
Is there any way to stop/start github runner instance ?

Please let me know if you need more information.

gertjanmaas commented 3 years ago

Hi @Srikanth1992

This module does not support starting/stopping of EC2 instances and runners. Implementing this would add a number of new issues (e.g. what should happen when the EC2 disk is full). Just terminating and creating a new instances saves a lot of trouble.

If you have reasons that require starting/stopping instances please share.

Srikanth1992 commented 3 years ago

Hi @gertjanmaas

Thank you for quick response.

The purpose of asking stop/start of runner instance is because our runners instances needs to use our own deploy script since we are using custom AMI and it's taking 15min to full provision a instance.

For example, I have idle config set as morning 8 am to 20pm, and idle count as 4 runner instances if any developer triggers github actions at 8 am, it's taking almost 20minutes to get the instance ready with github runner software and started finishing the github actions and those 4 runners are in running mode from 8 am to 8pm.

Let's say if there are no actions triggered around 14pm - 16pm(2-4),these runner instance are idle right, we want to stop & start them via automation to implement some cost savings.

Let me know if you need more information

Thanks, Srikanth

mcaulifn commented 3 years ago

@Srikanth1992 We build a custom AMI so that launch times are quick. I would suggest looking at Packer.

Srikanth1992 commented 3 years ago

Hi @mcaulifn, @npalm @gertjanmaas

Thank you for quick response.

we build our custom AMI too, but when we use that custom AMI to launch instance , we need to pass our deploy script(this script includes few steps like adding the instance to realmd, installing splunk forwarder, installing newrelic infra agent etc) while launching a instance. This would take 10-12min to finish.

My issue here is If I don't use idle config, then every time developer triggers actions, it creates a runner instance and the github action has wait till the ec2 instance is fully ready(complete our deploy script,complete runner binary installation), then it starts running github actions.

When I use the idle config, there are some time intervals during bussiness hours when runners will be idle, during that window I want to implement this feature stop/start, when runner instance is idle, it should stop that instance, and any github action is triggered, the stopped instance should start and pick up that action to finish it.

So that here I'll be using idle config in my github runner automation and also I'll be having some cost savings.

I think this request can be consider as Feature/enhancements to this current modules.

Let me know if you need more clarification.

Thanks, Srikanth

gertjanmaas commented 3 years ago

we build our custom AMI too, but when we use that custom AMI to launch instance , we need to pass our deploy script(this script includes few steps like adding the instance to realmd, installing splunk forwarder, installing newrelic infra agent etc) while launching a instance. This would take 10-12min to finish.

Is it not possible to install those things inside the AMI? Then only do the configuration in the startup script?

As I said before this is not the intent of the module. But if there is a good way to implement this, in a simple way that won't compromise our current implementation, then I'm okay with adding this.

skyzyx commented 3 years ago

I'd have to say I very much agree with @mcaulifn and @gertjanmaas. The proper solution is to shift the installation of these things "left" so that they are fully pre-baked into a dedicated AMI built exclusively for this purpose.

I can confidently say that the Splunk forwarder and New Relic Infrastructure Agent can both be pre-installed into the root AMI, then just pass the license keys for different environments in as either user_data or as environment variables (if you're using it for Amazon ECS or EKS).

Honestly, the entire purpose of this project is to enable dynamically-scalable runners. I feel like if you're not wanting them to be dynamically-scalable, you're probably using the wrong project. Just stand-up 3–5 runners the normal way.

ScottGuymer commented 2 years ago

I am working on an pre-baked image to help simplify the install process and speed up start times. Take a look in #1444

ppolewicz commented 2 years ago

In some cases using an AMI is not an option - it takes like an hour to build one and then the startup time is abysmal, because we have a huge file that needs to be read into memory on start, but due to the way EC2 EBS are created from AMIs (lazily cloned from S3), sequentially reading a file that has a few GB can take several minutes (at least in the region we operate) when you read it the first time. AMIs are good when they don't contain a lot of data, I think. There is an option to make an EBS hot so that cloning it is eager, but that's like $500/mo in our case and I don't think you can apply it to AMIs anyway.

Therefore an option to shutdown instead of terminate an instance would be a nice additon, and also we'd need the ability to use an existing powered off instance if one is available, before we decide to make a new one.

ppolewicz commented 2 years ago

To give more info about our use case:

huge boxes in terms of CPU and RAM, nothing you can have at home
quite a few tests
each test takes like a minute to execute, so we need multiple workers
a team is working on this and as they do, they'll push to PRs because running at test on their dev station is out of the question

Running a single box is thousands of dollars per month per piece to make it execute in a reasonable time in parallel is a ton of $. Creating a new machine every time with cloud-init or having it start up for 12min because EBS doing a lazy copy from AMI and big file seems... suboptimal.

BTW theoretically, if I'd provide a typescript developer, could someone here supervise his work on creating the PR for this change?

ScottGuymer commented 2 years ago

I'm not fully opposed to the idea. I can see there might be some use cases for it when you have specialised runners with complex config and slow startups. (for instance with windows pools)

I did have a couple of thoughts

The time needed to clone a huge AMI ready to be started can be an issue and depending on your use case could be a deal breaker. Which I get.

In some cases using an AMI is not an option - it takes like an hour to build one and then the startup time is abysmal

The AMI build time is largely irrelevant as it's something out of the scope of the VM launch/GH build time.
By stopping and starting a VM would you not encounter the same startup time? (minus the cloning time) You would still have to read the huge file into memory.

Are there other ways of bringing this huge file locally at boot to improve the AMI clone speed? Maybe mounting it from a single EBS?

Currently the credentials for GH are generated and passed to each runner individually and then deleted when they are used to start the agent.. This could pose some issues when stopping/starting agents and would need some thought.

I think that this behaviour would need a lot of changes to the lambdas to be able to identify a pool of stopped instances and then control them as needed.

Maybe it could be integrated with the recent pool changes #1577 to allow the pool to be shut down when idle..

npalm commented 2 years ago

link to discussion #1428

ppolewicz commented 2 years ago

The first read of a file after creating an instance from the AMI is very slow. On the second read it's way faster though because that data is now actually stored on EBS. Why does the AMI thing not copy the entire thing to EBS when creating an instance (unless you pay $500) is beyond me.

The AMI creation/uploading process is slow in itself. We want a fast iteration cycle. It turns out that running a stock Ubuntu/Amazon Linux and fetching the data from S3 is very, very fast, so that's what we are doing. If we can use terraform-aws-github-runner, that'll allow us to get 10 or 20 workers to run and the first time they run, they'll get data from S3 (which is way faster than getting it from AMI), but the next time we won't need to get or install anything (except for the test data from github repo) so the startup time will be very small and the entire CI will finish quickly.

Things to do:

[ ] temporary credentials for GH to pass to runner are deleted after the agent starts (don't ever delete them I guess?)
[ ] identify a pool of stopped instances to use instead of starting a new one / starting more VMs
[ ] actually top up the pool to the required level from the stopped ones (but create more if that's not enough)
[ ] shutdown instead of terminate when the job is done
[ ] test to see if it all works as intended
[ ] document new usage

Did I miss anything major?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

ppolewicz commented 2 years ago

I may have missed the fact that this may not work very well with multiple lambdas trying to access the same cluster unless a locking mechanism is used to prevent two lambdas from trying to activate the same server.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

ceejatec commented 1 year ago

For what it's worth, this would be very useful for us too. We've already got a bank of self-hosted runners in datacenter, and we're looking to augment in AWS. A lot of our jobs make use of the fact that the existing runners are long-lasting, in particular for caching things like Go workspaces, Bazel stuff, etc. With these AWS runners, we only get the benefit of caching between jobs if the jobs come in fast enough to avoid the instances being terminated.

From my experience with the runner toolkit in-datacenter, I don't think there are too many issues that need to be addressed here... in particular, once you've configured a runner, it stores whatever keys and things it needs locally, and that doesn't need to be refreshed when you stop/start or reboot the instance.

I'm sure there are complications when you consider pooling runners and having multiple lambdas sharing a cluster and so on, but for the simple cases, it really would be nice to have the option of having the lambdas just Stop the instance and then restart it when it's needed.

philips-labs / terraform-aws-github-runner

options to avoid terminating/spinning up new instances #1166