ay0o opened this issue 3 weeks ago
Thanks for taking the time to create the issue. We also use this setup with a large org. For onboarding repos to runners (app and groups) we have self-service automations in place. But I agree, without automation this won't scale.
What we do differently is run only about 10 different runner groups for the full org in ephemeral mode, which means nothing is shared between repositories. The multi-runner setup was created with the scope of making it easy to support several fleets. It is amazing that you are already at 30 and up. At that time (and given time constraints) we chose not to make the lambda logic more complex, but to simply deploy the control plane once per configuration, because that way we could reuse all the existing logic.
At that time the webhook was still storing the configuration in the lambda environment variable, which stops working at some point (about 6-10 groups). For that reason we moved the configuration to SSM, but indeed again with a limitation on scaling.
The question now is what the best way forward is. Would it make sense to move the configuration again? And what are the valid options? The good news is that the configuration is now managed in one place (ConfigLoader), which means that adding or changing the direction is relatively simple.
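To illustrate that abstraction point, here is a rough sketch (assumed names and shapes, not the actual code) of how the current single-parameter loading could sit behind one interface, so that changing the direction is mainly a matter of adding another implementation:

```typescript
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';

// Hypothetical shape of one runner configuration entry (illustrative only).
interface RunnerMatcherConfig {
  id: string;
  labelMatchers: string[][];
  exactMatch: boolean;
}

// The configuration source is hidden behind one interface, so swapping the
// backend does not touch the callers.
interface ConfigLoader {
  load(): Promise<RunnerMatcherConfig[]>;
}

// Current direction as described above: the whole multi-runner config lives in
// a single SSM parameter, capped at 4 KB (standard tier) / 8 KB (advanced tier).
class SingleParameterLoader implements ConfigLoader {
  constructor(
    private readonly ssm: SSMClient,
    private readonly parameterName: string,
  ) {}

  async load(): Promise<RunnerMatcherConfig[]> {
    const result = await this.ssm.send(
      new GetParameterCommand({ Name: this.parameterName, WithDecryption: true }),
    );
    return JSON.parse(result.Parameter?.Value ?? '[]');
  }
}
```

Any alternative backend (multiple parameters under a path, S3, DynamoDB, ...) would then slot in as another `ConfigLoader` implementation.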
Also wondering: with already over 30 configurations for the multi-runners, are you not hitting other limitations?
By " If the runners are shared by all repositories in the organization" I meant, if the runners were deployed at the organization, just like you're doing. In fact, it seems we were doing the same, different instances of the module (because we can't deploy to different VPCs within a single instance, this would be a nice feature) with different pool of runners, all of them ephemeral. The largest had 9 runner configurations.
As said, it just works. However, the company is now demanding more visibility into how much each project is spending, and this includes the GitHub runners. In order to provide this, we need a different runner configuration for each project (different matcher labels, runner group, and a `Project` tag for Cost Explorer). And here's where we hit the wall. We started to add the configurations per project, and at about 15 we got the error that we exceeded the allowed size for the parameter in Parameter Store. The advanced tier doubles the size from 4 KB to 8 KB, so I'm assuming I could reach about 30 configurations, but I didn't actually test it because that wouldn't work for us either.
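To make the per-project setup concrete, this is roughly what one such configuration would look like once serialized into the parameter (the field names here are my own illustration, not the module's exact schema):

```typescript
// Illustrative only: field names are assumptions about what one per-project
// runner configuration roughly contains, not the module's real schema.
const project1Config = {
  matcherConfig: {
    labelMatchers: [['self-hosted', 'project_1']],
    exactMatch: true,
  },
  runnerConfig: {
    runner_group_name: 'project_1',
    // Project tag propagated to the EC2 instances for Cost Explorer reports.
    tags: { Project: 'project_1' },
  },
};

// Each entry like this serializes to a few hundred bytes of JSON, and real
// entries carry more fields (instance types, subnets, ...), so hitting the
// 4 KB standard-tier limit around 15 configurations is plausible.
console.log(`${JSON.stringify(project1Config).length} bytes for one entry`);
```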
Maybe it could be possible to store each config independently, instead of storing the full `multi_runner_config` map in a single parameter?
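A sketch of what I mean, assuming one SSM parameter per runner configuration under a shared path prefix (the path and config shape are illustrative, not the module's actual layout):

```typescript
import { SSMClient, GetParametersByPathCommand } from '@aws-sdk/client-ssm';

async function loadRunnerConfigs(pathPrefix: string): Promise<unknown[]> {
  const ssm = new SSMClient({});
  const configs: unknown[] = [];
  let nextToken: string | undefined;

  do {
    // GetParametersByPath is paginated (at most 10 parameters per page).
    const page = await ssm.send(
      new GetParametersByPathCommand({
        Path: pathPrefix, // e.g. '/github-runners/runner-configs/'
        Recursive: true,
        WithDecryption: true,
        NextToken: nextToken,
      }),
    );
    for (const parameter of page.Parameters ?? []) {
      configs.push(JSON.parse(parameter.Value ?? '{}'));
    }
    nextToken = page.NextToken;
  } while (nextToken);

  return configs;
}
```

With that layout, each configuration only has to fit the per-parameter size limit on its own, so the number of configurations is no longer bounded by a single 4/8 KB blob.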
I have found that using the SSM parameter to store the multi-runner configuration is quite limiting in terms of scalability. Even using the advanced tier, it allows about 30 different configurations at most. You might think this is enough, but check this out.
Assume an organization with dozens of projects, each with several repositories. If the runners are shared by all repositories in the organization, everything is good.
However, let's say we want each project to have its own runners. This could be as easy as creating new runner configs within the multi-runners config, using different labels per project to choose a runner (e.g. `self-hosted, project_1`, `self-hosted, project_2`). The problem is that, as mentioned above, since the tool stores the whole multi-runners config in a single SSM parameter, we reach the maximum size at about 30 configurations.

So, the alternative is to actually deploy an instance of this module per project, but this leads to another issue. If the GitHub Apps (one per instance) are installed at the organization level, this module breaks due to cross-project usage.
For example, let's say a job from a repository that belongs to `project_1` is triggered. The message will be sent to the webhook, but as the GitHub Apps are installed at the organization level, any of them might receive it. This means that maybe the webhook for `project_2` was the one that received the message from the job with labels `self-hosted, project_1`. Depending on whether `repository_whitelist` is used or not, the message in the webhook will be different (not authorized or unexpected labels), but the ultimate outcome is that the webhook will not publish a message to SQS and therefore the EC2 instance will not be created.

The only working solution is to install the apps on specific repositories. For every new repository, the GitHub App needs to be installed in it, and the repository should also be added to the project's runner group. Depending on the size and how active the organization is, this may be manageable. For me, sitting at over 1k repositories, I can tell you it's not.
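To illustrate the label mismatch described above, here is a minimal sketch (simplified matcher logic, not the module's actual webhook code) of why the webhook for one project silently drops a job labeled for another:

```typescript
// Each webhook instance only knows its own label matchers, so a job carrying
// another project's labels never results in an SQS message.
function jobMatches(
  jobLabels: string[],
  labelMatchers: string[][],
  exactMatch: boolean,
): boolean {
  return labelMatchers.some((matcher) =>
    exactMatch
      ? jobLabels.every((label) => matcher.includes(label)) // every job label must be known
      : matcher.every((label) => jobLabels.includes(label)), // matcher labels must be a subset
  );
}

// Webhook deployed for project_2 receives a workflow job from a project_1 repo:
const project2Matchers = [['self-hosted', 'project_2']];
const accepted = jobMatches(['self-hosted', 'project_1'], project2Matchers, true);
console.log(accepted); // false -> no message published to SQS, no EC2 instance created
```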
So, the bottom line is that the tool should use a different approach than a single SSM parameter to store the multi-runner config, so that a single instance of the module can scale to hundreds of configurations if needed.