Further investigation into self-managed gha runners - Githubissues

ministryofjustice / operations-engineering

This repository is home to the Operations Engineering's tools and utilities for managing, monitoring, and optimising software development processes at the Ministry of Justice. • This repository is defined and managed in Terraform

https://user-guide.operations-engineering.service.justice.gov.uk/

MIT License

Further investigation into self-managed gha runners #4450

Closed levgorbunov1 closed 1 month ago

levgorbunov1 commented 3 months ago

User Need

As a GitHub Enterprise admin team, I want a failover system that deploys self-managed gha runners able to build containers, so that CI/CD processes can continue when the gha quota is saturated.

Value

This has been investigated twice now. The first investigation concluded that we cannot run privileged containers on CP and therefore cannot build containers using Docker. The most recent firebreak investigation concluded that it may be possible to build containers on CP using rootless podman, but a complete solution was not found (https://github.com/ministryofjustice/operations-engineering/issues/4418). It may be worth (1) continuing to investigate building containers in a rootless way, or (2) investigating creating our own infrastructure to run privileged containers, e.g. EKS (potentially high maintenance cost) or deploying ECS tasks to an EC2 ASG (maybe lower maintenance cost?).

Acceptance Criteria:

Decide on which strategy to investigate further: rootless or rootful.
Create a POC showing how we can possibly spin up self-managed gha runners which can build containers.

levgorbunov1 commented 1 month ago

Plan:

Set up our own EKS cluster in Mod Platform using AP modules
Set up Karpenter for node autoscaling
Set up KEDA for container autoscaling
Fork AP runner image repo - https://github.com/ministryofjustice/analytical-platform-actions-runner
Set up Docker
Set up a trigger to deploy self-managed runners to culprit repositories when minutes run low
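
The last step of the plan, the trigger, could be sketched as a simple threshold check on remaining minutes. This is a minimal sketch only: the function name `should_deploy_failover`, the 10% threshold and the sample figures are hypothetical, and the field names `total_minutes_used` and `included_minutes` assume the shape of GitHub's Actions billing response for an organisation. It is not existing team code.

```python
from typing import Mapping


def should_deploy_failover(
    billing: Mapping[str, float],
    min_remaining_fraction: float = 0.10,  # hypothetical threshold
) -> bool:
    """Return True when the remaining share of included minutes is at or below the threshold."""
    included = billing["included_minutes"]
    used = billing["total_minutes_used"]
    if included <= 0:
        # No included quota at all: treat the quota as exhausted.
        return True
    remaining_fraction = max(included - used, 0) / included
    return remaining_fraction <= min_remaining_fraction


# Hypothetical figures for illustration only, not real usage data.
sample = {"included_minutes": 50000, "total_minutes_used": 46000}
print(should_deploy_failover(sample))
```

A check like this could run on a schedule and, when it returns True, kick off whatever deploys the runners to the culprit repositories. How that deployment happens is left open here, as it is in the plan.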

levgorbunov1 commented 1 month ago

Spinning up our own cluster adds complexity and breaks the current hosting strategy.

Perhaps think about using another EKS cluster e.g. using AP's cluster or asking CP to spin us up a cluster?

levgorbunov1 commented 1 month ago

Also worth considering what other things besides infrastructure can be done to reduce quota consumption e.g. making repos public

levgorbunov1 commented 1 month ago

Are we sure that we need to run privileged containers? Are there container operations in the culprit pipelines?

levgorbunov1 commented 1 month ago

There are a couple of internal repositories with high Actions minutes consumption which do not do container operations in their CI/CD. It may be possible to create self-managed non-privileged runners for these, on CP or ECS Fargate in Mod Platform.

https://github.com/ministryofjustice/opg-org-infra/tree/main - 1,368 minutes in July - Charles Marshall:

Met with Charles Marshall 23/07/24
Going to tune down most thirsty workflow (secret rotation)
Quota exhaustion doesn't have much of a disruptive effect

https://github.com/ministryofjustice/staff-infrastructure-azure-landing-zone/tree/main - 1,252 minutes in July - Alan Collier:

Met with Alan Collier 24/07/24
Outages are not particularly disruptive
Unclear why this repository is internal
Agreed to look at turning down the high-consuming Terraform static analysis workflow

levgorbunov1 commented 1 month ago

A couple of internal repos do Docker operations in their CI/CD; I asked whether these need to be internal:

https://github.com/ministryofjustice/opg-sirius-infrastructure - 1,171 minutes in July - Charles Marshall:

Quota exhaustion has a more serious effect, blocking the work of ~10 developers.
However, minutes usage is only high at the moment due to an ongoing migration project.

https://github.com/ministryofjustice/Wardship - 1,207 minutes in July - Mark Butler:

Mark has made the repo public, and is looking into making some other repos public, saving us minutes!

levgorbunov1 commented 1 month ago

Conclusion:

As part of this project we were able to save at least 1,645 minutes by converting internal repositories into public ones, reducing pressure on the internal gha minutes quota. Further savings are likely as the teams we contacted tune down their high-consuming workflows.

An infrastructural solution?

We were able to reduce the pressure on the quota without spinning up new infrastructure, which should stave off this issue becoming critical in the near future, particularly given that @AntonyBishop is looking at increasing the quota size. However, it is difficult to predict how quota consumption will change: we are investigating migrating CircleCI pipelines to gha, some repositories are in the process of being made redundant, and so on. It may therefore still be worth investigating infrastructure as a failover mechanism in case we hit the quota limit, which could cause disruption to teams across the organisation.

However, some of the more critical internal repositories, which experience more serious disruption upon quota saturation (opg-sirius), perform privileged pipeline operations. We currently don't have the infrastructure to support this due to security limitations, so at this time we cannot offer a self-managed solution to these customers. A few less critical repositories that are less affected by downtime do not perform privileged operations in their CI/CD (azure-lz, opg-org-infra), and are therefore potential candidates for a solution based on non-privileged self-managed runners. The open question is whether the cost of that infrastructure would be justified by the risk that complete quota saturation poses to these customers, given that, as previously mentioned, these repositories aren't critical.

This project focussed on the top 4 offending internal repositories, all with over 1k minutes consumed per month; however, there are potentially more customers affected by this problem. It is now up to the team to decide whether we want to build a failover system based on non-privileged self-managed gha runners, giving customers who run unprivileged CI/CD workloads 100% runner availability in the event of complete quota saturation, or whether to rely on the alternative mechanisms of controlling minutes usage practised in this project.
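
The grouping behind this conclusion can be sketched from the July figures quoted earlier in this thread. The minutes and the container-operation flags are taken from the comments above; the data structure and the names `RepoUsage` and `non_privileged_candidates` are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RepoUsage:
    name: str
    july_minutes: int
    builds_containers: bool  # pipeline does Docker operations
    now_public: bool = False  # made public during this project


# Figures as reported in this issue for July.
REPOS = [
    RepoUsage("opg-org-infra", 1368, builds_containers=False),
    RepoUsage("staff-infrastructure-azure-landing-zone", 1252, builds_containers=False),
    RepoUsage("opg-sirius-infrastructure", 1171, builds_containers=True),
    RepoUsage("Wardship", 1207, builds_containers=True, now_public=True),
]


def non_privileged_candidates(repos: List[RepoUsage]) -> List[str]:
    """Still-internal repos whose pipelines do no container operations."""
    return [r.name for r in repos if not r.builds_containers and not r.now_public]


total = sum(r.july_minutes for r in REPOS)
print(total)  # 4998 minutes across the four repositories in July
print(non_privileged_candidates(REPOS))
```

On these figures, the two candidates for non-privileged runners are opg-org-infra and staff-infrastructure-azure-landing-zone, which matches the repositories named in the conclusion.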