pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

[RFC] Refreshable GHA runner environments #5391

Open seemethere opened 3 weeks ago

seemethere commented 3 weeks ago

Context

Given the insecurity of long-running, non-ephemeral instances, we need to develop an ephemeral environment in which to execute our GitHub Actions workloads.

Ideally, any solution that we pursue should meet a couple of parameters:

What could a potential solution look like?

We can utilize rootless docker in docker to achieve most of these goals: we run a single container as the GHA daemon and a sidecar container as the rootless docker in docker daemon (without --privileged, to avoid jailbreaks). From there we can build the containers to automatically exit after the GHA daemon completes, and have them refresh using something like docker compose to manage the containers at the local level.
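A minimal sketch of what that refresh loop could look like, assuming a compose file (hypothetically named `runner-compose.yml`) that defines the GHA daemon and the docker in docker sidecar, and assuming the GHA daemon container exits on its own once its job completes. The function names are made up for illustration; only the `docker compose` subcommands and flags are real.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumption: a compose file with two services, the GHA daemon and the
# rootless docker in docker sidecar. The file name is hypothetical.
COMPOSE_FILE = "runner-compose.yml"


def compose_cmd(*args: str, compose_file: str = COMPOSE_FILE) -> List[str]:
    """Build a `docker compose` command line for the given compose file."""
    return ["docker", "compose", "-f", compose_file, *args]


def refresh_cycle(
    run: Callable[[Sequence[str]], int] = subprocess.call,
    compose_file: str = COMPOSE_FILE,
) -> int:
    """Run one generation of the containers, then throw away their state."""
    # --abort-on-container-exit stops every service as soon as any container
    # exits, so the sidecar goes down when the GHA daemon finishes.
    exit_code = run(
        compose_cmd("up", "--abort-on-container-exit", compose_file=compose_file)
    )
    # `down --volumes` removes the containers together with their volumes,
    # so the next generation starts from a clean slate.
    run(compose_cmd("down", "--volumes", compose_file=compose_file))
    return exit_code


def refresh_loop(
    max_cycles: int,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> List[int]:
    """Repeat the cycle a bounded number of times and collect exit codes."""
    return [refresh_cycle(run=run) for _ in range(max_cycles)]
```

The `run` parameter is only there so the loop can be exercised without a Docker daemon; in practice something like a systemd unit could own the outer loop instead.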

If we took this approach, we could also use something like cgroup slices to partition larger nodes into smaller ones: assigning a cgroup slice to both the GHA daemon container and the docker in docker container would ensure they don't over-utilize resources on the node.
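To make the partitioning idea concrete, here is a rough sketch that splits a node evenly into slots and emits a systemd slice unit per slot. It assumes the host uses systemd and that Docker is configured with the systemd cgroup driver, so that both containers of a slot can be started with `--cgroup-parent` pointing at the slot's slice. The slice names and helper names are hypothetical; `CPUQuota=` and `MemoryMax=` are standard systemd resource-control settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slot:
    name: str               # systemd slice name (hypothetical naming scheme)
    cpu_quota_percent: int  # 100 means one full CPU
    memory_mb: int


def partition_node(total_cpus: int, total_memory_mb: int, slots: int) -> List[Slot]:
    """Split a node's CPUs and memory evenly across `slots` runner slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return [
        Slot(
            # systemd treats "gha-slot0.slice" as a child of "gha.slice".
            name=f"gha-slot{i}.slice",
            cpu_quota_percent=(total_cpus * 100) // slots,
            memory_mb=total_memory_mb // slots,
        )
        for i in range(slots)
    ]


def slice_unit(slot: Slot) -> str:
    """Render a slice unit whose limits cap everything placed inside it."""
    return (
        "[Slice]\n"
        f"CPUQuota={slot.cpu_quota_percent}%\n"
        f"MemoryMax={slot.memory_mb}M\n"
    )


def cgroup_parent_args(slot: Slot) -> List[str]:
    """Arguments to pass to `docker run` for both containers of a slot."""
    return ["--cgroup-parent", slot.name]
```

Putting the limits on the slice rather than on each container means the GHA daemon container and the docker in docker container share one budget, which matches the intent of treating each slot as a smaller node.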

seemethere commented 2 weeks ago

So I did some experiments with this (https://github.com/seemethere/refreshable-infra) over the past week and, unfortunately, I don't think there's actually a way of achieving docker in docker without running --privileged.

Basically if we want to use docker within our CI our options for refreshable infra become pretty limited.

There is some hope though: I discovered a pretty obscure AWS feature that allows you to replace the root volume of a running EC2 instance as a way to do a hot swap. It might prove promising, but it is far from the ideal of a vendor-agnostic solution.
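For reference, a sketch of how such a hot swap could be driven from Python, assuming the feature in question is EC2's replace-root-volume task and that the instance should be restored from a known-good AMI. The function name and polling parameters are made up for illustration; `ec2_client` is expected to be a boto3 EC2 client (for example `boto3.client("ec2")`), and it is passed in so the sketch itself has no hard dependency on boto3.

```python
import time
from typing import Any

# States after which the task will not change any further.
TERMINAL_STATES = {"succeeded", "failed", "failed-detached"}


def replace_root_volume(
    ec2_client: Any,
    instance_id: str,
    image_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Swap the root volume of a running instance and wait for the result.

    Returns the final task state, or "timeout" if polling gave up.
    """
    response = ec2_client.create_replace_root_volume_task(
        InstanceId=instance_id,
        ImageId=image_id,               # restore the root volume from this AMI
        DeleteReplacedRootVolume=True,  # discard the old, possibly dirty volume
    )
    task_id = response["ReplaceRootVolumeTask"]["ReplaceRootVolumeTaskId"]

    for _ in range(max_polls):
        described = ec2_client.describe_replace_root_volume_tasks(
            ReplaceRootVolumeTaskIds=[task_id]
        )
        state = described["ReplaceRootVolumeTasks"][0]["TaskState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    return "timeout"
```

A refresh hook could call this after each job completes, for example `replace_root_volume(boto3.client("ec2"), instance_id, image_id)`, with both IDs supplied by whatever manages the fleet.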

jeanschmidt commented 2 weeks ago

It would be very difficult to build refreshable infra out of Docker containers with somewhat restricted permissions if we also need to support Docker inside them. Maybe we should migrate the workflows to refreshable infra and simultaneously drop support for Docker in our workflows.