microsoft / openvmm

Home of OpenVMM and OpenHCL.
http://openvmm.dev/
MIT License
1.36k stars 59 forks source link

Support multiple Github actions runners on the same machine #158

Open tjones60 opened 4 days ago

tjones60 commented 4 days ago

Use separate rustup and cargo directories for each Github runner in case there are multiple on the same machine, and introduce retries for apt-get to avoid errors related to multiple processes trying to install things. Also makes sure that GCC is installed on Linux.

jstarks commented 1 day ago

This way lies madness. If we need to support multiple concurrent jobs on the same physical machine, then we need to run those jobs in containers or VMs.

daprilik commented 1 day ago

John brings up a good point, and its one I probably should've realized sooner when discussing this work with you...

Lets hold off on this PR until we align on the broader goals we're trying to achieve here

tjones60 commented 1 day ago

This way lies madness. If we need to support multiple concurrent jobs on the same physical machine, then we need to run those jobs in containers or VMs.

The idea here is to make better use of the physical ARM test machines that we have, since nested virtualization is not supported yet. I've tested this with 4 concurrent runners on the same machine and it works well. Also, this might help mitigate some occasional failures on our current 1ES CI machines due to apt. Also, I think these changes are a good idea whether or not we have multiple runners on the same machine, in case there are other processes running on them.

daprilik commented 1 day ago

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land. While it's certainly a bit suspect that we keep seeing those dpkg lock failures on CI machines that are ostensibly wiped between runs, it does seems like a common enough issue reported my many folks that I'm willing to chalk it up to CI weirdness, and work around it.

I think these changes are a good idea whether or not we have multiple runners on the same machine, in case there are other processes running on them.

I think this falls under the umbrella of "this way lies madness". We wouldn't want this happening in CI, and on a user's local box, I think its reasonable to assume folks are rational actors aren't trying to mutiplex system-wide tool installs on a regular basis

since nested virtualization is not supported yet

Ahh... is this so that the same ARM host box can run multiple VMM test jobs from multiple pipeline run submissions?

tjones60 commented 1 day ago

Ahh... is this so that the same ARM host box can run multiple VMM test jobs from multiple pipeline run submissions?

Yes this is the primary motivation for this change. We could go without it, but CI will be slower.

jstarks commented 1 day ago

We're going to cross compile from x64, right? So this is just for unit and VMM tests.

Can we just up the concurrency limits on ARM tests to better utilize the machines (and free them up the for the next user sooner), rather than try to run multiple jobs concurrently?

jstarks commented 1 day ago

For Linux, we could also run in containers. For Windows that isn't an option, really, due to the lack of WHP support in Windows containers. (Maybe we should get that fixed.)

tjones60 commented 1 day ago

We're going to cross compile from x64, right? So this is just for unit and VMM tests.

Can we just up the concurrency limits on ARM tests to better utilize the machines (and free them up the for the next user sooner), rather than try to run multiple jobs concurrently?

We could cross compile from x64, but the machines have lots of extra resources, so I was thinking of using them for some of the compilation jobs as well. Those jobs could use VMs or containers though, even on Windows. The unit tests probably also don't need virtualization support, so this would really just be for VMM tests.

Increase concurrency limits might help a little in the future, but right now most of the time is spent doing things that aren't easily parallelized within the context of a single job.

daprilik commented 1 day ago

@jstarks, trevor shared some of the preliminary numbers from running build jobs on the ARM boxes, and the results are really good. We are talking 2x improvements over the existing 1ES runners. I'm all for leveraging them for our compile jobs, including for cross-compiling x86 from ARM.

@tjones60, can we improve VMM test execution speed by caching VHDs on a shared persistent disk? there'd be some flowey tweaks to get persistent_dir working with these runners, but all the logic for caching is there.

jstarks commented 1 day ago

OK, great. Then we can even containers on Windows for the build steps.

Concurrent jobs on the same OS instance is not something we want to do.

jstarks commented 1 day ago

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land.

Maybe we should remove the retry loop for the local case (does this run in the local case?)

daprilik commented 1 day ago

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land.

Maybe we should remove the retry loop for the local case (does this run in the local case?)

It does run in the local case, where it'll prompt the user to install required packages during build-igvm invocations and such.

I think its reasonable to only run once when using the local backend, and have it run N times on ADO / GH, with a clear comment explaining that it is a workaround for CI infra

tjones60 commented 1 day ago

I opened a new PR with just the apt retry changes: https://github.com/microsoft/openvmm/pull/174