oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

want target for multi-node live migration jobs #23

Open jordanhendricks opened 1 year ago

jordanhendricks commented 1 year ago

A concrete goal I have for propolis is running inter-machine live migrations of representative guest workloads in CI (oxidecomputer/propolis#360). The world I would like to move toward is collecting the various product assurance guest test configurations and workloads we have in mind (oxidecomputer/product-assurance#1 and https://github.com/oxidecomputer/product-assurance/issues/17 come to mind) and migrating guests running such workloads. Migrating guests with these workloads in CI will be very valuable on its own, but we could also include tests for performance regressions building off of existing work (oxidecomputer/propolis#347 and oxidecomputer/propolis#324 more generally).

While I am focused on propolis live migration specifically, in building this feature we will lay the groundwork for tests that will surely serve the rest of our upstack software, including omicron (even running omicron-driven migrations, perhaps) and crucible.

As a minimum starting point, I think a motivating example is a test I ran a lot during the recent time-related live migration work: in a shell one-liner (a forever while loop), call a simple binary that reads the TSC, then sleep for a second; migrate the guest and check that the guest TSC looks right on the other side. This test is nice because it doesn't require anything in the way of networking, only a disk, a serial console, and some minimal post-processing. For an image, we could use a file-backed disk that lives on the lab NFS share.
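For concreteness, the in-guest side of that test is roughly the following (here `read-tsc` is a stand-in name for whatever small binary prints the current TSC value):

```sh
# runs in the guest over the serial console; `read-tsc` is a hypothetical
# binary that prints the current TSC reading on each invocation
while true; do
    read-tsc       # emit a TSC sample for post-processing on the host side
    sleep 1
done
```

Post-processing then just checks that the samples from before and after the migration line up the way we expect.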

As for buildomat specifics, @jclulow and I chatted about some ideas here last week. At a minimum, we will need:

I need to flesh out the mechanics of how we orchestrate the migration a bit more, and will continue adding to this ticket as I do research and prototyping. One could imagine a script on one of the hosts talking to the server API directly to coordinate setting up an instance (along with disks, vnics, etc.) and migrating it (a rough sketch of this is below). Since we already have PHD, though, it would be nice to leverage that where we can. I believe PHD does not support networking today, so we would need to add it. I am less certain how easy it would be to modify PHD to adopt a server-like model here to coordinate some of the mechanics, but I am aware that @gjcolombo has thoughts.
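As a very rough sketch of the "script talks to the server API" option (the addresses, endpoint paths, and payloads below are placeholders, not the real propolis-server interface):

```sh
#!/bin/bash
# Placeholder sketch: drive a migration by hand from one of the hosts.
# The two server addresses are assumed to be known up front, and the
# endpoint paths/payloads are illustrative only.
SOURCE=10.0.0.1:12400
TARGET=10.0.0.2:12400

# ensure an instance (plus disks, vnics, etc.) exists on the source
curl -X PUT "http://$SOURCE/instance" --data @instance-spec.json

# ask the target server to create its instance by migrating from the source
curl -X PUT "http://$TARGET/instance" \
    --data "{\"migrate_from\": \"$SOURCE\"}"
```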

gjcolombo commented 1 year ago

Some initial unedited thoughts:

I think it should be possible to refactor PHD's framework crate to do a lot of what we want here. Here is roughly how it works today:

Most of the problem here is that the VM factory and TestVm struct have baked in the notion that Propolis servers run on the same machine as the test runner. I think we could refactor that as follows:

Then, in CI, we would need something like the following:

After that, tests that want to create a remote Propolis just ask the factory for one. For example, to make the existing LM smoke test cross-machine, we substitute the call to new_vm_from_cloned_config with some other call (or add a parameter) that specifies that the VM should be created on a machine other than the one hosting the source VM; the factory and framework then reason about the location of the source and choose a target that meets the required constraints.

gjcolombo commented 1 year ago

That was a big brain dump that contains a lot of inside baseball in the Propolis repo. The key point in the above design from a Buildomat point of view is that, for this model to work, we need to be able to work out enough configuration a priori (like address/port allocations for the processes we want to talk to on the "secondary" machines in the job) to tell the test runner how to stitch together its own control plane from the disparate jobs running on each machine.
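Concretely (all of the names below are made up for illustration, and the runner invocation is schematic), the job program on the node that runs the test runner might just be handed the pre-allocated coordinates of the other node and pass them straight through:

```sh
# hypothetical: the address/port of the secondary node's propolis-server
# is allocated a priori and injected into the job environment by buildomat
SECONDARY_HOST=10.0.0.2
SECONDARY_PROPOLIS_PORT=12400

# the test runner is told up front where the remote pieces will live,
# rather than having to discover them itself (flag name is made up)
phd-runner run \
    --remote-propolis-server "$SECONDARY_HOST:$SECONDARY_PROPOLIS_PORT" \
    "$@"
```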

jclulow commented 1 year ago

Yes, that all seems quite reasonable. I agree buildomat should be able to provide a different job program for each node; e.g., the server could always run on node 1 and the client on node 2. I would like buildomat to be able to provide an API for the job to:

If another node experiences a fault, we could "poison" the baton or barrier so that a script blocked on it on the node that hasn't failed gets an error when it tries to acquire. This should probably be pretty easy to program in a shell script, whether for ping-pong between two programs or for coordinated arrival at various points.
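A sketch of how that might look from a job script, assuming a hypothetical barrier interface on the bmat control program (nothing like this exists yet):

```sh
# hypothetical interface: block until every node has reached the named
# barrier, exiting non-zero if another node has poisoned it
if ! bmat barrier wait setup-done; then
    echo "a peer node failed before setup-done; bailing out" >&2
    exit 1
fi

# ... and on a node whose setup has failed, something like this would
# poison the barrier so the waiter above errors out instead of hanging
bmat barrier poison setup-done
```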

Then your PHD stuff depends on buildomat for limited coordination during startup, node/network discovery, and the usual log output and output artefact collection, etc.

internet-diglett commented 1 year ago

If I may piggyback on this idea, networking would also love multi-node test targets for deploying omicron into an environment with two "scrimlets" running SoftNPU. This would essentially take what is currently launched in the Buildomat deploy task of Omicron, but cluster it with a second host that has a sled-agent running in scrimlet mode. This would allow us to exercise multi-switch logic from the control-plane all the way down to Dendrite / P4.

jclulow commented 1 year ago

As part of fixing #19, I have done some plumbing that should hopefully be helpful here as well. There is now a control program, bmat, that can be run inside the job. It currently provides access to a per-job key-value store (added in e727bf8ba2f00d12bc2b0cf0b3a3c88843d56a2e), which the GitHub layer then uses to communicate a token to the job after the job is running (6cd4797a814969f69872d0c34893eb8835d313f4). The control program has relatively easy access to make requests to the core API, using the access token held by the agent itself.

I think this will be a good base mechanism for providing both synchronisation primitives (e.g., barrier, baton, etc.) and node discovery from inside the running job.
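For example, node discovery could be layered on top of the key-value store roughly like this (the `bmat store` subcommands, their exit-status semantics, and the NODE_NAME/MY_IP variables are all assumptions here):

```sh
# each node publishes its own address under a well-known key ...
bmat store put "node/$NODE_NAME/ip" "$MY_IP"

# ... then polls until its peer has published one too (assuming `get`
# exits non-zero while the key is still absent)
while ! PEER_IP=$(bmat store get "node/peer/ip"); do
    sleep 1
done
echo "peer is reachable at $PEER_IP"
```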