oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

want target for multi-node live migration jobs #23

Open jordanhendricks opened 1 year ago

jordanhendricks commented 1 year ago

A concrete goal I have for propolis is running inter-machine live migrations of representative guest workloads in CI (oxidecomputer/propolis#360). The world I would like to move toward is collecting the various product assurance guest test configurations and workloads we have in mind (oxidecomputer/product-assurance#1 and https://github.com/oxidecomputer/product-assurance/issues/17 come to mind) and migrating guests running such workloads. Migrating guests with these workloads in CI will be very valuable on its own, but we could also include tests for performance regressions building off of existing work (oxidecomputer/propolis#347 and oxidecomputer/propolis#324 more generally).

While I am focused on propolis live migration specifically, in building this feature we will lay the groundwork for tests that will surely serve the rest of our upstack software, including omicron (even running omicron-driven migrations, perhaps) and crucible.

As a minimum starting point, I think a motivating example is a test I ran a lot during the recent time-related live migration work: in a shell one-liner (a forever while loop), call a simple binary that reads the TSC, then sleep for a second; migrate the guest and check that the guest TSC looks right on the other side. This test is nice because it doesn't require anything in the way of networking, only a disk, a serial console, and some minimal post-processing. For an image, we could use a file-backed disk that lives on the lab NFS share.
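For concreteness, the in-guest side of that test is roughly the following (here `read-tsc` is a stand-in name for whatever small binary prints the current TSC value):

```sh
# runs in the guest over the serial console; `read-tsc` is a hypothetical
# binary that prints the current TSC reading on each invocation
while true; do
    read-tsc       # emit a TSC sample for post-processing on the host side
    sleep 1
done
```

Post-processing then just checks that the samples from before and after the migration line up the way we expect.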

As for buildomat specifics, @jclulow and I chatted about some ideas here last week. At a minimum, we will need:

I need to flesh out the mechanics of how we orchestrate the migration a bit more, and will continue adding to this ticket as I do research and prototyping. One could imagine a script on one of the hosts talking to the server API directly to coordinate setting up an instance (along with disks, vnics, etc.) and migrating it (a rough sketch of this is below). Since we already have PHD, though, it would be nice to leverage that where we can. I believe PHD does not support networking today, so we would need to add it. I am less certain how easy it would be to modify PHD to adopt a server-like model here to coordinate some of the mechanics, but I am aware that @gjcolombo has thoughts.
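As a very rough sketch of the "script talks to the server API" option (the addresses, endpoint paths, and payloads below are placeholders, not the real propolis-server interface):

```sh
#!/bin/bash
# Placeholder sketch: drive a migration by hand from one of the hosts.
# The two server addresses are assumed to be known up front, and the
# endpoint paths/payloads are illustrative only.
SOURCE=10.0.0.1:12400
TARGET=10.0.0.2:12400

# ensure an instance (plus disks, vnics, etc.) exists on the source
curl -X PUT "http://$SOURCE/instance" --data @instance-spec.json

# ask the target server to create its instance by migrating from the source
curl -X PUT "http://$TARGET/instance" \
    --data "{\"migrate_from\": \"$SOURCE\"}"
```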

gjcolombo commented 1 year ago

Some initial unedited thoughts:

I think it should be possible to refactor PHD's framework crate to do a lot of what we want here. Here is roughly how it works today:

Most of the problem here is that the VM factory and TestVm struct have baked in the notion that Propolis servers run on the same machine as the test runner. I think we could refactor that as follows:

Then, in CI, we would need something like the following:

After that, tests that want to create a remote Propolis just ask the factory for one. For example, to make the existing LM smoke test cross-machine, we substitute the call to new_vm_from_cloned_config with some other call (or add a parameter) that specifies that the VM should be created on a machine other than the one hosting the source VM; the factory and framework then reason about the location of the source and choose a target that meets the required constraints.

gjcolombo commented 1 year ago

That was a big brain dump that contains a lot of inside baseball in the Propolis repo. The key point in the above design from a Buildomat point of view is that, for this model to work, we need to be able to work out enough configuration a priori (like address/port allocations for the processes we want to talk to on the "secondary" machines in the job) to tell the test runner how to stitch together its own control plane from the disparate jobs running on each machine.
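Concretely (all of the names below are made up for illustration, and the runner invocation is schematic), the job program on the node that runs the test runner might just be handed the pre-allocated coordinates of the other node and pass them straight through:

```sh
# hypothetical: the address/port of the secondary node's propolis-server
# is allocated a priori and injected into the job environment by buildomat
SECONDARY_HOST=10.0.0.2
SECONDARY_PROPOLIS_PORT=12400

# the test runner is told up front where the remote pieces will live,
# rather than having to discover them itself (flag name is made up)
phd-runner run \
    --remote-propolis-server "$SECONDARY_HOST:$SECONDARY_PROPOLIS_PORT" \
    "$@"
```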

jclulow commented 1 year ago

Yes, that all seems quite reasonable. I agree buildomat should be able to provide a different job program for each node; e.g., the server could always run on node 1 and the client on node 2. I would like buildomat to be able to provide an API for the job to:

If another node experiences a fault, we could "poison" the baton or barrier so that a script blocked on it on the node that hasn't failed gets an error when it tries to acquire. This should probably be pretty easy to program in a shell script, whether for ping-pong between two programs or for coordinated arrival at various points.
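A sketch of how that might look from a job script, assuming a hypothetical barrier interface on the bmat control program (nothing like this exists yet):

```sh
# hypothetical interface: block until every node has reached the named
# barrier, exiting non-zero if another node has poisoned it
if ! bmat barrier wait setup-done; then
    echo "a peer node failed before setup-done; bailing out" >&2
    exit 1
fi

# ... and on a node whose setup has failed, something like this would
# poison the barrier so the waiter above errors out instead of hanging
bmat barrier poison setup-done
```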

Then your PHD stuff depends on buildomat for limited coordination during startup, node/network discovery, and the usual log output and output artefact collection, etc.

internet-diglett commented 1 year ago

If I may piggyback on this idea, networking would also love multi-node test targets for deploying omicron into an environment with two "scrimlets" running SoftNPU. This would essentially take what is currently launched in the Buildomat deploy task of Omicron, but cluster it with a second host that has a sled-agent running in scrimlet mode. This would allow us to exercise multi-switch logic from the control-plane all the way down to Dendrite / P4.

jclulow commented 1 year ago

As part of fixing #19, I have done some plumbing that should hopefully be helpful here as well. There is now a control program, bmat, that can be run inside the job. It currently provides access to a per-job key-value store (added in e727bf8ba2f00d12bc2b0cf0b3a3c88843d56a2e), which the GitHub layer then uses to communicate a token to the job after the job is running (6cd4797a814969f69872d0c34893eb8835d313f4). The control program has relatively easy access to make requests to the core API, using the access token held by the agent itself.

I think this will be a good base mechanism for providing both synchronisation primitives (e.g., barrier, baton, etc.) and node discovery from inside the running job.
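For example, node discovery could be layered on top of the key-value store roughly like this (the `bmat store` subcommands, their exit-status semantics, and the NODE_NAME/MY_IP variables are all assumptions here):

```sh
# each node publishes its own address under a well-known key ...
bmat store put "node/$NODE_NAME/ip" "$MY_IP"

# ... then polls until its peer has published one too (assuming `get`
# exits non-zero while the key is still absent)
while ! PEER_IP=$(bmat store get "node/peer/ip"); do
    sleep 1
done
echo "peer is reachable at $PEER_IP"
```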