Extends the existing Dropshot/Progenitor API metadata to identify for each API whether it will be versioned on the server-side exclusively or via client-side versioning as well. (I need to write an RFD on this but for a summary of what I mean see this deck).
Adds a new cargo xtask ls-apis dag-check tool that:
checks the validity of the versioning choices we've made in the metadata (more on this below)
proposes changes (this may never be useful again but it was very helpful for me to make these choices and I thought it'd be worth keeping the code around)
updates the cargo xtask ls-apis adoc command to include this information in the summary table. The result is that the table in our wiki will include this information. See oxidecomputer/meta#524 for what this looks like. (I will land that PR after this one lands.)
Copying a summary from the README, after this PR:
All Dropshot/Progenitor APIs in the system are identified with this tool (as far as we know).
Every API is annotated with metadata in api-manifest.toml.
This metadata specifies whether versioning for the API is managed on the server side exclusively or using client-side versioning as well. We've made this choice for every existing API.
There are no cycles in the server-side-managed-API dependency graph.
This tool verifies these properties.
If I haven't messed this up too badly, this is a pretty big step because:
If you add a new API or a new API dependency without adding corresponding metadata, or if that metadata breaks these constraints, you will have to fix this up before you can land the change to Omicron.
I hope this will help keep us from accidentally introducing new circular dependencies.
Goals
My main goals here, in order:
De-risk this whole API versioning plan by figuring out if there is a plausible update DAG. This involves making the decisions about what components (or dependencies) are in the DAG.
Record these decisions durably in the metadata.
Include the tooling that was used to help make these decisions.
Have the tooling verify that the choices are all valid (i.e., that we didn't create an update graph with cycles).
How I made the choice for each API
Where possible, we want APIs to be server-side-versioned only so that clients don't have to think about this. See the deck I linked above for more on this.
Initially, I had all APIs marked as versioned_how = "unknown", built the check in dag-check that the "server-side" graph be acyclic, and then built some facilities in dag-check for it to propose that an API be client- or server-managed based on heuristics like "if this API has no deployed consumers, then it can be server-side only" and "if two components depend directly on each other, one of them must be client-side-managed".
I also started with the assumption that we'll want APIs that Nexus talks to to be server-side-versioned. That's because control generally flows from Nexus to the other services. It'd be pretty painful if Nexus had to have multiple client versions for every service it talked to. Relatedly, the assumption is that the APIs from Nexus -> other services are generally more complex and volatile than in the other direction.
With this, I iteratively looked at the proposals, checked the visual DAG (from cargo xtask ls-apis servers --output-format=dot) and made some judgment calls.
The results
On success, the tool prints out a summary of how our APIs are versioned:
$ cargo xtask ls-apis dag-check
...
Server-managed APIs:
Clickhouse Cluster Admin for Keepers (clickhouse-admin-keeper-client, exposed by omicron-clickhouse-admin)
Clickhouse Cluster Admin for Servers (clickhouse-admin-server-client, exposed by omicron-clickhouse-admin)
Clickhouse Single-Node Cluster Admin (clickhouse-admin-single-client, exposed by omicron-clickhouse-admin)
CockroachDB Cluster Admin (cockroach-admin-client, exposed by omicron-cockroach-admin)
Crucible Agent (crucible-agent-client, exposed by crucible-agent)
Crucible Control (for testing only) (crucible-control-client, exposed by propolis-server)
Crucible Pantry (crucible-pantry-client, exposed by crucible-pantry)
Maghemite DDM Admin (ddm-admin-client, exposed by ddmd)
DNS Server (dns-service-client, exposed by dns-server)
Management Gateway Service (gateway-client, exposed by omicron-gateway)
Maghemite MG Admin (mg-admin-client, exposed by mgd)
External API (oxide-client, exposed by omicron-nexus)
Oximeter (oximeter-client, exposed by oximeter-collector)
Propolis (propolis-client, exposed by propolis-server)
Sled Agent (sled-agent-client, exposed by omicron-sled-agent)
Wicketd (wicketd-client, exposed by wicketd)
Client-managed API:
Bootstrap Agent (bootstrap-agent-client, exposed by omicron-sled-agent)
reason: depends on itself (i.e., instances call each other)
Dendrite DPD (dpd-client, exposed by dpd)
reason: Sled Agent calling DPD creates a circular dependency
Wicketd Installinator (installinator-client, exposed by wicketd)
reason: client is provided implicitly by the operator
Nexus Internal API (nexus-client, exposed by omicron-nexus)
reason: Circular dependencies between Nexus and other services
Crucible Repair (repair-client, exposed by crucible-downstairs)
reason: depends on itself (i.e., instances call each other)
Repo Depot API (repo-depot-client, exposed by omicron-sled-agent)
reason: depends on itself (i.e., instances call each other)
APIs with unknown version management: none
In terms of the client-side-managed APIs:
Bootstrap Agent, Crucible Repair, and Repo Depot: the component exposing these APIs necessarily talk to each other as peers, so they're necessarily circular and must be client-managed
I marked Installinator API as client-managed because the only client is Installinator itself, and that's actually provided at runtime by the operator. We don't control that. If we were ever to rev this API, we'd need a way to validate that the Installinator in the Recovery OS image in the TUF repo that the operator uploaded to Wicket is compatible with the Installinator API in Wicket. In practice we're very unlikely to rev this any time soon.
DPD currently has a circular dependency with Sled Agent so one of them has to be client-side-managed. I chose DPD rather than sled agent because the Sled Agent API is pretty complex and changes quite a lot.
The Nexus internal API is circular with several other APIs and we've long said we wanted this direction to be client-side managed (explained above).
So all of that seems pretty reasonable and everything else was able to be server-side-managed.
Out of curiosity, I did this quick check to see how volatile each API was: in each of the Omicron, Dendrite, Maghemite, Propolis, and Crucible repos, I ran this command to count how many times each of the APIs has been changed in any way in 2024 (almost 11 months):
$ for x in openapi/*.json; do count="$(git log --follow --pretty=%ai $x | grep -c ^2024)"; printf "%3d %s\n" count $x; done
Putting the results from all repos together and sorting them:
The volatility of the Nexus-internal API, which is unfortunately client-side-managed, makes one wonder if we can rework things so that fewer things need to invoke it (e.g., use long polls in the other direction, or use much simpler APIs that just cause Nexus to reach back out to the other service for details? or in the case of Oximeter, eliminate the use of the API altogether as we've discussed?).
Caveats
This is still a little messy. For example, the metadata has ways to mark both an edge (dependency) and a particular API as client-side or server-side managed.
That's because the tool already supported the edge annotations, but while doing this, I decided it would be way too tedious to figure this out if I had to annotate every edge, when for the vast majority of APIs it's going to be one or the other. We should clean that up in the long term but I think it's a fair bit of work and doesn't help de-risk this plan.
I think there are a few APIs where we might want this choice to be per-edge, not per-API. For example: DPD's API is marked client-managed because of its circular dependency with Sled Agent. That may be unavoidable between those two because there's no way to update DPD before all sled agents, nor to update all sled agents before updating DPD (since both are contained in the host OS deployment unit). However, it should be easy to make sure Nexus (and maybe other DPD consumers) are updated after DPD. And that'd be very valuable. So we may want to say that some of these edges are client-side-managed while some of the others are server-side-managed.
This PR:
cargo xtask ls-apis dag-check
tool that:cargo xtask ls-apis adoc
command to include this information in the summary table. The result is that the table in our wiki will include this information. See oxidecomputer/meta#524 for what this looks like. (I will land that PR after this one lands.)Copying a summary from the README, after this PR:
If I haven't messed this up too badly, this is a pretty big step because:
I hope this will help keep us from accidentally introducing new circular dependencies.
Goals
My main goals here, in order:
How I made the choice for each API
Where possible, we want APIs to be server-side-versioned only so that clients don't have to think about this. See the deck I linked above for more on this.
Initially, I had all APIs marked as
versioned_how = "unknown"
, built the check indag-check
that the "server-side" graph be acyclic, and then built some facilities indag-check
for it to propose that an API be client- or server-managed based on heuristics like "if this API has no deployed consumers, then it can be server-side only" and "if two components depend directly on each other, one of them must be client-side-managed".I also started with the assumption that we'll want APIs that Nexus talks to to be server-side-versioned. That's because control generally flows from Nexus to the other services. It'd be pretty painful if Nexus had to have multiple client versions for every service it talked to. Relatedly, the assumption is that the APIs from Nexus -> other services are generally more complex and volatile than in the other direction.
With this, I iteratively looked at the proposals, checked the visual DAG (from
cargo xtask ls-apis servers --output-format=dot
) and made some judgment calls.The results
On success, the tool prints out a summary of how our APIs are versioned:
In terms of the client-side-managed APIs:
So all of that seems pretty reasonable and everything else was able to be server-side-managed.
Out of curiosity, I did this quick check to see how volatile each API was: in each of the Omicron, Dendrite, Maghemite, Propolis, and Crucible repos, I ran this command to count how many times each of the APIs has been changed in any way in 2024 (almost 11 months):
Putting the results from all repos together and sorting them:
The volatility of the Nexus-internal API, which is unfortunately client-side-managed, makes one wonder if we can rework things so that fewer things need to invoke it (e.g., use long polls in the other direction, or use much simpler APIs that just cause Nexus to reach back out to the other service for details? or in the case of Oximeter, eliminate the use of the API altogether as we've discussed?).
Caveats
This is still a little messy. For example, the metadata has ways to mark both an edge (dependency) and a particular API as client-side or server-side managed.
That's because the tool already supported the edge annotations, but while doing this, I decided it would be way too tedious to figure this out if I had to annotate every edge, when for the vast majority of APIs it's going to be one or the other. We should clean that up in the long term but I think it's a fair bit of work and doesn't help de-risk this plan.
I think there are a few APIs where we might want this choice to be per-edge, not per-API. For example: DPD's API is marked client-managed because of its circular dependency with Sled Agent. That may be unavoidable between those two because there's no way to update DPD before all sled agents, nor to update all sled agents before updating DPD (since both are contained in the host OS deployment unit). However, it should be easy to make sure Nexus (and maybe other DPD consumers) are updated after DPD. And that'd be very valuable. So we may want to say that some of these edges are client-side-managed while some of the others are server-side-managed.