oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

decide whether each API should be server- or client-side-versioned #7138

Open davepacheco opened 15 hours ago

davepacheco commented 15 hours ago

This PR:

Copying a summary from the README, after this PR:

  • All Dropshot/Progenitor APIs in the system are identified with this tool (as far as we know).
  • Every API is annotated with metadata in api-manifest.toml.
  • This metadata specifies whether versioning for the API is managed on the server side exclusively or using client-side versioning as well. We've made this choice for every existing API.
  • There are no cycles in the server-side-managed-API dependency graph.

This tool verifies these properties.

If I haven't messed this up too badly, this is a pretty big step because:

If you add a new API or a new API dependency without adding corresponding metadata, or if that metadata breaks these constraints, you will have to fix this up before you can land the change to Omicron.

I hope this will help keep us from accidentally introducing new circular dependencies.

Goals

My main goals here, in order:

How I made the choice for each API

Where possible, we want APIs to be server-side-versioned only so that clients don't have to think about this. See the deck I linked above for more on this.

Initially, I had all APIs marked as versioned_how = "unknown", built the check in dag-check that the "server-side" graph be acyclic, and then built some facilities in dag-check for it to propose that an API be client- or server-managed based on heuristics like "if this API has no deployed consumers, then it can be server-side only" and "if two components depend directly on each other, one of them must be client-side-managed".

I also started with the assumption that we'll want APIs that Nexus talks to to be server-side-versioned. That's because control generally flows from Nexus to the other services. It'd be pretty painful if Nexus had to have multiple client versions for every service it talked to. Relatedly, the assumption is that the APIs from Nexus -> other services are generally more complex and volatile than in the other direction.

With this, I iteratively looked at the proposals, checked the visual DAG (from cargo xtask ls-apis servers --output-format=dot) and made some judgment calls.

The results

On success, the tool prints out a summary of how our APIs are versioned:

$ cargo xtask ls-apis dag-check
...
Server-managed APIs:

    Clickhouse Cluster Admin for Keepers (clickhouse-admin-keeper-client, exposed by omicron-clickhouse-admin)
    Clickhouse Cluster Admin for Servers (clickhouse-admin-server-client, exposed by omicron-clickhouse-admin)
    Clickhouse Single-Node Cluster Admin (clickhouse-admin-single-client, exposed by omicron-clickhouse-admin)
    CockroachDB Cluster Admin (cockroach-admin-client, exposed by omicron-cockroach-admin)
    Crucible Agent (crucible-agent-client, exposed by crucible-agent)
    Crucible Control (for testing only) (crucible-control-client, exposed by propolis-server)
    Crucible Pantry (crucible-pantry-client, exposed by crucible-pantry)
    Maghemite DDM Admin (ddm-admin-client, exposed by ddmd)
    DNS Server (dns-service-client, exposed by dns-server)
    Management Gateway Service (gateway-client, exposed by omicron-gateway)
    Maghemite MG Admin (mg-admin-client, exposed by mgd)
    External API (oxide-client, exposed by omicron-nexus)
    Oximeter (oximeter-client, exposed by oximeter-collector)
    Propolis (propolis-client, exposed by propolis-server)
    Sled Agent (sled-agent-client, exposed by omicron-sled-agent)
    Wicketd (wicketd-client, exposed by wicketd)

Client-managed API:

    Bootstrap Agent (bootstrap-agent-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)
    Dendrite DPD (dpd-client, exposed by dpd)
        reason: Sled Agent calling DPD creates a circular dependency
    Wicketd Installinator (installinator-client, exposed by wicketd)
        reason: client is provided implicitly by the operator
    Nexus Internal API (nexus-client, exposed by omicron-nexus)
        reason: Circular dependencies between Nexus and other services
    Crucible Repair (repair-client, exposed by crucible-downstairs)
        reason: depends on itself (i.e., instances call each other)
    Repo Depot API (repo-depot-client, exposed by omicron-sled-agent)
        reason: depends on itself (i.e., instances call each other)

APIs with unknown version management: none

In terms of the client-side-managed APIs:

So all of that seems pretty reasonable and everything else was able to be server-side-managed.

Out of curiosity, I did this quick check to see how volatile each API was: in each of the Omicron, Dendrite, Maghemite, Propolis, and Crucible repos, I ran this command to count how many times each of the APIs has been changed in any way in 2024 (almost 11 months):

$ for x in openapi/*.json; do count="$(git log --follow --pretty=%ai $x | grep -c ^2024)"; printf "%3d %s\n" count $x; done

Putting the results from all repos together and sorting them:

  3 openapi/cockroach-admin.json
  3 openapi/crucible-control.json
  3 openapi/crucible-pantry.json
  3 openapi/ddm-admin.json
  3 openapi/dns-server.json
  3 openapi/dsc-control.json
  3 openapi/installinator.json
  3 openapi/repo-depot.json
  4 openapi/clickhouse-admin-server.json
  4 openapi/clickhouse-admin-single.json
  4 openapi/crucible-agent.json
  6 openapi/downstairs-repair.json
  7 openapi/gateway.json
  7 openapi/oximeter.json
 10 openapi/mg-admin.json
 11 openapi/dpd.json
 11 openapi/propolis-server.json
 12 openapi/clickhouse-admin-keeper.json
 22 openapi/bootstrap-agent.json
 24 openapi/wicketd.json
 62 openapi/sled-agent.json
 83 openapi/nexus-internal.json
 87 openapi/nexus.json

The volatility of the Nexus-internal API, which is unfortunately client-side-managed, makes one wonder if we can rework things so that fewer things need to invoke it (e.g., use long polls in the other direction, or use much simpler APIs that just cause Nexus to reach back out to the other service for details? or in the case of Oximeter, eliminate the use of the API altogether as we've discussed?).

Caveats

This is still a little messy. For example, the metadata has ways to mark both an edge (dependency) and a particular API as client-side or server-side managed.

That's because the tool already supported the edge annotations, but while doing this, I decided it would be way too tedious to figure this out if I had to annotate every edge, when for the vast majority of APIs it's going to be one or the other. We should clean that up in the long term but I think it's a fair bit of work and doesn't help de-risk this plan.

I think there are a few APIs where we might want this choice to be per-edge, not per-API. For example: DPD's API is marked client-managed because of its circular dependency with Sled Agent. That may be unavoidable between those two because there's no way to update DPD before all sled agents, nor to update all sled agents before updating DPD (since both are contained in the host OS deployment unit). However, it should be easy to make sure Nexus (and maybe other DPD consumers) are updated after DPD. And that'd be very valuable. So we may want to say that some of these edges are client-side-managed while some of the others are server-side-managed.