oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
250 stars 39 forks source link

[API] `add_sled_to_initialized_rack` and `uninitialized_sled_list` need some work #4607

Closed ahl closed 10 months ago

ahl commented 10 months ago

Here are the current system/hardware operations:

API operations found with tag "system/hardware"   
OPERATION ID                             METHOD   URL PATH
add_sled_to_initialized_rack             POST     /v1/system/hardware/sleds
networking_switch_port_apply_settings    POST     /v1/system/hardware/switch-port/{port}/settings
networking_switch_port_clear_settings    DELETE   /v1/system/hardware/switch-port/{port}/settings
networking_switch_port_list              GET      /v1/system/hardware/switch-port
physical_disk_list                       GET      /v1/system/hardware/disks
rack_list                                GET      /v1/system/hardware/racks
rack_view                                GET      /v1/system/hardware/racks/{rack_id}
sled_instance_list                       GET      /v1/system/hardware/sleds/{sled_id}/instances
sled_list                                GET      /v1/system/hardware/sleds
sled_physical_disk_list                  GET      /v1/system/hardware/sleds/{sled_id}/disks
sled_set_provision_state                 PUT      /v1/system/hardware/sleds/{sled_id}/provision-state
sled_view                                GET      /v1/system/hardware/sleds/{sled_id}
switch_list                              GET      /v1/system/hardware/switches
switch_view                              GET      /v1/system/hardware/switches/{switch_id}
uninitialized_sled_list                  GET      /v1/system/hardware/uninitialized-sleds

1. I think that uninitialized_sled_list may be a little out of place (but it might be fine). One option is to have a query param to sled_list that determines whether all sleds, initialized sleds, uninitialized sleds, both/all/whatever are listed. Another is the a-grammatical sled_uninitialized_list.

2. The operation returns an array of items; it should probably be paginated.

3. add_sled_to_initialized_rack is similarly different. I'd suggest sled_add_uninitialized or perhaps simply sled_add. Note also that this operation is /v1/system/hardware/sleds; would it make sense for uninitialized_sled_list to have a more consistent path for consistency with this operation on uninitialized sleds?

david-crespo commented 10 months ago
  1. I like query param on sled_list if it really is just a filter on that list
  2. Yes, or at least fake-paginated, as in, a ResultsPage where next_page or whatever it's called is always null
  3. I am in favor of sled_add.
zephraph commented 10 months ago

I agree w/ everything @david-crespo said

  1. By default I'd make sled_list return all sleds with an optional filter that if specified returns only initialized or not initialized.
  2. Yes for pagination, but I'd do it for real
  3. sled_add with detailed docs sounds much better

Worth considering if sled_set_provision_state follows the pattern too. Usually the action comes at the end so I'd expect it to be sled_provision_state_set (if set is the right verb).

ahl commented 10 months ago

Adding to this:

I believe that in order to add an uninitialized sled, one needs to specify:

I guess it's not that bad because this is the output from uninitialized_sled_list, but should all be strictly required?

Do we need all of these things? for the sled_view operation we only need a sled ID. While we don't have that yet, do we really need all of those fields to uniquely identify the sled?

davepacheco commented 10 months ago

Thanks for looking at this. It makes sense to me.

  1. I think we're suffering a bit because we have "sled" the hardware device and "sled" the software thing that the control plane knows about. By calling the second thing a sled, we have [hardware] sleds with no [software] sled id etc and that makes this whole area a little awkward. "Uninitialized sleds" aren't "sleds with some null value for some field" but a different thing altogether. This affects (1) because I'm not sure how easy it is to have a sled_list that lists both, at least not without significantly changing it. I guess it could list the latest inventory collection instead, and "join" against the sled table if asking for initialized sleds? Might we ever want different payloads for these things?

  2. I discussed pagination with Andrew on the original PR and we felt at the time that paginating for real would be quite a lot more work and that would not be a problem for quite a while. Fake pagination seems fine. I imagine we'll want to cleanly handle the case where we are given a page token, even though we'd never have issued one. We could just return a 400.

  3. sled_add seems fine.

Worth considering if sled_set_provision_state follows the pattern too. Usually the action comes at the end so I'd expect it to be sled_provision_state_set (if set is the right verb).

I think I generally use subject_verb_object (like project_instance_create), not verb-last. I'm not sure what we've done in most places. I think we haven't written down the style here so I imagine folks may have inferred different things.

davepacheco commented 10 months ago

I believe that in order to add an uninitialized sled, one needs to specify:

* baseboard

  * part
  * revision
  * serial number

* rack ID

* cubby integer

I guess it's not that bad because this is the output from uninitialized_sled_list, but should all be strictly required?

Do we need all of these things? for the sled_view operation we only need a sled ID. While we don't have that yet, do we really need all of those fields to uniquely identify the sled?

I believe that either of these two would uniquely identify the sled that the user wants to add:

I think I would pick the baseboard part/serial because in principle that's what an operator has verified against some inventory sheet or something.

Are you figuring that you'd change this API to accept a new params::UninitializedSledId type (different from views::UninitializedSled) and that could have only the baseboard part and serial number? I think that'd work.

In terms of of the implementation: we do use the rack id in Nexus::add_sled_to_initialized_rack in order to get the rack subnet so that we can allocate an IP subnet for this sled. Currently, we're getting the rack id from the client, who got it from uninitialized_sled_list, which just uses self.rack_id (on the Nexus object). If we do the above (leave rack id out of the arguments to this API) we use self.rack_id inside add_sled_to_initialized_rack as well. When we actually do have support for multiple racks, the inventory system will know which rack id each baseboard is part of, and we'll be able to get the rack id that way.

ahl commented 10 months ago

1. the sled_list I proposed isn't the right approach. In the CLI I made this oxide system hardware sled list-uninitialized so maybe that argues for sled_list_uninitialized. I don't know that there's a great answer here.

2. I think we just do dummy pagination as discussed above. The benefits of this are A. SDK generation B. CLI generation C. maybe some future-proofing?

3. add_sled_to_initialized_rack -> sled_add

4. I think Dave proposed using part and serial number as a 2-tuple to identify the sled we want to add. That sounds good to me.

andrewjstone commented 10 months ago

the sled_list I proposed isn't the right approach. In the CLI I made this oxide system hardware sled list-uninitialized so maybe that argues for sled_list_uninitialized. I don't know that there's a great answer here.

I think we just do dummy pagination as discussed above. The benefits of this are A. SDK generation B. CLI generation C. maybe some future-proofing?

add_sled_to_initialized_rack -> sled_add

I think Dave proposed using part and serial number as a 2-tuple to identify the sled we want to add. That sounds good to me.

Thanks for working through all the details. I'm going to start on this right meow.

andrewjstone commented 10 months ago

@davepacheco So far in our API, we always take the revision in addition to serial and part number, since we use Baseboard in UninitializedSled as well as Sled. Furthermore, sled-agent expects revision because that's what is defined in sled_hardware::Baseboard. The only place we don't use revision is in the inventory BaseboardId. I recognize why this is and it makes sense. However, it seems silly to me to define yet another API type and then look up the revision from the SP in the latest inventory collection by grabbing the inventory rather than just passing the Baseboard directly as returned from listing uninitialized sleds. It would be pretty odd for the revision to change in between listing and adding a sled 😄.

Does this work for you?

davepacheco commented 10 months ago

I'm having trouble keeping all this straight so tell me if this summary is right.

What I understood from @rmustacc was that:

And as you said:

And you're saying (/I think I remember):

Putting this all together:

Is that all correct?

I don't think there's any practical problem with doing what you suggest. But I think this goes back to @ahl's earlier observation that it feels a little weird to include unnecessary information when we just want an identifier for the sled. It seems similar to me as if we included the number of hardware threads or DIMM slots here (which I could also imagine adding to the Sled and UninitializedSled views). If it's just a question of adding an extra type (which sounds like more precisely the right thing anyway) and an extra function call to fetch the inventory, I think I'd just do that? I'd probably also see if the sled agent API really needs the revision, too, or if we could get rid of it there. I don't feel strongly about this. @ahl?

ahl commented 10 months ago

What's the workflow for adding a new sled? In the console, who cares: users are going to click click around and they don't care about the API. What about the CLI? I think we're imagining something like this:

$ oxide sled list-uninitialized
...
<find the one I want>
$ oxide sled add <parameters on based on the whole or subset of the info above>

If that's the case, let's find some relatively minimal amount of data the user needs to copy and paste (some goldilocks zone of enough information to be right and be fat-finger-tolerant and not too much redundant information to be annoying).

However, it seems silly to me to define yet another API type and then look up the revision from the SP in the latest inventory collection by grabbing the inventory rather than just passing the Baseboard directly as returned from listing uninitialized sleds.

The added complexity of a new type seems less important than reducing complexity to CLI users. I don't think we're going to see a lot of customer- or third-party-built API automation around this particular workflow. This is a highly revisable decision; it's fine to just do something and figure out something better later (e.g. maybe operators will want to slide in a sled, and then just oxide sled add --part xyz --serial abc because they're in the DC and wrote them down; or oxide sled add --rack-id xyz --cubby 12 because they're looking at the rack and the new part is right there in cubby 12!).

andrewjstone commented 10 months ago

@davepacheco Everything you wrote was correct. Except I wasn't arguing to pass UninitializedSled to add-sled, but instead to use the Baseboard type that already exists. So the only "extra" thing we'd pass in is revision. I implemented this in https://github.com/oxidecomputer/omicron/pull/4704.

I'm fine changing the implementation to do what we said above and add a new type UninitializedSledId that doesn't include the revision as that probably does make the most sense for the API. I think @ahl makes a great point about simplifying the CLI this way.

Unfortunately, changing the sled-agent and probably also elsewhere in Nexus where we use Baseboard to identify a sled is a larger change. Sled-agent actually persists this information in multiple places. Now it's not that hard to deserialize into a BaseboardId, but there is other information as well such as whether the sled is a gimlet or PC.

I also believe this may have ramifications for trust quorum. For example, if we add a sled with revision 5, and then it gets reworked, should it remain in the trust quorum even though it has been modified? Now, I realize this can be worked around by removing the sled from the trust quorum and then having to add it back in later. Maybe this is fine? What do you think @rmustacc ?

In any case, I'll go ahead and add the new type as a nexus param for this API, as that seems to be the consensus and that's what we're worried about right now.

andrewjstone commented 10 months ago

Completed with https://github.com/oxidecomputer/omicron/pull/4704