sled-agent: Should `PUT /omicron-zones` fail if it cannot remove a zone?

When sled-agent receives the set of zones it should be running, it attempts to remove any zones that are currently running and are not part of the new set, but failure to remove such zones are only logged and do not affect the result of the API request: https://github.com/oxidecomputer/omicron/blob/13d411f79050703871f4181f5811db8813eb190f/sled-agent/src/services.rs#L3547-L3550

Additionally, sled-agent only attempts to remove a zone exactly once. The first thing zone_bundle_and_try_remove does is remove the zone from the in-memory list: https://github.com/oxidecomputer/omicron/blob/13d411f79050703871f4181f5811db8813eb190f/sled-agent/src/services.rs#L3557-L3564

It then tries to shut down the zone, logs any errors, and (later, after trying to start any new zones), records that in-memory list back into the ledger. That means we have this flow, where even if Nexus resends the PUT sled-agent won't retry the removal:

sled-agent is currently on omicron-zones generation N
Nexus sends a PUT with generation N + 1 that removes one zone
- sled-agent tries to shut down that one zone, but proceeds regardless of success
- sled-agent records that it is now on generation N + 1
Nexus sends another PUT request with generation N + 1; sled-agent thinks it's already there, so does nothing

@sunshowers and I were chatting in the context of reconfigurator / zone cleanup, and we think it may be important to fail this request if any zone removals fail. A motivating example is removing an expunged Nexus and reconfigurator deciding whether any sagas assigned to that Nexus have been reassigned. Reconfigurator needs to check two things:

Is there a guarantee that no new saga assignments will be claimed by the expunged Nexus?
Have any sagas assigned to the expunged Nexus been reassigned?

Critically: seeing that there are no sagas assigned to the expunged Nexus (item 2) alone is insufficient, since it's inherently racy if new assignments could still be made. It must first ensure item 1.

The working plan for reconfigurator is that it will base "has a zone actually been expunged" by inspecting the regularly-polled inventory. sled-agent reports its current omicron-zones config to inventory, but based on the above, this might be a lie if zone removal failed. "Zone expunged and inventory reports zone is gone" should be enough to satisfy item 1 above for Nexus and saga assignment, but isn't if sled-agent lies: the Nexus that sled-agent claims isn't running but actually is could continue to claim new sagas.

The sled-agent behavior was clearly intentional, and I haven't tried to dig into the history to figure out why. Can we change it to fail the PUT (and, critically, not update its ledger) if zone removal fails, instead of only logging?

I think that as long as the zone failing to shutdown doesn't leave any in-memory state around that contradicts that the zone is still running that this should be fine. In other words, the shutdown and state management of the shutdown should be as atomic as possible. If the write to the ledger fails to happen this is still fine because the next update from nexus will hopefully cause it to succeed. If the zone is already down the shutdown attempt should succeed idempotently.

I think the only problem we have to worry about in these cases is if the zone never shuts down or the writes to the ledger continuously fail. In that case we have a problem with the sled altogether.

I agree this is a problem but I think it would be better not to fail the request, but to instead implement APIs such that:

PUT /omicron-zones records the new intended state of things, stores that to disk; sled agent then continually tries to make it reality
GET /omicron-zones (or GET /inventory after #6770) reports the last state that's been realized

We discussed a similar approach under #5086.

You could say that the reconciler loop already exists in Nexus, so why not let this request fail and let Nexus keep trying to get it to the state it's trying to make it. But if you assume this sort of runtime error could come up at any point during zone creation or removal, I don't see how this converges. What if you've successfully removed one zone, then fail to remove another? You're neither in the new state nor the old state.

Underlying this is my assumption that we generally should be able to identify most handleable errors before we make any changes. For example, if the zone config instructs the sled agent to create a dataset on a pool that does not exist, that's something we can check for before we accept the request. The other kinds of problems seem both very unlikely and also arbitrarily bad -- e.g., ENOMEM trying to fork, or some logic bug that causes sled agent to try to zoneadm destroy a zone that doesn't exist. In those cases it's hard to imagine that rolling backwards is any more likely to work than rolling forwards; hence, let's just keep trying to roll forwards until it works (or an operator comes in to do something else).

I think this has long been a theoretical problem -- is there a case that came up in practice where we ran into an error in these paths and wound up barreling on in the wrong state?

I agree this is a problem but I think it would be better not to fail the request, but to instead implement APIs such that:

...

We discussed a similar approach under https://github.com/oxidecomputer/omicron/issues/5086.

Ooh, thanks, I didn't remember 5086. That plan does sound better.

I think this has long been a theoretical problem -- is there a case that came up in practice where we ran into an error in these paths and wound up barreling on in the wrong state?

No, still theoretical - Rain and I were talking about what guarantees the planner has about the current state of the system via the PlanningInput, and it seems to be lacking in some important ones (all variants of "our parent blueprint, which is what we make most planning decisions from, may not have been realized yet (or may never be)"). This was one of them, so I wrote it up while it was fresh.

oxidecomputer / omicron

sled-agent: Should `PUT /omicron-zones` fail if it cannot remove a zone? #6776