oxidecomputer/omicron

Instance start/delete sagas hang while sleds are unreachable #4259

Closed: gjcolombo closed this issue 2 months ago

gjcolombo commented 11 months ago

Repro environment: Seen on rack3 after a sled trapped into kmdb and became inoperable.

When an instance starts, the start saga calls Nexus::create_instance_v2p_mappings to ensure that every sled in the cluster knows how to route traffic directed to the instance's virtual IPs. This function calls sled_list to get the list of active sleds and then invokes the sled agent's set_v2p endpoint on each one. The calls to set_v2p are wrapped in retry_until_known_result, which treats Progenitor communication errors (including client timeouts) as transient errors that require the operation to be retried. The retry is needed because a timed-out request may still have been processed: consider a request to do X that sled agent receives and begins processing but doesn't finish until after Nexus has stopped waiting. If that timeout produced an error that unwound the saga, X would never be undone, because a saga failure only unwinds the steps that previously completed successfully, not the step that produced the failure. Instance deletion does something similar via delete_instance_v2p_mappings.
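
To make the retry behavior concrete, here is a minimal, self-contained sketch of the retry-until-known-result idea described above. The error enum and the synchronous loop are simplified stand-ins for the real async client code and progenitor_client's error type, so treat the names here as illustrative only.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Simplified stand-in for the failures a Progenitor-generated client can
/// report (the real type is progenitor_client::Error; this is illustrative).
#[derive(Debug)]
enum ClientError {
    /// The request may or may not have reached the server (timeout, reset,
    /// ...), so the outcome is unknown.
    Communication(String),
    /// The server answered with an error status, so the outcome is known.
    ErrorResponse(u16),
}

/// Retry `op` until it yields a definitive outcome: success or an error
/// response from the server. Communication errors are treated as transient
/// because the request may have been processed even though no reply arrived,
/// and a saga failure would not undo the node that produced the failure.
fn retry_until_known_result<T>(
    mut op: impl FnMut() -> Result<T, ClientError>,
) -> Result<T, ClientError> {
    let mut delay = Duration::from_millis(250);
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(ClientError::Communication(_)) => {
                // Outcome unknown: back off and try again. Note there is no
                // upper bound on the number of attempts, which is exactly the
                // behavior this issue describes when a sled never comes back.
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            Err(e @ ClientError::ErrorResponse(_)) => return Err(e),
        }
    }
}

fn main() {
    // Toy usage: the "sled agent call" times out twice, then succeeds.
    let mut attempts = 0;
    let result = retry_until_known_result(|| {
        attempts += 1;
        if attempts < 3 {
            Err(ClientError::Communication("request timed out".into()))
        } else {
            Ok(())
        }
    });
    println!("succeeded after {attempts} attempts: {result:?}");
}
```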

Rack3 has a sled that keeps panicking with symptoms of a known host OS issue. To better identify the problem, we set the sled up to drop into kmdb on panic instead of rebooting, which rendered the sled's agent totally and permanently unresponsive. Since retry_until_known_result treats progenitor_client::Error::CommunicationError as transient, every subsequent instance creation and deletion attempt got stuck retrying the same V2P mapping edit against the sled being debugged, leaving the affected instances stuck in the Creating/Stopped states (soon to be the Starting/Stopped states once #4194 lands).

There are several things to unpack here (probably into their own issues):

morlandi7 commented 9 months ago

@internet-diglett Is there some alignment here with the other RPW work? (#4715)

internet-diglett commented 8 months ago

> @internet-diglett Is there some alignment here with the other RPW work? (#4715)

@morlandi7 at the moment, I don't believe so unless there has been a decision to move v2p mappings to an RPW based model. @gjcolombo have there been any discussions about this?

askfongjojo commented 8 months ago

@internet-diglett - Perhaps a useful fix for now is to modify the sled inclusion criteria to consider the time_deleted value. The change seems valid regardless of how we want to handle unresponsive sleds in general.

In situations like an OS panic or a sled-agent restart, we've seen in one customer's case that the saga was able to resume and complete once the problem sled came back up (not ideal, but also not too bad). There are cases in which a sled is out indefinitely, but we'll take the necessary time to solve those in other ways.
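
As an illustration of the inclusion-criteria change suggested above, here is a hedged sketch of filtering the sled list on time_deleted before fanning out V2P updates. The Sled struct and its fields are simplified stand-ins, not the actual Omicron database model.

```rust
use std::net::Ipv6Addr;
use std::time::SystemTime;

/// Simplified stand-in for a sled record; the fields are illustrative only.
struct Sled {
    id: u32,
    ip: Ipv6Addr,
    time_deleted: Option<SystemTime>,
}

/// Return only the sleds that should receive V2P updates: anything with
/// time_deleted set is excluded, so a removed sled can no longer wedge the
/// fan-out.
fn sleds_to_update(sleds: &[Sled]) -> Vec<&Sled> {
    sleds.iter().filter(|s| s.time_deleted.is_none()).collect()
}

fn main() {
    let sleds = vec![
        Sled { id: 23, ip: Ipv6Addr::LOCALHOST, time_deleted: Some(SystemTime::now()) },
        Sled { id: 8, ip: Ipv6Addr::LOCALHOST, time_deleted: None },
    ];
    for sled in sleds_to_update(&sleds) {
        println!("would call set_v2p on sled {} at {}", sled.id, sled.ip);
    }
}
```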

davepacheco commented 8 months ago

Is there anything in the system that sets time_deleted on a sled today? I wouldn't have thought so. I'd suggest we use the policy field proposed in RFD 457 instead. That's basically the same idea, and I think it's a good one, but it has the same problem: I don't think it would help in practice until we actually implement support for sled removal. It'd be tempting to use provision_state, but I don't think that's quite right, because a sled can still have instances on it even while provisioning new instances to it is disabled.


(The rest of this is probably repeating what folks already know but it took me a while to understand the discussion above so I'm summarizing here for myself and others that might be confused.)

I think it's important to distinguish three cases:

  1. sleds that are transiently unavailable,
  2. sleds that are unavailable for an extended period (say, more than a few minutes) but we don't know if they're coming back, and
  3. sleds that are permanently unavailable (meaning an operator has told us the sled is not coming back)

I can see that if a sled is unreachable for several minutes, we don't want start/stop to hang for every instance on that sled, and certainly not for all instances everywhere. But we also don't want to give up forever: the sled might still have instances on it, it might come back, and it may still need that v2p update, right? So I can see why we're asking about an RPW. I'm not that familiar with create_instance_v2p_mappings, but yeah, it sounds like an RPW may well be a better fit than a saga step. The RPW would do its best to update all the sleds it can reach, and if it can't reach some, no sweat -- it'll try again the next time the RPW is activated. And we can use the same pattern we use with other RPWs to report status (e.g., to omdb) about which sleds we've been able to keep updated with which set of v2p mappings.
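
To sketch what that might look like, here is a small, self-contained model of one activation of such an RPW: it tries every in-service sled, records which ones it could not reach, and returns without blocking on them. The types and function names are hypothetical; the real background-task machinery in Nexus is more involved.

```rust
use std::collections::BTreeMap;

/// Illustrative outcome of pushing the desired V2P mappings to one sled.
#[derive(Debug, Clone, Copy)]
enum PushResult {
    Updated,
    Unreachable,
}

/// One activation of a hypothetical reconciliation task: best-effort fan-out,
/// with unreachable sleds recorded (e.g. for omdb to display) rather than
/// retried in place.
fn reconcile_v2p(
    sleds: &[u32],
    push: impl Fn(u32) -> PushResult,
    status: &mut BTreeMap<u32, PushResult>,
) {
    for &sled in sleds {
        // An unreachable sled does not stop the rest of the fan-out, and no
        // instance start/delete saga is waiting on this loop.
        status.insert(sled, push(sled));
    }
}

fn main() {
    let sleds = [8, 14, 23];
    let mut status = BTreeMap::new();

    // Pretend sled 23 is wedged in kmdb; everything else accepts the update.
    let push = |sled: u32| {
        if sled == 23 { PushResult::Unreachable } else { PushResult::Updated }
    };

    // Two activations of the pseudo-RPW; a real one would be woken by a timer
    // or by whatever changes the desired set of mappings.
    for activation in 1..=2 {
        reconcile_v2p(&sleds, &push, &mut status);
        println!("after activation {activation}: {status:?}");
    }
}
```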

davepacheco commented 8 months ago

@gjcolombo would you object to retitling this Instance start/delete sagas hang while sleds are unreachable? (edit: confirmed no objection offline)

davepacheco commented 5 months ago

In today's update call, we discussed whether this was a blocker for R8. The conclusion is "no" because this should not be made any worse during sled expungement. The sled we plan to expunge in R8 is not running any instances and so should not need to have its v2p mappings updated as part of instance create/delete sagas. Beyond that, all instances are generally stopped before the maintenance window starts, and when they start again, the sled will be expunged and so not included in the list of sleds to update.

internet-diglett commented 3 months ago

@morlandi7 this should be resolved, but I've left it open until someone verifies that the work done in #5568 has actually resolved this issue on dogfood.

askfongjojo commented 2 months ago

Checked the current behavior on rack2: I put sled 23 into A2 and provisioned a bunch of instances. All of them stayed in the starting state (they didn't transition to running after the sled was brought back to A0; that's a different problem to be investigated).

According to https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/sagas/instance_start.rs#L61-L62, which in turn references #3879, it looks like fixing this requires one (hopefully small) lift.

internet-diglett commented 2 months ago

@askfongjojo I think that is an old comment that didn't get removed; that saga node has already been updated (through a series of function calls) to use the NAT RPW. Do you have the instance IDs or any other identifying information so I can check the logs to see what caused it to hang?
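
As a rough, hypothetical sketch of what "the saga node uses the RPW" amounts to: the node records the desired state (elided here) and activates a background task, so it returns quickly instead of waiting on each sled. The handle type below is made up for illustration and is not the actual Nexus background-task API.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical handle to a background task: the saga only nudges the task
/// and never waits for individual sleds to respond.
struct BackgroundTaskHandle {
    wake: Sender<()>,
}

impl BackgroundTaskHandle {
    fn activate(&self) {
        // Best effort: if the task is gone we have bigger problems, and the
        // saga still should not block on sled reachability.
        let _ = self.wake.send(());
    }
}

/// Sketch of the saga node once the per-sled fan-out lives in a background
/// task: poke the task and return.
fn saga_node_v2p_ensure(task: &BackgroundTaskHandle) -> Result<(), String> {
    task.activate();
    Ok(())
}

fn main() {
    let (wake, woken) = channel::<()>();
    let handle = BackgroundTaskHandle { wake };

    saga_node_v2p_ensure(&handle).unwrap();
    println!("pending activations: {}", woken.try_iter().count());
}
```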

askfongjojo commented 2 months ago

Ah, you are right. I retested just now and had no problem bringing up instances while one of the sleds was offline. I probably ran into an issue related to some bad downstairs when I tested that last time. This time I'm testing with a brand new disk snapshot to avoid hitting the bad downstairs problem again.