Closed · gjcolombo closed 2 months ago
@internet-diglett Is there some alignment here with the other RPW work? (#4715 )
@morlandi7 at the moment, I don't believe so unless there has been a decision to move v2p mappings to an RPW based model. @gjcolombo have there been any discussions about this?
@internet-diglett - Perhaps a useful fix for now is to modify the sled inclusion criteria to consider the `time_deleted` value. The change seems valid regardless of how we want to handle unresponsive sleds in general.
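The suggested inclusion check could look something like the sketch below. This is not Omicron's actual datastore code (the real `sled_list` is a Diesel query); the `Sled` struct and `active_sleds` function here are hypothetical stand-ins that model a `WHERE time_deleted IS NULL` filter in plain Rust.

```rust
use std::time::SystemTime;

// Hypothetical, simplified stand-in for a row in the sleds table.
struct Sled {
    id: u32,
    // Set when the sled is soft-deleted; `None` means the sled is active.
    time_deleted: Option<SystemTime>,
}

/// Return the ids of sleds that have not been soft-deleted, mirroring a
/// `WHERE time_deleted IS NULL` filter in the datastore query.
fn active_sleds(sleds: &[Sled]) -> Vec<u32> {
    sleds
        .iter()
        .filter(|s| s.time_deleted.is_none())
        .map(|s| s.id)
        .collect()
}
```

With this filter in place, a deleted sled simply drops out of the list that the v2p-mapping code iterates over.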
In situations like OS panic or sled-agent restart, we've seen in one customer's case that the saga was able to resume/complete once the problem sled came back up (not ideal but also not too bad). There are cases in which the sleds are out indefinitely but we'll take the necessary time to solve them through other ways.
Is there anything in the system that sets `time_deleted` on a sled today? I wouldn't have thought so. I'd suggest we use the `policy` field proposed in RFD 457 instead. That's basically the same idea, and I think it's a good one, but it has the same problem: I don't think it would help in practice until we actually implement support for sled removal. It'd be tempting to use `provision_state`, but I don't think that's quite right, because there might still be instances on a sled whose new-instance provisioning is currently disabled.
(The rest of this is probably repeating what folks already know but it took me a while to understand the discussion above so I'm summarizing here for myself and others that might be confused.)
I think it's important to distinguish three cases:
I can see how, if a sled is unreachable for several minutes, we don't want all instance start/stop for instances on that sled to hang, and certainly not start/stop for all instances. But we also don't want to give up forever. The sled might still have instances on it, it might come back, and it may need that v2p update, right? So I can see why we're asking about an RPW. I'm not that familiar with `create_instance_v2p_mappings`, but yeah, it sounds like an RPW may well be a better fit than a saga step. The RPW would do its best to update all the sleds that it can. But if it can't reach some, no sweat: it'll try again the next time the RPW is activated. And we can use the same pattern we use with other RPWs to report status (e.g., to omdb) about which sleds we've been able to keep updated with which set of v2p mappings.
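The RPW pattern described above could be sketched roughly as follows. All names here (`SledId`, `V2pRpwStatus`, `activate`) are illustrative, not Omicron's actual API; the point is that one activation pushes state to every reachable sled, records the unreachable ones, and leaves them for the next activation rather than blocking on them.

```rust
use std::collections::BTreeMap;

type SledId = u32;

/// Result of one RPW activation: which sleds are up to date and which
/// still need the update (to be retried on the next activation, and
/// reportable to tools like omdb).
#[derive(Debug, Default)]
struct V2pRpwStatus {
    updated: Vec<SledId>,
    pending: Vec<SledId>,
}

/// One activation: try each sled, never block on an unreachable one.
/// Reachability is modeled as a plain bool here; the real work would be
/// a call to the sled agent's v2p endpoint.
fn activate(sleds: &BTreeMap<SledId, bool>) -> V2pRpwStatus {
    let mut status = V2pRpwStatus::default();
    for (&id, &reachable) in sleds {
        if reachable {
            status.updated.push(id);
        } else {
            status.pending.push(id);
        }
    }
    status
}
```

A saga step that merely activates such an RPW completes quickly, so an unreachable sled can no longer wedge instance start/delete.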
@gjcolombo would you object to retitling this "Instance start/delete sagas hang while sleds are unreachable"?
(edit: confirmed no objection offline)
In today's update call, we discussed whether this was a blocker for R8. The conclusion is "no" because this should not be made any worse during sled expungement. The sled we plan to expunge in R8 is not running any instances and so should not need to have its v2p mappings updated as part of instance create/delete sagas. Beyond that, all instances are generally stopped before the maintenance window starts, and when they start again, the sled will be expunged and so not included in the list of sleds to update.
@morlandi7 this should be resolved, but I left it open until someone verifies the work done in #5568 has actually resolved this issue on dogfood.
Checked the current behavior on rack2: I put sled 23 to A2 and provisioned a bunch of instances. All of them stayed in the `starting` state (they didn't transition to `running` after the sled was brought back to A0; that's a different problem to be investigated).
According to https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/sagas/instance_start.rs#L61-L62, which in turn references #3879, it looks like fixing this requires one (hopefully small) lift.
@askfongjojo I think that is an old comment that didn't get removed, as that saga node has already been updated (through a series of function calls) to use the NAT RPW. Do you have the instance IDs / any identifying information so I can check the logs to see what caused it to hang?
Ah, you are right. I retested just now and had no problem bringing up instances while one of the sleds was offline. I probably ran into an issue related to some bad downstairs when I tested that last time. This time I'm testing with a brand-new disk snapshot to avoid hitting the bad-downstairs problem again.
Repro environment: Seen on rack3 after a sled trapped into kmdb and became inoperable.
When an instance starts, the start saga calls `Nexus::create_instance_v2p_mappings` to ensure that every sled in the cluster knows how to route traffic directed to the instance's virtual IPs. This function calls `sled_list` to get the list of active sleds and then invokes the sled agent's `set_v2p` endpoint on each one. The calls to `set_v2p` are wrapped in a `retry_until_known_result` wrapper that treats Progenitor communication errors (including client timeouts) as transient errors requiring the operation to be retried. (Consider, for example, a request to do X that sled agent receives and begins processing but does not finish processing until Nexus has decided not to wait anymore; if this produces an error that unwinds the saga, X will not be undone, because a failure in a saga only undoes steps that previously completed successfully, not the one that produced the failure.) Instance deletion does something similar via `delete_instance_v2p_mappings`.

Rack3 has a sled that keeps panicking with symptoms of a known host OS issue. To better identify the problem, we set the sled up to drop into kmdb on panicking instead of rebooting. This rendered the sled's agent totally and permanently unresponsive. Since `retry_until_known_result` treats `progenitor_client::Error::CommunicationError`s as transient errors, this caused all subsequent instance creation and deletion attempts to get stuck retrying the same attempt to edit v2p mappings on the sled being debugged, causing the relevant instances to get stuck in the Creating/Stopped states (soon to be the Starting/Stopped states once #4194 lands).

There are several things to unpack here (probably into their own issues):
- `retry_until_known_result` doesn't have a way to bail out after a certain amount of time or number of attempts; even if it did, such a bailout would have to respect the undo rules for sagas described above.
- There's a `time_deleted` column in the sleds table, but the datastore's `sled_list` function doesn't filter on it, so `create_instance_v2p_mappings` won't ignore deleted sleds.
- Even if `sled_list` did ignore unhealthy sleds, there's still a race where `create_instance_v2p_mappings` decides to start talking to a sled before it's marked as unhealthy and never reconsiders that decision. (This feeds back into the first two items in this list.)
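The bounded-bailout idea in the first item could be sketched like this. This is not Omicron's actual `retry_until_known_result`; the types and function here are hypothetical. The key point survives the caveat about saga undo rules: when the wrapper gives up after a communication error, the operation's outcome is unknown, so the caller must treat `GaveUp` differently from a definitive server-side rejection.

```rust
// Error as seen by the caller of a sled-agent endpoint.
#[derive(Debug, PartialEq)]
enum CallError {
    // Timeout, connection refused, etc.: the outcome is unknown.
    Communication,
    // The server definitively rejected the request.
    Known(String),
}

// Outcome of the bounded retry loop.
#[derive(Debug, PartialEq)]
enum RetryOutcome<T> {
    Ok(T),
    KnownError(String),
    // Exhausted attempts on communication errors; the operation may or
    // may not have taken effect, so it cannot simply be unwound.
    GaveUp,
}

/// Retry `op` until it yields a known result, but bail out after
/// `max_attempts` tries instead of retrying forever.
fn retry_until_known_result_bounded<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, CallError>,
) -> RetryOutcome<T> {
    for _ in 0..max_attempts {
        match op() {
            Ok(v) => return RetryOutcome::Ok(v),
            Err(CallError::Known(msg)) => return RetryOutcome::KnownError(msg),
            Err(CallError::Communication) => continue, // transient: retry
        }
    }
    RetryOutcome::GaveUp
}
```

An RPW-based design sidesteps the hardest part of this: since the RPW will revisit every sled on its next activation anyway, giving up on one pass is safe in a way that giving up mid-saga is not.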