davepacheco opened this issue 2 years ago
> For comparison to the instance creation case, we do the following:
>
> 1. Call [instance_set_runtime](https://github.com/oxidecomputer/omicron/blob/d2bf956eb3d8c74e634668062ed96ae26ac9e566/nexus/src/sagas.rs#L830-L839) during `sic_instance_ensure`.
> 2. Call [instance_put](https://github.com/oxidecomputer/omicron/blob/d2bf956eb3d8c74e634668062ed96ae26ac9e566/nexus/src/nexus.rs#L1884-L1894) with all the prerequisite instance information.
> 3. Within the sled agent, this _synchronously_ starts the instance, and only returns once it completes.
Hmm. When I did the initial work on the mock sled agent, the intent of `instance_put` was to return as soon as possible with an acknowledgement of the request and an intermediate state (usually `Starting`). Then Sled Agent would subsequently emit a notification (in the form of an HTTP request back to Nexus) when the VM changed state (usually to `Running`). The reason is mainly that in my experience, VM boot can take an arbitrarily long time when things aren't working well, and that's exactly when you want a clear understanding of status and the ability to keep asking about it, at both the Nexus and Sled Agent levels. Also, TCP/HTTP aren't that well suited to arbitrarily long requests: connections remain in use for the duration; transient failures disrupt them and leave the client unsure of the state (not a big deal here because the request is idempotent); it's hard to tell whether the server has simply forgotten about it; and so on.
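To make that contract concrete, here's a minimal sketch of the acknowledge-then-notify model, assuming hypothetical names (`InstanceManager`, `start_vm`, `notify_nexus`) rather than the real omicron API:

```rust
// Hypothetical sketch of the intended asynchronous contract; these names
// are illustrative stand-ins, not the real omicron types or endpoints.
use std::sync::Arc;
use std::time::Duration;

#[derive(Clone, Copy, Debug)]
enum InstanceState {
    Starting,
    Running,
}

struct InstanceManager;

impl InstanceManager {
    async fn start_vm(&self) {
        // Stand-in for creating the zone and booting the VM; this can take
        // arbitrarily long when things aren't working well.
        tokio::time::sleep(Duration::from_millis(50)).await;
    }

    async fn notify_nexus(&self, state: InstanceState) {
        // Stand-in for the HTTP callback from Sled Agent to Nexus.
        println!("notify nexus: instance is now {state:?}");
    }
}

// The handler acknowledges the request right away with an intermediate
// state instead of holding the connection open until boot finishes.
async fn instance_put(manager: Arc<InstanceManager>) -> InstanceState {
    tokio::spawn(async move {
        manager.start_vm().await;
        manager.notify_nexus(InstanceState::Running).await;
    });
    InstanceState::Starting
}

#[tokio::main]
async fn main() {
    let ack = instance_put(Arc::new(InstanceManager)).await;
    println!("instance_put returned immediately with {ack:?}");
    // Give the background task a moment to finish in this demo.
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```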
When you say that the initial `instance_put` request synchronously starts the instance, do you mean that it waits for the instance to reach "running"? That would be different from what I think the mock sled agent does, and I think it would be a problem for the reasons above.
> However, we also get updates whenever the instance state changes, even after this initial request. This is because the sled agent itself implements a long-polling call to propolis, which only receives a response on state change.
>
> TL;DR:
>
> * I think using a non-exponential backoff poll would be a potential solution to this issue
While I agree that non-exponential polling would be an improvement, it still adds significant unnecessary latency to every provision. Beyond this specific case, I'm worried we'll replicate the pattern elsewhere too. This feels like how you end up with provisions taking 20 seconds, most of which are spent sleeping.
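To illustrate the latency concern, here's a small sketch with assumed backoff parameters (omicron's actual policy differs) comparing how late each strategy notices that the work finished:

```rust
use std::time::Duration;

fn main() {
    // Assume the disk actually becomes ready at the 10-second mark.
    let work_takes = Duration::from_secs(10);

    // Exponential backoff with an assumed 250ms initial delay, doubling each
    // attempt: we only notice completion when the sleep straddling the
    // 10-second mark ends, and that overshoot is pure added latency.
    let mut elapsed = Duration::ZERO;
    let mut delay = Duration::from_millis(250);
    while elapsed < work_takes {
        elapsed += delay;
        delay *= 2;
    }
    println!("exponential poll notices completion at {elapsed:?}"); // 15.75s

    // A constant once-per-second poll notices within one polling period.
    let mut elapsed = Duration::ZERO;
    while elapsed < work_takes {
        elapsed += Duration::from_secs(1);
    }
    println!("once-per-second poll notices completion at {elapsed:?}"); // 10s
}
```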
> * I think long-polling from Nexus -> Crucible would be a potential solution to this issue
> * I think this implementation is actually similar to the instance creation saga, as both synchronously wait for completion
Aren't we here because this implementation uses exponential backoff polling, not a synchronous request?
Here's the call stack on the sled agent side of things: within `InstanceManager::ensure`, the notable part is `Instance::start`, which creates a Propolis zone, starts the Propolis service, and sends an `instance_ensure` request to the Propolis server.
This is "synchronous" in the sense that it waits for the Propolis server to be up and running, and for it to receive the request. It's "asynchronous", I suppose, in the sense that we don't actually wait for it to be "running", just "created".
Part of the reason "instance monitoring" works well on the instance side of things - where the sled agent long-polls into propolis, and notifies nexus of state changes - is that the lifetime of an "instance" is always contained within the lifetime of the "sled agent" it's running on (modulo live migration, but in this context I'd consider that to be a "different instance").
In comparison, if we create a disk, it may exist on crucible downstairs services spread across multiple sleds. The lifetime is not coupled with an instance, and it makes less sense for any particular sled to be responsible for polling the disk status. Theoretically Nexus could be constantly long-polling to all disks in perpetuity, but that seems like a fairly high cost.
Alternatively, the underlying crucible downstairs service could make an upcall to Nexus, triggering state change updates, without an intermediary?
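For reference, here's a minimal sketch of the long-poll monitoring loop described above; the generation-number scheme and names are illustrative assumptions, not the exact propolis interface:

```rust
// Hypothetical sketch of sled agent's instance monitor loop.
#[derive(Clone, Copy, Debug)]
struct StateUpdate {
    gen: u64,
    running: bool,
}

struct PropolisClient;

impl PropolisClient {
    // Long-poll: in the real system this request would block until the
    // instance's state generation exceeds `gen`; here it resolves at once.
    async fn state_monitor(&self, gen: u64) -> StateUpdate {
        StateUpdate { gen: gen + 1, running: true }
    }
}

async fn notify_nexus(update: StateUpdate) {
    // Stand-in for the HTTP notification from sled agent to Nexus.
    println!("-> nexus: {update:?}");
}

async fn monitor_instance(client: PropolisClient) {
    let mut gen = 0;
    // Because each call only returns on an actual state change, sled agent
    // never busy-polls, yet Nexus hears about every transition.
    loop {
        let update = client.state_monitor(gen).await;
        gen = update.gen;
        notify_nexus(update).await;
        if update.running {
            break; // demo only; the real loop runs for the VM's lifetime
        }
    }
}

#[tokio::main]
async fn main() {
    monitor_instance(PropolisClient).await;
}
```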
I'm confused about what we're talking about now -- I thought we were talking about requests that Nexus makes to Crucible Agent while creating a disk? That's different from general monitoring of disk state (at least potentially), and I'm not sure why sled agent would be the client?
In the context of "what can we do for disk creation", yes, we could definitely just have Nexus long-poll to Crucible once, and have no further mechanism for receiving notifications about state changes.
However, I thought you were suggesting something akin to the instance creation saga, where we would return when the disk is in the "creating" state, and update it to the "running" state at some later point in time?
Yeah, so for that case, I would think whatever was responsible for doing those transitions would call back to Nexus.
Would it make sense to schedule a discussion? Or is this feeling like an unhelpful line of discussion?
In the last two Friday demos we tried a disk creation that appeared to work, but Nexus's state said "creating" for way longer than expected (minutes). The reason appeared to be exponential backoff in the saga, with each timeout triggering an attempt to create the disk again (using a request that would succeed if the disk were already created): https://github.com/oxidecomputer/omicron/blob/fd3dab12f1cbfdd9268449319e3db5eda8257136/nexus/src/sagas.rs#L1228-L1271
I don't think it makes sense to use exponential backoff for this case because there's no indication that the remote side is overloaded -- it just isn't finished yet. The `internal_service_policy()` being used at L1266 is intended for background connections or requests to internal services where we want to retry indefinitely -- definitely nothing latency-sensitive. We could create a policy that's much more aggressive (e.g., once per second), but really, this isn't retrying a failure, it's just waiting for something to happen. There are other ways to do this:
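One way to frame that distinction in code is a fixed-interval "wait for a condition" helper with an explicit deadline, rather than an error-driven retry policy. This is a sketch under assumptions -- `wait_for_state` and its parameters are illustrative, not an existing omicron helper:

```rust
use std::future::Future;
use std::time::{Duration, Instant};

// Poll `check` at a fixed interval until it reports the state we're waiting
// for, giving up after `deadline`. This is waiting, not failure retry, so
// there's no reason for the interval to grow.
async fn wait_for_state<F, Fut>(
    mut check: F,
    period: Duration,
    deadline: Duration,
) -> Result<(), &'static str>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = bool>,
{
    let start = Instant::now();
    loop {
        if check().await {
            return Ok(());
        }
        if start.elapsed() >= deadline {
            return Err("timed out waiting for state");
        }
        tokio::time::sleep(period).await;
    }
}

#[tokio::main]
async fn main() {
    // Demo: the "disk" becomes ready on the third poll.
    let mut polls = 0;
    let result = wait_for_state(
        || {
            polls += 1;
            let ready = polls >= 3;
            async move { ready }
        },
        Duration::from_secs(1),  // poll once per second...
        Duration::from_secs(30), // ...but give up after 30 seconds
    )
    .await;
    println!("result after {polls} polls: {result:?}");
}
```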