oxidecomputer / omicron

Omicron: Oxide control plane

disk creation saga needs different mechanism to wait on creation #996

Open davepacheco opened 2 years ago

davepacheco commented 2 years ago

In the last two Friday demos we tried a disk creation that appeared to work, but Nexus's state said "creating" for way longer than expected (minutes). The reason appeared to be exponential backoff in the saga, with each timeout triggering an attempt to create the disk again (using a request that would succeed if the disk were already created): https://github.com/oxidecomputer/omicron/blob/fd3dab12f1cbfdd9268449319e3db5eda8257136/nexus/src/sagas.rs#L1228-L1271

I don't think it makes sense to use exponential backoff for this case because there's no indication that the remote side is overloaded -- it just isn't finished yet. The internal_service_policy() being used at L1266 is intended for background connections or requests to internal services where we want to retry indefinitely -- definitely nothing latency-sensitive. We could create one that's much more aggressive (e.g., once/second), but really, this isn't retrying a failure, it's just waiting for something to happen. There are other ways to do this:

  1. For instance creation, we have the sled send us a notification when the instance state changes. I like this pattern because it also works for the case where the state changes unexpectedly (e.g., a bhyve crash), though I think it creates a different problem if the notification is received by a different Nexus than the one operating the saga.
  2. Josh suggested a long poll -- instead of having the request complete immediately with a status meaning "it's not ready yet", it could wait a bounded amount of time (say, 30 seconds). If the disk creation completes within that window, the server responds immediately saying so. If not, it responds with "not ready yet" and the client immediately retries. (A rough sketch of the client side follows this list.)
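
To make option 2 concrete, here's a rough sketch of what the client side could look like. `PollResult` and `poll_once` are placeholders rather than any existing API -- they stand in for whatever the Crucible agent would actually expose:

```rust
use std::future::Future;
use std::time::Duration;

/// Placeholder for whatever the long-poll endpoint would return.
enum PollResult {
    Ready,
    NotReadyYet,
}

/// Client side of the long-poll idea: each request asks the server to wait up
/// to `max_wait` (say, 30 seconds) before answering.  If the disk becomes
/// ready within that window, the server says so right away; otherwise it
/// answers "not ready yet" and the client immediately asks again -- no sleep
/// and no backoff in between.
async fn wait_via_long_poll<F, Fut>(
    mut poll_once: F,
    max_wait: Duration,
) -> Result<(), String>
where
    F: FnMut(Duration) -> Fut,
    Fut: Future<Output = Result<PollResult, String>>,
{
    loop {
        match poll_once(max_wait).await? {
            PollResult::Ready => return Ok(()),
            // The server already held the request open for up to `max_wait`,
            // so retrying immediately is fine.
            PollResult::NotReadyYet => continue,
        }
    }
}
```
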
smklein commented 2 years ago

For comparison to the instance creation case, we do the following:

  1. Call instance_set_runtime during sic_instance_ensure.
  2. Call instance_put with all the prerequisite instance information.
  3. Within the sled agent, this synchronously starts the instance, and only returns once it completes.

However, we also get updates whenever the instance state changes, even after this initial request. This is because the sled agent itself implements a long-polling call to propolis, which only receives a response on state change.
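
Schematically, that monitoring pattern looks something like this (placeholder types -- not the actual sled agent or Propolis client API):

```rust
/// Placeholder handles -- not the real client types.
struct PropolisHandle;
struct NexusHandle;

#[derive(Debug, Clone)]
struct InstanceState(String);

impl PropolisHandle {
    /// Long-poll against the Propolis server; resolves only when the
    /// instance's state actually changes.
    async fn next_state_change(&self) -> InstanceState {
        unimplemented!()
    }
}

impl NexusHandle {
    /// HTTP request to Nexus's internal API reporting the new state.
    async fn notify_instance_updated(&self, _state: &InstanceState) {
        unimplemented!()
    }
}

/// The sled agent's monitoring loop: every Propolis state transition gets
/// pushed up to Nexus, independent of the original instance_put request.
async fn monitor_instance(propolis: PropolisHandle, nexus: NexusHandle) {
    loop {
        let state = propolis.next_state_change().await;
        nexus.notify_instance_updated(&state).await;
    }
}
```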

TL;DR:

* I think using a non-exponential backoff poll would be a potential solution to this issue
* I think long-polling from Nexus -> Crucible would be a potential solution to this issue
* I think this implementation is actually similar to the instance creation saga, as both synchronously wait for completion

davepacheco commented 2 years ago

> For comparison to the instance creation case, we do the following:
>
> 1. Call [instance_set_runtime](https://github.com/oxidecomputer/omicron/blob/d2bf956eb3d8c74e634668062ed96ae26ac9e566/nexus/src/sagas.rs#L830-L839) during `sic_instance_ensure`.
> 2. Call [instance_put](https://github.com/oxidecomputer/omicron/blob/d2bf956eb3d8c74e634668062ed96ae26ac9e566/nexus/src/nexus.rs#L1884-L1894) with all the prerequisite instance information.
> 3. Within the sled agent, this _synchronously_ starts the instance, and only returns once it completes.

Hmm. When I did the initial work on the mock sled agent, the intent of instance_put was to return as soon as possible with an acknowledgement of the request and an intermediate state (usually Starting). Then subsequently Sled Agent would emit a notification (in the form of an HTTP request back to Nexus) when the VM changed state (usually to Running). The reason is mainly that in my experience, VM boot can take an arbitrarily long time when things aren't working well, and that's when you want a clear understanding of status and the ability to keep asking about it, both at the Nexus and Sled Agent levels. Also, TCP/HTTP aren't that well-suited to arbitrarily-long requests -- connections remain in use for the duration, transient failures disrupt them and the client doesn't know what the state is (not a big deal here because it's idempotent), it's hard to tell if the server has just forgotten about it, etc.
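
In other words, the intent was roughly the following shape (placeholder names, not the actual sled agent handler):

```rust
/// Placeholder state type (not the real omicron types).
#[derive(Debug, Clone, PartialEq)]
enum IntermediateState {
    Starting,
}

/// Stand-in for the sled agent's instance manager.
struct InstanceManagerStub;

impl InstanceManagerStub {
    /// Kick off zone/Propolis setup and VM boot in the background; the
    /// monitoring path reports later state changes back to Nexus.
    fn start_in_background(&self) {
        // spawn the actual work here
    }
}

/// Intended shape of instance_put: acknowledge the request right away with an
/// intermediate state rather than holding the HTTP request open for an
/// arbitrarily long boot.
fn instance_put_sketch(mgr: &InstanceManagerStub) -> IntermediateState {
    mgr.start_in_background();
    IntermediateState::Starting
}
```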

When you say that the initial instance_put request synchronously starts the instance, do you mean that it waits for it to reach "running"? That'd be different from what I think the mock sled agent does, and I think it'd be a problem for the reasons above.

> However, we also get updates whenever the instance state changes, even after this initial request. This is because the sled agent itself implements a long-polling call to propolis, which only receives a response on state change.
>
> TL;DR:
>
> * I think using a non-exponential backoff poll would be a potential solution to this issue

While I agree that non-exponential polling would be an improvement, it still adds significant unnecessary latency to every provision. Beyond this specific case, I'm worried we'll replicate the pattern elsewhere too. This feels like how you wind up with provisions taking 20 seconds, most of which are spent sleeping.

> * I think long-polling from Nexus -> Crucible would be a potential solution to this issue

> * I think this implementation is actually similar to the instance creation saga, as both synchronously wait for completion

Aren't we here because this implementation uses exponential backoff polling, not a synchronous request?

smklein commented 2 years ago

Here's the callstack on the sled agent side of things:

Within InstanceManager::ensure, this part is particularly notable:

https://github.com/oxidecomputer/omicron/blob/d2bf956eb3d8c74e634668062ed96ae26ac9e566/sled-agent/src/instance_manager.rs#L158-L168

Instance::start creates a Propolis zone, starts the Propolis service, and sends an "instance_ensure" request to the Propolis server.

This is "synchronous" in the sense that it waits for the Propolis server to be up and running, and for it to receive the request. It's "asynchronous", I suppose, in the sense that we don't actually wait for it to be "running", just "created".

smklein commented 2 years ago

Part of the reason "instance monitoring" works well on the instance side of things -- where the sled agent long-polls into propolis and notifies nexus of state changes -- is that the lifetime of an "instance" is always a subset of the "sled agent" it's running on (modulo live migration, but in this context I'd consider that to be a "different instance").

In comparison, if we create a disk, it may exist on crucible downstairs services spread across multiple sleds. The lifetime is not coupled with an instance, and it makes less sense for any particular sled to be responsible for polling the disk status. Theoretically Nexus could be constantly long-polling to all disks in perpetuity, but that seems like a fairly high cost.

Alternatively, the underlying crucible downstairs service could make an upcall to Nexus, triggering state change updates, without an intermediary?
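
For illustration only -- the endpoint path and payload below are made up, and this assumes reqwest/serde/uuid are available -- the upcall could look something like:

```rust
use serde::Serialize;
use uuid::Uuid;

/// Hypothetical payload; the real internal API would define its own types.
#[derive(Serialize)]
struct DiskStateUpdate {
    disk_id: Uuid,
    new_state: String,
}

/// Whatever component owns the disk state transition (a Crucible service, or
/// some intermediary) pushes each change to Nexus instead of being polled.
/// The URL path here is invented purely to show the shape of the interaction.
async fn notify_nexus_of_disk_state(
    nexus_internal_url: &str,
    update: &DiskStateUpdate,
) -> Result<(), reqwest::Error> {
    reqwest::Client::new()
        .put(format!("{}/disks/{}/state", nexus_internal_url, update.disk_id))
        .json(update)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```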

davepacheco commented 2 years ago

I'm confused about what we're talking about now -- I thought we were talking about requests that Nexus makes to Crucible Agent while creating a disk? That's different from general monitoring of disk state (at least potentially), and I'm not sure why sled agent would be the client?

smklein commented 2 years ago

In the context of "what can we do for disk creation", yes, we could definitely just have Nexus long-poll to Crucible once, and have no further mechanism for receiving notifications about state changes.

However, I thought you were suggesting something akin to the instance creation saga, where we would return when the disk is in the "creating" state, and update it to the "running" state at some later point in time?

davepacheco commented 2 years ago

Yeah, so for that case, I would think whatever was responsible for doing those transitions would call back to Nexus.

Would it make sense to schedule a discussion? Or is this feeling like an unhelpful line of discussion?