askfongjojo opened 8 months ago
Talked to the customer just now. They noted that making the instance start operation a synchronous part of instance create would not be ideal either. The current behavior is acceptable to them, and they can simply adjust their workflow to do an extra check on instance state after provisioning.
I checked the current behavior again and it looks like the instance create saga does wait for the instance start saga to complete, but the start saga is considered done as soon as sled-agent acknowledges the start request. In other words, it doesn't wait until the instance moves to the "running" state before declaring instance start done.
In the cases I've attempted, a successful instance start results in a "starting" `run_state` in the API response, whereas a failed instance start results in a "stopped" `run_state`. That said, there may be a timing element involved that makes the `run_state` value not definitive enough to distinguish the two cases.
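For reference, here is a minimal sketch of the client-side workaround described above: after create returns 200, poll the instance's `run_state` to distinguish a VM that is actually coming up ("starting" → "running") from one whose start silently failed ("stopped"). The base URL, endpoint path, and retry policy are assumptions for illustration; only the `run_state` values come from this discussion.

```rust
use std::{thread, time::Duration};

/// Poll an instance after create until it reaches "running", or report
/// that it appears to have failed to start. Hypothetical helper, not an
/// SDK function.
fn wait_for_running(base_url: &str, project: &str, instance: &str) -> Result<(), String> {
    let client = reqwest::blocking::Client::new();
    let url = format!("{base_url}/v1/instances/{instance}?project={project}");

    for _ in 0..30 {
        let state: serde_json::Value = client
            .get(&url)
            .send()
            .and_then(|r| r.json())
            .map_err(|e| e.to_string())?;

        match state["run_state"].as_str() {
            // Still coming up: keep polling.
            Some("creating") | Some("starting") => {}
            Some("running") => return Ok(()),
            // A freshly created instance that fell back to "stopped" most
            // likely failed to start (e.g. insufficient capacity).
            Some("stopped") => return Err("instance failed to start".into()),
            other => return Err(format!("unexpected run_state: {other:?}")),
        }
        thread::sleep(Duration::from_secs(2));
    }
    Err("timed out waiting for instance to reach running".into())
}
```

As noted above, the "stopped" check is subject to a timing race, so a production workflow would probably also want to inspect the instance's state history or any surfaced error rather than rely on `run_state` alone.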
There's another thing to consider here: if the error was plumbed through the API (e.g. a 400 or 500 error code when the user was trying to create and start an instance but it failed to start), what would we want to do with the state of the system? If that meant we unwound the creation of the instance, then the user would have to start all over with the creation. That doesn't seem like it would be a good user experience.
Ultimately, creating an instance and starting an instance are two separate processes, and I think the interaction we have today is the way it should work. The real thing we're missing is alerting the user to the fact that some async process they kicked off failed.
> If that meant we unwound the creation of the instance, then the user would have to start all over with the creation. That doesn't seem like it would be a good user experience.
Agreed. The intent of this request is to pass on the error message without unwinding the create request. But this may be an anti-pattern for the saga workflow (i.e. any step failure would normally trigger the unwind workflow), making the implementation more difficult/awkward than expected. Maybe we can pass back an HTTP 207 in this case to indicate partial success, along with an appropriate error message?
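To make the 207 idea concrete, here is a rough sketch of what a partial-success create response could carry: the provisioned instance plus the error from the failed start. All type and field names here are hypothetical, not the Nexus API; the 507 InsufficientCapacity case from the issue description is used as the example.

```rust
use serde::Serialize;

/// Hypothetical body for an HTTP 207 response to instance create when the
/// create succeeded but the follow-on start did not.
#[derive(Serialize)]
struct InstanceCreatePartialResponse {
    /// The instance record, as returned by a fully successful create.
    instance: serde_json::Value,
    /// Populated when the follow-on start failed.
    start_error: Option<StartError>,
}

#[derive(Serialize)]
struct StartError {
    status: u16,        // e.g. 507
    error_code: String, // e.g. "InsufficientCapacity"
    message: String,
}

fn main() {
    let resp = InstanceCreatePartialResponse {
        instance: serde_json::json!({ "name": "web-1", "run_state": "stopped" }),
        start_error: Some(StartError {
            status: 507,
            error_code: "InsufficientCapacity".into(),
            message: "no sled with sufficient vCPU and memory".into(),
        }),
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}
```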
When the control plane cannot find a sled with the required vCPU and memory to place a VMM, instance start fails with an HTTP 507 InsufficientCapacity error. This error is returned to the API caller only during the instance start operation, not the instance create operation. Currently, an instance create request that successfully provisions the VM but fails to start it still returns an HTTP 200 response, leading the user to believe the VM is ready to be used when it's not.