oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

503 error when snapshotting disks attached to stopped instances #3289

Closed askfongjojo closed 6 months ago

askfongjojo commented 1 year ago

I think I have seen this 503 before but mistook it for a control plane issue:

$ oxide snapshot create --name loadgen-snap-stoppedvm --disk loadgen --description "disk attached to vm" --project try
error
Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f", "content-length": "133", "date": "Sun, 04 Jun 2023 06:29:02 GMT"}; value: Error { error_code: Some("ServiceNotAvailable"), message: "Service Unavailable", request_id: "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f" }

(Once I started up the VM, I was able to create snapshots for both of the disks attached to this vm on rack2.)

If snapshots are prohibited on disks attached to stopped instances, we should probably reject the snapshot request with a more explicit error. A 503 response implies that the service is only temporarily unavailable and that the user may retry the action.

If snapshots should be allowed on such disks, then it is a functional issue.

jmpesp commented 1 year ago

In this case, it's a functional issue. Taking a snapshot of an attached disk is allowed.

jmpesp commented 1 year ago

Here's the bug in the snapshot create saga:

https://github.com/oxidecomputer/omicron/blob/c2de4805acfd0c20bef9d87afacac9f8e821ad87/nexus/src/app/sagas/snapshot_create.rs#L820-L853

The disk state here is expected to be Detached, but if the disk is attached to a stopped instance this match will return 503. The part of Nexus that checks whether or not the Pantry should be used to take a snapshot says to use the Pantry if the instance is stopped:

https://github.com/oxidecomputer/omicron/blob/c2de4805acfd0c20bef9d87afacac9f8e821ad87/nexus/src/app/snapshot.rs#L106-L126

More work is needed here: at a minimum, the disk's state changes to Maintenance as part of this saga, and that change has to interact with (read: block) the instance starting. This may not be a candidate for FCS though, since starting the instance is an available workaround?
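
To make the mismatch concrete, here is a minimal sketch of the two checks described above. It is not the actual Nexus code: the types and the names `use_pantry`, `pantry_saga_step`, and `create_snapshot` are hypothetical simplifications, and the real saga does considerably more (including moving the disk to Maintenance).

```rust
// Hypothetical, simplified model of the behavior described in this issue.
// None of these types or functions are the real Omicron definitions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceState {
    Running,
    Stopped,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiskState {
    Detached,
    // Assumption: the attached state carries the instance's run state.
    Attached(InstanceState),
}

#[derive(Debug, PartialEq, Eq)]
pub enum SnapshotError {
    // Surfaces to the client as "503 Service Unavailable".
    ServiceUnavailable,
}

/// First check: choose the Pantry when no running instance can take the
/// snapshot, i.e. the disk is detached or its instance is stopped.
pub fn use_pantry(disk: DiskState) -> bool {
    match disk {
        DiskState::Detached => true,
        DiskState::Attached(InstanceState::Stopped) => true,
        DiskState::Attached(InstanceState::Running) => false,
    }
}

/// Second check, inside the Pantry path of the saga: only Detached is
/// accepted, so a disk attached to a stopped instance falls through to 503.
pub fn pantry_saga_step(disk: DiskState) -> Result<(), SnapshotError> {
    match disk {
        // The real saga would move the disk to Maintenance here.
        DiskState::Detached => Ok(()),
        _ => Err(SnapshotError::ServiceUnavailable),
    }
}

/// Combines the two checks in the order the issue describes.
pub fn create_snapshot(disk: DiskState) -> Result<&'static str, SnapshotError> {
    if use_pantry(disk) {
        pantry_saga_step(disk)?;
        Ok("snapshot via pantry")
    } else {
        Ok("snapshot via running instance")
    }
}
```

In this model, a detached disk and a disk attached to a running instance both succeed, while `create_snapshot(DiskState::Attached(InstanceState::Stopped))` returns `Err(SnapshotError::ServiceUnavailable)`: the first check routes the request to the Pantry, and the second check rejects it.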

askfongjojo commented 1 year ago

Thanks for root-causing/sizing this. Let's re-target this to MVP given the effort and impact involved. I'll document this known issue in the release notes because users will likely want to create clean snapshots on stopped instances (so that things such as temporary file locks created by running applications won't be part of the snapshot).

askfongjojo commented 1 year ago

I want to note that a customer indicated that they would like to see a fix for this issue sooner, for the same reason I mentioned above (i.e. their best practice is to create snapshots on stopped instances).