oxidecomputer / oxide.rs

The Oxide Rust SDK and CLI
Mozilla Public License 2.0
37 stars 12 forks source link

Cleanup partially imported disks on CTRL+C #760

Open wfchandler opened 1 month ago

wfchandler commented 1 month ago

If a disk import request is interrupted then the disk is likely to be left in a ImportReady or ImportingFromBulkWrites state, which prevents the user from deleting it. The commonly occurs when users interrupt the request using ^C, and is a frustrating experience.

Add a new GracefulShutdown struct to listen for SIGINTs. This will cancel the executing Future on interrupt and run a cleanup task which will return an error, even if cleanup succeeds. A second SIGINT causes the process to exit immediately.

Hook calls make in the critical section of the import into the new system to ensure we delete the partially imported disk.

This introduces a race where the user may cancel the import operation while the disk creation saga is still executing and the disk is in Creating state. Attempting to finalize or delete the disk in this state will fail, so we are forced to wait until it has transitioned to ImportReady before proceeding. To avoid spamming the API too heavily, we poll the disk state every 500ms for no more than ten tries.

The new polling creates a problem for our testing, though. We need the initial disk API query to return a 404, but subsequent ones to find the disk. httpmock does not provide a way to remove a mock after N hits, and we have no graceful way of replacing the mock while running the oxide binary. There is an open issue to add an option like this to httpmock, but for right now we have no option but to delete the tests that check for failure.

Closes https://github.com/oxidecomputer/oxide.rs/issues/651

wfchandler commented 1 month ago

Recording:

https://github.com/user-attachments/assets/c6dccac1-95f6-49d6-914a-5af4bfac5d36

wfchandler commented 1 month ago

Opened https://github.com/alexliesenfeld/httpmock/pull/108 to add a delete_after_n method to httpmock that would allow us to clear the initial mock. In the meantime I've just removed the failing tests in this PR. We can add them back in if/when my change to httpmock is accepted.

ahl commented 1 month ago

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.

From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

wfchandler commented 1 month ago

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.

From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

ahl commented 1 month ago

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup. From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

@wfchandler I was hoping we might change the approach here to do cleanup when we drop the future. Do you think that's viable?

wfchandler commented 2 weeks ago

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup. From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

@wfchandler I was hoping we might change the approach here to do cleanup when we drop the future. Do you think that's viable?

@ahl I think the latest revision of this PR should be compatible with your SDK changes. We can replace self.upload_disk with the equivalent SDK command when that's available. WDYT?