jordanhendricks opened 1 year ago
@jordanhendricks Just to check, is that the entire unabridged server log or just a snippet? This reminds me a bit of #302 (Crucible was slow to pause during clean VM shutdown), but there we had some extra clues in the logs (messages about state-driver-led state transitions and device pause requests) that helped point us to the problem.
the rest of the server log:
This is indeed the cousin of #302. First we try to pause all entities:
```
Apr 23 20:26:11.813 INFO State worker handling event, event: External(Stop), component: vm_state_worker
Apr 23 20:26:11.815 INFO Sending pause request to block-crucible-0cedae45-3d6e-4d90-b2cb-56f1a1a42a89, component: vm_controller
```
But the call to pause Crucible is synchronous, and Crucible is having a difficult time, so it doesn't pause right away. 12 seconds later, a request to start appears and is rejected because the VM is going away:
```
Apr 23 20:26:23.438 INFO Queuing external request, disposition: Deny(HaltPending), request: start, component: external_request_queue
```
33 seconds later, Crucible unsticks itself and finishes pausing, which lets the VM controller proceed to actually shut down the VM:
```
Apr 23 20:26:56.955 INFO vcpu_tasks: exit all
Apr 23 20:26:56.955 INFO Sending pause request to qemu-fwcfg, component: vm_controller
Apr 23 20:26:56.955 INFO Sending pause request to qemu-ramfb, component: vm_controller
Apr 23 20:26:56.955 INFO vm_controller: halt entities
Apr 23 20:26:56.955 INFO Waiting for entities to pause, component: vm_controller
```
After this, the instance is destroyed (i.e. no longer ensured), so attempting to run it fails with an "instance not present" error instead of an "invalid transition" error.
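To make the distinction concrete, here's a minimal sketch of the two denials; `VmController`, `halt_pending`, and `try_start` are illustrative stand-ins, not the actual propolis types:

```rust
/// Illustrative stand-in for the VM controller; `halt_pending` is a
/// hypothetical flag, not a real propolis field.
struct VmController {
    halt_pending: bool,
}

enum StartError {
    /// The VM still exists but is on its way down, so the external
    /// request queue denies the start (the Deny(HaltPending) line above).
    InvalidTransition,
    /// The VM has been destroyed and is no longer ensured, so there is
    /// nothing to transition at all.
    InstanceNotPresent,
}

/// Sketch: which error a start request gets depends on whether the
/// controller still exists and whether a halt is already in flight.
fn try_start(vm: Option<&VmController>) -> Result<(), StartError> {
    match vm {
        None => Err(StartError::InstanceNotPresent),
        Some(c) if c.halt_pending => Err(StartError::InvalidTransition),
        Some(_) => Ok(()), // request would be queued normally
    }
}
```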
Note, however, that any long-running entity pause operation, whether synchronous (the call to pause the entity didn't return right away, as in this case) or asynchronous (the `paused` future takes forever to complete), will cause this sort of problem.
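As a rough sketch of why both flavors can wedge shutdown (these names are hypothetical, not propolis's actual entity trait), the pause path has two places to get stuck:

```rust
use futures::future::BoxFuture;

/// Hypothetical entity lifecycle interface, assuming a synchronous
/// pause request plus an async completion signal; not the real
/// propolis trait.
trait Entity: Send + Sync {
    /// Ask the entity to stop accepting new work. If this call blocks
    /// (as the Crucible pause did here), the state worker stalls
    /// before it can pause anything else.
    fn pause(&self);

    /// Resolves once the entity has actually quiesced. If this future
    /// never completes, "Waiting for entities to pause" hangs instead.
    fn paused(&self) -> BoxFuture<'static, ()>;
}
```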
We've seen this issue crop up in dogfood, with instances being stuck stopping/rebooting (usually because crucible is stuck). One example is oxidecomputer/crucible#837.
I don't think this is a propolis problem per se: I think what we really need is for entities to forcefully shut themselves down when requested to. I presume that will require some changes on the crucible end, but I haven't looked into what that would entail yet.
Crucible did have a bug that prevented it from completing the deactivation, but crucible should also handle getting the rug pulled out from under it (though cleanup may be required, which would increase future activation time). We should look at the interface that propolis is using to disconnect from crucible and see if there is some additional disconnect interface we need to add to support this.
I think the current behavior is to try to disconnect and wait (possibly forever) until the upstairs has a clean shutdown.
Would we want another behavior, where there is a timeout or something like that, so we bound the wait and have crucible pull the rug out from under itself if the timer elapses and return to propolis?
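A minimal sketch of that bounded wait, assuming a tokio runtime; `Upstairs`, `deactivate`, and `force_disconnect` are stand-ins here, not the real crucible API:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Stand-in for the upstairs handle; both methods are assumptions.
struct Upstairs;

impl Upstairs {
    /// Clean deactivation: flush outstanding work, then disconnect.
    async fn deactivate(&self) { /* ... */ }

    /// Hypothetical escape hatch: drop the connections immediately,
    /// accepting extra cleanup work on the next activation.
    fn force_disconnect(&self) { /* ... */ }
}

/// Bound the wait for a clean shutdown; if the timer elapses, pull
/// the rug out so propolis can finish stopping the VM.
async fn pause_with_deadline(upstairs: &Upstairs, limit: Duration) {
    if timeout(limit, upstairs.deactivate()).await.is_err() {
        upstairs.force_disconnect();
    }
}
```

The tradeoff is the one noted above: a forced disconnect buys a bounded stop time at the cost of potentially slower reactivation later.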
One thing I noticed in my latest round of debugging #336 was that I couldn't stop or reboot the guest in question. In the case of #336, the crucible upstairs and downstairs were incompatible, and the failure mode was that only a handful of I/Os were making it through, so the guest wasn't able to do much. While this obviously isn't the happy path, I think we need to be able to stop and reboot instances that get stuck in this way.
It's easy to reproduce some form of #336 by combining a downstairs at `894d44` and an upstairs at `e7ce7a`. For this case, `reboot` seems to work, but `stop` failed, first with a 400, then a 500. Server logs:
I can't recall if this is the exact failure mode I saw when debugging #336, but this particular instance of it feels in the same realm as #363 (maybe?).