oxidecomputer / propolis

VMM userspace for illumos bhyve

Better error reporting mechanism needed #335

Open pfmooney opened 1 year ago

pfmooney commented 1 year ago

When running a guest, there are certain conditions which may cause one or more of the vCPUs to exit from VM context with an event we are unable to properly handle. Notable examples of this include #333 and #300, where the guests appear to have jumped off into "space", leaving the instruction fetch/decode/emulate machinery unable to do its job.
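To make the failure mode concrete, here is a minimal, hypothetical sketch of how an unhandled vm-exit surfaces in a vCPU run loop. The `ExitKind` enum and the helper functions are illustrative stand-ins, not propolis's actual types or interfaces.

```rust
// Hypothetical sketch (not the real propolis API): a vCPU run loop that
// classifies vm-exits and falls back to aborting on ones it cannot handle.

#[derive(Debug)]
enum ExitKind {
    Io { port: u16, write: bool },
    Mmio { gpa: u64 },
    /// The instruction fetch/decode/emulate machinery could not make sense
    /// of the exit (e.g. %rip points off into "space").
    Unhandled { rip: u64, code: u32 },
}

fn run_vcpu_once(exit: ExitKind) {
    match exit {
        ExitKind::Io { port, write } => handle_io(port, write),
        ExitKind::Mmio { gpa } => handle_mmio(gpa),
        // Today the only recourse here is to abort the whole process,
        // preserving userspace state in the resulting core file.
        ExitKind::Unhandled { rip, code } => abort_with_core(rip, code),
    }
}

fn handle_io(_port: u16, _write: bool) {}
fn handle_mmio(_gpa: u64) {}
fn abort_with_core(rip: u64, code: u32) -> ! {
    panic!("unhandled vm-exit: rip={rip:#x} code={code}");
}

fn main() {
    run_vcpu_once(ExitKind::Io { port: 0x3f8, write: true });
}
```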

The course of action we choose in such situations involves trade-offs. The current behavior of aborting the propolis process has the advantage of preserving the maximum amount of state from both the userspace emulation (saved in the core) and the kernel VMM and instance state (residing in the vmm device instance, as long as it is not removed). This may be beneficial during development, but for hosts running in production it is likely less than ideal.

Consumers in production likely expect a VM encountering an error condition like that to reboot, as if it had tripped over something like a triple-fault on a CPU. Rebooting the instance promptly at least allows it to return to service quickly. In such cases, we need to think about which bits of state we would want preserved from the machine and the fault conditions so they can be used for debugging later. In addition to the details about the vm-exit on the faulting vCPU(s), we could export the entire emulated device state (not counting DRAM) as if a migration were occurring. Customer policy could potentially choose to prune that down, or even augment it with additional state from the guest (perhaps the page of memory underlying %rip at the time of exit?).
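As a rough illustration of the shape such preserved state could take, here is a small sketch, assuming hypothetical names throughout; `FailureSnapshot`, `VcpuFault`, and `export_device_state` are assumptions for discussion, not existing propolis interfaces.

```rust
// Hypothetical "failure snapshot" captured before rebooting the guest.

struct VcpuFault {
    vcpu_id: u8,
    rip: u64,
    exit_code: u32,
    /// Optionally, the page of guest memory underlying %rip at exit time.
    rip_page: Option<Vec<u8>>,
}

struct FailureSnapshot {
    faults: Vec<VcpuFault>,
    /// Emulated device state serialized the way a migration payload would
    /// be, but without guest DRAM.
    device_state: Vec<u8>,
}

fn export_device_state() -> Vec<u8> {
    // A real implementation would walk the emulated device tree and emit
    // the same per-device payloads used for live migration.
    Vec::new()
}

fn snapshot_on_fault(faults: Vec<VcpuFault>) -> FailureSnapshot {
    FailureSnapshot { faults, device_state: export_device_state() }
}

fn main() {
    let snap = snapshot_on_fault(vec![VcpuFault {
        vcpu_id: 0,
        rip: 0xdead_beef,
        exit_code: 0,
        rip_page: None,
    }]);
    println!(
        "captured {} faulting vCPU(s), {} bytes of device state",
        snap.faults.len(),
        snap.device_state.len()
    );
}
```

Whatever the final format, keeping it aligned with the migration payload would mean the pruning/augmentation policy can be layered on top rather than baked into the capture path.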

With such a mechanism in place, we could still preserve the abort-on-unhandled-vmexit behavior if it is desired by developer workflows, but default to the more graceful mechanism for all other cases.
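The choice between the two behaviors could be a simple policy knob; a minimal sketch, assuming an illustrative `UnhandledExitPolicy` name and a production default of the graceful path:

```rust
// Hypothetical policy selection for unhandled vm-exits; names are
// illustrative only.

#[derive(Debug, Clone, Copy, PartialEq)]
enum UnhandledExitPolicy {
    /// Abort the process, leaving a core file and the kernel VMM state
    /// behind (useful for developer workflows).
    Abort,
    /// Capture a failure snapshot and reboot the instance (the proposed
    /// default for production).
    SnapshotAndReboot,
}

impl Default for UnhandledExitPolicy {
    fn default() -> Self {
        UnhandledExitPolicy::SnapshotAndReboot
    }
}

fn main() {
    assert_eq!(
        UnhandledExitPolicy::default(),
        UnhandledExitPolicy::SnapshotAndReboot
    );
}
```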

hawkw commented 2 months ago

There's work currently in progress in the control plane to allow Nexus to automatically restart instances whose propolis-server has crashed (if configured to do so). In particular, oxidecomputer/omicron#6455 moves instances to the Failed state when their VMM has crashed, and oxidecomputer/omicron#6503 will add an RPW for restarting Failed instances, if they have an "auto-restart" configuration set.

Potentially, we could leverage that here and just allow propolis-servers that encounter this kind of guest misbehavior to crash and leave behind a core dump, knowing that the control plane will restart the instance if that's what the user wanted. On the other hand, this is potentially less efficient than restarting the guest within the same propolis-server, since it requires the control plane to spin up a whole new VMM and start the instance there. But, I figured it was worth mentioning!

pfmooney commented 2 months ago

In the case of #755 (and similar circumstances), I don't think that crashing is at all ideal. If we have a mechanism for surfacing information for support, it's probably more sensible to collect additional information about the state of the guest (registers, etc.), since figuring that out from the propolis core dump alone will be challenging, if not impossible.
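As a rough sketch of the kind of per-vCPU register report that could be surfaced for support instead of (or alongside) a core dump: the register set shown is a subset, and the values and collection interface are placeholders rather than the actual bhyve/propolis API.

```rust
// Hypothetical per-vCPU register report emitted on an unhandled exit.

use std::fmt;

struct GuestRegs {
    rip: u64,
    rsp: u64,
    rflags: u64,
    cr0: u64,
    cr3: u64,
    cr4: u64,
}

impl fmt::Display for GuestRegs {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(
            f,
            "rip={:#018x} rsp={:#018x} rflags={:#x}",
            self.rip, self.rsp, self.rflags
        )?;
        write!(f, "cr0={:#x} cr3={:#x} cr4={:#x}", self.cr0, self.cr3, self.cr4)
    }
}

fn main() {
    // Placeholder values; a real report would read these from the kernel VMM.
    let regs = GuestRegs {
        rip: 0xffff_ffff_8000_1000,
        rsp: 0,
        rflags: 0x2,
        cr0: 0x8005_0033,
        cr3: 0x1000,
        cr4: 0x6f0,
    };
    println!("{regs}");
}
```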