vm/vmm: vmd unrecoverably deadlocks

dvyukov commented 5 years ago

The OpenBSD instance stopped working after a syz-ci restart because all vmd VMs are stuck in a dead, unrecoverable state. Creating new VMs always fails:

2018/09/21 01:56:52 failed to create instance: vmm exited
vmctl: start vm command failed: Operation already in progress
2018/09/21 01:57:41 failed to create instance: vmm exited
vmctl: start vm command failed: Operation already in progress
2018/09/21 01:57:55 failed to create instance: vmm exited
vmctl: start vm command failed: Operation already in progress
2018/09/21 01:58:07 failed to create instance: vmm exited

vmctl stop -f does not kill them:

35# ps axu | grep vmd  
_vmd     77028 98.8  0.5 2099972 157648 ??  Rp/2  Thu06AM  1188:59.49 vmd: ci-openbsd-main-1 (vmd)
_vmd     48147 98.2  0.5 2099988 149804 ??  Rp/2  Thu07AM  1082:45.19 vmd: ci-openbsd-main-0 (vmd)
_vmd     66978 97.9  0.3 2099884 90192 ??  Rp/3  Thu12PM  817:11.92 vmd: ci-openbsd-main-2 (vmd)
35# vmctl stop ci-openbsd-main-0 -f
vmctl: requested to terminate vm 12
35# vmctl stop ci-openbsd-main-0 -f 
vmctl: requested to terminate vm 12
35# vmctl stop ci-openbsd-main-0 -f 
vmctl: requested to terminate vm 12
35# vmctl stop ci-openbsd-main-0 -f 
vmctl: requested to terminate vm 12
35# vmctl stop ci-openbsd-main-0 -f 
vmctl: requested to terminate vm 12
35# ps axu | grep vmd               
_vmd     77028 97.7  0.5 2099972 157648 ??  Rp/2  Thu06AM  1192:23.73 vmd: ci-openbsd-main-1 (vmd)
_vmd     48147 97.7  0.5 2099988 149804 ??  Rp/2  Thu07AM  1086:09.81 vmd: ci-openbsd-main-0 (vmd)
_vmd     66978 98.5  0.3 2099884 90192 ??  Rp/3  Thu12PM  820:36.52 vmd: ci-openbsd-main-2 (vmd)

@mptre @blackgnezdo

dvyukov commented 5 years ago

Running kill -9 on these vmd processes helped. We should probably automate this killing somehow or, of course, fix vmd.

dvyukov commented 5 years ago

It got stuck again in the "vmctl: start vm command failed: Operation already in progress" loop.

blackgnezdo commented 5 years ago

@pdvyas, is this commit likely to fix the issue here? https://github.com/openbsd/src/commit/b9c4a1f5649a1ac6b055dfda173622d298750a92

pdvyas commented 5 years ago

That fix should be unrelated. That commit addresses losing access to a vm when sending (vm migration) fails.

There is a thread on openbsd-tech about this: https://marc.info/?l=openbsd-tech&m=153817959129638&w=2

blackgnezdo commented 5 years ago

This appears to be the same issue that I documented fairly extensively in the "vmd losing VMs" thread on openbsd-tech. If I had to guess, some IPC between the different vmd processes goes awry. Debugging is fairly frustrating because I have not been able to find any IPC tracing mechanisms.

I think an automated daily reboot is a fine idea, since syz-bot recovers its state from disk in short order.

dvyukov commented 5 years ago

In the 2 cases that I observed, killing a vmd process helped. So if we could map a vmctl instance to its vmd process PID, we could kill that vmd when closing a VM. A simpler version of this is to run "killall -9 vmd" every n hours.
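
A minimal sketch of the name-to-PID mapping, assuming the process title always has the form "vmd: NAME (vmd)" as in the ps output earlier in this thread. The helper name vmd_pids_for is hypothetical and not part of vmd, vmctl, or syzkaller; it only filters ps-style text and kills nothing by itself.

```shell
# vmd_pids_for NAME
# Reads `ps axu`-style lines on stdin and prints the PID (second column) of
# every process whose title is exactly "vmd: NAME (vmd)".
# Assumption: the title format matches the ps output shown in this thread.
vmd_pids_for() {
    awk -v title=" vmd: $1 (vmd)" '
        {
            sub(/[ \t]+$/, "")            # drop trailing whitespace
            n = length(title)
            # Match only when the line ends with the expected title, so
            # "ci-openbsd-main-1" does not match "ci-openbsd-main-10".
            if (length($0) >= n && substr($0, length($0) - n + 1) == title)
                print $2
        }'
}
```

It could be used along the lines of `for pid in $(ps axu | vmd_pids_for ci-openbsd-main-0); do kill -9 "$pid"; done`. Matching on the full title, rather than grepping for "vmd", avoids hitting the grep process itself or a VM whose name merely shares a prefix.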

google / syzkaller

vm/vmm: vmd unrecoverably deadlocks #740