Fix a race which caused containerd to stop responding.

wrouesnel commented 7 years ago

When a virtual machine error occurs leading to a VM shutdown, there is no explicit denial of access to the vm.Hub channel used to send commands, but once the VmContext.loop() function exited, nothing existed handling and cancelling commands on that channel, nor notifying processes to stop writing to it.

This meant that runv could reliably trigger Docker/supervisord lock-ups when a problem occurred with the VM startup while container commands were being waited on (most commonly: docker run <some image> sleep 10, but having an incorrect kernel configured in docker).

This commit fixes the problem by removing the duplication of channel logic between Vm and VmContext. Vm.SendGenericOperation is made internal, and calls to a wrapper function on VmContext.SendVmEvent to enqueue messages to the context. This separation enables a check for an active event handler to be performed before enqueuing a new message - and immediately returns an error if the message cannot be queued allowing the caller to detect a situation where the VM has died.

Closes #423.

wrouesnel commented 7 years ago

Note: this probably fixes a whole host of containerd not responding related issues when there are VM state changes, as the problem is generic to any changes in VMs being started and stopped.

laijs commented 7 years ago

SendGenericOperation() is going to be removed, could you change vm.Kill() to call the v.ctx.poweroffVM() directly please?

quote from your log in the #423

goroutine 45 [chan receive]:
github.com/hyperhq/runv/hypervisor.(*Vm).GenericOperation(0xc4202eae00, 0x997861, 0xb, 0x9e0820, 0xc420350a20, 0x2, 0x2, 0xc42013f658, 0xc42013f501)
    /home/will/src/go/src/github.com/hyperhq/runv/hypervisor/vm.go:739 +0xaa
github.com/hyperhq/runv/hypervisor.(*Vm).Kill(0xc4202eae00)
    /home/will/src/go/src/github.com/hyperhq/runv/hypervisor/vm.go:324 +0xa2
github.com/hyperhq/runv/supervisor.createHyperPod(0xe0e200, 0xc4201ee0c0, 0xc420161520, 0x1, 0x80, 0x200, 0x8ae800, 0xc4201616a8)
    /home/will/src/go/src/github.com/hyperhq/runv/supervisor/hyperpod.go:503 +0xa9d

wrouesnel commented 7 years ago

Done. The changes to use SendVmEvent() are still necessary to giving a consistent VM operating state though, since they prevent blocking in a case where the event handler is detached.

hyperhq / runv

Fix a race which caused containerd to stop responding. #424