vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0

VMs with orphaned attachments won't boot if attachment is already on new VM #515

Closed justinclayton closed 8 years ago

justinclayton commented 8 years ago

When container cluster managers are in play, containers that die suddenly due to hardware failure are rescheduled on another Docker host VM before the original VM is ready for use again. Unfortunately, in this recovery scenario, the VM that was powered down never gets a chance to detach the VMDKs it was using as Docker volumes. It is left with an orphaned attachment that is no longer required and that actually prevents it from booting back up if that VMDK is already attached elsewhere.

My setup:

Here's the repro:

# /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py ls
Volume  Datastore        Created By VM  Created                   Attached To VM  Policy  Capacity  Used
------  ---------------  -------------  ------------------------  --------------  ------  --------  --------
vol1    datastore1       vm-1           Wed Jul  6 18:08:43 2016  vm-1            N/A     10.00GB   252.00MB
* An error was received from the ESX host while powering on VM vm-1.
* Failed to start the virtual machine.
* Module Disk power on failed. 
* Cannot open the disk '/vmfs/volumes/57507243-ad6492fd-d63c-ecf4bbc7e390/dockvols/vol1.vmdk' or one of the snapshot disks it depends on.
* Failed to lock the file

The two workarounds to this currently are:

After employing one of the above steps you will be able to power on vm-1 successfully.

msterin commented 8 years ago

@justinclayton - thanks for the detailed report! Related to #92 (which is low priority since Docker fixed unmount in container 'rm -f'). @pdhamdhere - the suggested fix is to listen to VM power events and auto-detach all volumes in dockvols on power off.
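To make the suggested fix concrete, here is a minimal sketch of a power-event handler that releases every attachment held by a VM when it powers off. It is an illustration only, not the plugin's actual code: the `Volume` class, the `on_power_event` callback, and the `"poweredOff"` state string are all hypothetical names, and a real implementation would subscribe to host events and reconfigure the VM rather than update an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    """Hypothetical in-memory record of a volume under dockvols."""
    name: str
    attached_to: Optional[str] = None  # name of the VM holding the disk, if any


def detach_all_on_power_off(vm_name: str, volumes: List[Volume]) -> List[str]:
    """Clear every attachment held by vm_name and return the detached volume names."""
    detached = []
    for vol in volumes:
        if vol.attached_to == vm_name:
            vol.attached_to = None
            detached.append(vol.name)
    return detached


def on_power_event(vm_name: str, new_state: str, volumes: List[Volume]) -> List[str]:
    """Hypothetical event callback: act only on power-off transitions."""
    if new_state != "poweredOff":
        return []
    return detach_all_on_power_off(vm_name, volumes)


volumes = [Volume("vol1", attached_to="vm-1"), Volume("vol2", attached_to="vm-2")]
released = on_power_event("vm-1", "poweredOff", volumes)
print(released)
```

With this approach, vm-1 would no longer carry a stale reference to vol1 when it is powered back on, while volumes held by other VMs are left untouched.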

govint commented 8 years ago

From the description, it looks like the bug happened in step 4 of the repro. If the volume was in use by VM1 when VM1 was forcibly shut down, then the volume plugin should have disallowed VM2 from attaching the volume.

The correct approach would have been to:

1. Determine that the volume is attached to another VM.

2. Check whether that other VM is powered on and, if it is not:

   - remove the volume from that VM's (VM1) configuration, and

   - attach the volume to the requesting VM (VM2).

It's wrong to have allowed the attach to VM2 when the volume is already attached to another VM and we have no idea whether that VM is using it or not.
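The attach flow described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the plugin's implementation: `Volume`, `attach`, `VolumeInUseError`, and the `power_state` dictionary are hypothetical stand-ins for whatever the plugin would query on the ESX host. The sketch reclaims a volume only when the current owner is explicitly known to be powered off, and refuses the attach in every other case.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class VolumeInUseError(Exception):
    """Raised when the volume may still be in use by another VM."""


@dataclass
class Volume:
    """Hypothetical record of a volume and the VM it is attached to."""
    name: str
    attached_to: Optional[str] = None


def attach(volume: Volume, requester: str, power_state: Dict[str, str]) -> None:
    """Attach volume to requester, reclaiming it from a powered-off owner if needed."""
    owner = volume.attached_to
    if owner is not None and owner != requester:
        # Reclaim only when the owner is known to be off; an unknown or
        # running owner might still be using the disk, so refuse the attach.
        if power_state.get(owner) != "poweredOff":
            raise VolumeInUseError(f"{volume.name} is attached to {owner}")
        volume.attached_to = None  # drop the stale attachment from the down VM
    volume.attached_to = requester


states = {"vm-1": "poweredOff", "vm-2": "poweredOn"}
vol1 = Volume("vol1", attached_to="vm-1")
attach(vol1, "vm-2", states)
print(vol1.attached_to)
```

Treating an unknown power state the same as a running VM is a deliberately conservative choice here, in line with the concern about not knowing whether the other VM is using the volume.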

govint commented 8 years ago

I checked the provisions for alarms and events in VC, and apparently we can create a VM power-state alarm that runs a script "on the VC server" or runs a method (from a list that's in VC). It's perhaps better for the plugin to handle this at attach time: when a VM asks to attach a volume, check whether the volume is attached to a live VM (which can be queried) and, if that VM is down, detach the volume from it before proceeding.

govint commented 8 years ago

Closed via #573