vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0

Error creating volumes #1969

Closed AndrewSav closed 6 years ago

AndrewSav commented 6 years ago

Occasionally a VM would get screwed up. All other VMs are working fine and you can create a volume from them, but from this specific VM you no longer can:

I ran docker volume create --driver=vsphere --name=myvolume -o size=10gb

I'm getting:

2017-11-01 07:57:55.909843423 +0000 UTC [INFO] Attaching volume and creating filesystem fstype=ext4 name=myvolume
2017-11-01 07:57:57.806053767 +0000 UTC [WARNING] Get device path failed for unit# %s @ PCI slot %s: 01216Error="open /sys/bus/pci/slots/1216/address: no such file or directory"
2017-11-01 07:57:57.806110812 +0000 UTC [ERROR] Failed to get device path volDev={0 1216} err="Device not found"
2017-11-01 07:57:57.806135583 +0000 UTC [ERROR] Could not find attached device, removing the volume name=myvolume error="Device not found"

Reboot does not help. Removing and re-installing the plugin does not help. I have to re-provision VM to get it working again.

Can someone help troubleshoot? From the excerpt above, I think that 1216 is wrong, since there is no PCI slot like this. Where does this 1216 come from?
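
The warning in the log suggests the plugin turns the returned slot number into a sysfs lookup along these lines. This is an illustrative sketch only, not the plugin's actual code; the helper name is made up:

```go
package main

import (
	"fmt"
	"os"
)

// slotAddressPath builds the sysfs path the plugin's warning points at;
// the "address" file holds the PCI bus address of the device in that slot.
func slotAddressPath(slot string) string {
	return fmt.Sprintf("/sys/bus/pci/slots/%s/address", slot)
}

func main() {
	p := slotAddressPath("1216")
	if _, err := os.Stat(p); err != nil {
		// On the affected VM this fails, matching the plugin's
		// "no such file or directory" warning: slot 1216 simply
		// does not exist under /sys/bus/pci/slots.
		fmt.Println("lookup failed:", err)
	}
}
```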

AndrewSav commented 6 years ago

Also, despite the fact that I put "LogLevel": "debug" into /etc/docker-volume-vsphere.conf, the plugin starts with a log level of info:

Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf" group=root

Note that when I change "LogPath" in the same file, it actually works and the log path changes, but the log level does not, for some reason.
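
For reference, a minimal config of the shape described in this comment (field names as quoted above; the LogPath value is a made-up example, and whether LogLevel is honored is exactly what's in question here):

```json
{
  "LogLevel": "debug",
  "LogPath": "/var/log/docker-volume-vsphere.log"
}
```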

andrewsav-bt commented 6 years ago

Here is the corresponding part from the server logs (well, not exactly corresponding, but I trust they are the same except for the date):

11/01/17 22:36:55 505081 [MainThread] [INFO   ] Connecting to the local Service Instance as 'dcui' 
11/01/17 22:36:55 505081 [MainThread] [INFO   ] Started new thread : 81931120 with target <function execRequestThread at 0x6c233994> and args (18, 903899, '{"cmd":"get","details":{"Name":"myvolume"},"version":"2"}')
11/01/17 22:36:55 505081 [Thread-667] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
11/01/17 22:36:55 505081 [Thread-667] [INFO   ] Connecting to the local Service Instance as 'dcui' 
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] executeRequest 'get' completed with ret={u'Error': 'Volume myvolume not found (file: /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk)'}
11/01/17 22:36:56 505081 [MainThread] [INFO   ] Started new thread : 81931120 with target <function execRequestThread at 0x6c233994> and args (22, 903899, '{"cmd":"create","details":{"Name":"myvolume","Opts":{"fstype":"ext4","size":"10gb"}},"version":"2"}')
11/01/17 22:36:56 505081 [Thread-668] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] *** createVMDK: /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk opts={u'size': u'10gb', u'fstype': u'ext4'} vm_name=<vm_name_replaced> vm_uuid=423261b0-c9c9-d407-0d4b-a54f5f747c4b tenant_uuid=11111111-1111-1111-1111-111111111111 datastore_url=/vmfs/volumes/542b6697-e3f022ab-b393-9cb6549b20b8
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] No access control, skipping volumes tracing in auth DB
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] executeRequest 'create' completed with ret=None
11/01/17 22:36:56 505081 [MainThread] [INFO   ] Started new thread : 81931120 with target <function execRequestThread at 0x6c233994> and args (18, 903899, '{"cmd":"attach","details":{"Name":"myvolume"},"version":"2"}')
11/01/17 22:36:56 505081 [Thread-669] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] *** disk_attach: VMDK /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk to VM '<vm_name_replaced>' , bios uuid = 423261b0-c9c9-d407-0d4b-a54f5f747c4b, VC uuid=503275c6-e840-0e8e-75f3-85c8e72defd8)
11/01/17 22:36:56 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] Found vm name='<vm_name_replaced>'
11/01/17 22:36:57 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] Attaching /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk as independent_persistent
11/01/17 22:36:57 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] Disk /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk successfully attached. controller pci_slot_number=1216, disk_slot=0
11/01/17 22:36:57 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] executeRequest 'attach' completed with ret={'ControllerPciSlotNumber': '1216', 'Unit': '0'}
11/01/17 22:36:57 505081 [MainThread] [INFO   ] Started new thread : 81931120 with target <function execRequestThread at 0x6c233994> and args (22, 903899, '{"cmd":"detach","details":{"Name":"myvolume"},"version":"2"}')
11/01/17 22:36:57 505081 [Thread-670] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] *** disk_detach: VMDK /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk to VM '<vm_name_replaced>' , bios uuid = 423261b0-c9c9-d407-0d4b-a54f5f747c4b, VC uuid=503275c6-e840-0e8e-75f3-85c8e72defd8)
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] Found vm name='<vm_name_replaced>'
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] Disk detached /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] executeRequest 'detach' completed with ret=None
11/01/17 22:36:58 505081 [MainThread] [INFO   ] Started new thread : 81931120 with target <function execRequestThread at 0x6c233994> and args (18, 903899, '{"cmd":"remove","details":{"Name":"myvolume"},"version":"2"}')
11/01/17 22:36:58 505081 [Thread-671] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] *** removeVMDK: /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] *** cleanVMDK: /vmfs/volumes/<datastore_name_replaced>/dockvols/_DEFAULT/myvolume.vmdk
11/01/17 22:36:58 505081 [<vm_name_replaced>-<datastore_name_replaced>._DEFAULT.myvolume] [INFO   ] executeRequest 'remove' completed with ret=None
andrewsav-bt commented 6 years ago

The .vmx file of the affected machine:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "9"
vmci0.present = "TRUE"
floppy0.present = "FALSE"
numvcpus = "2"
memSize = "16384"
sched.cpu.units = "mhz"
sched.cpu.latencySensitivity = "normal"
powerType.suspend = "soft"
tools.upgrade.policy = "upgradeAtPowerCycle"
scsi0.virtualDev = "lsilogic"
scsi0.present = "TRUE"
ide1:0.startConnected = "FALSE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.fileName = "CD/DVD drive 0"
ide1:0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "<vm_name_replaced>.vmdk"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
displayName = "<vm_name_replaced>"
annotation = "Docker ingress node built on 2017-10-31T04:40:23Z"
guestOS = "other"
toolScripts.afterPowerOn = "TRUE"
toolScripts.afterResume = "TRUE"
toolScripts.beforeSuspend = "TRUE"
toolScripts.beforePowerOff = "TRUE"
tools.syncTime = "TRUE"
tools.guest.desktop.autolock = "FALSE"
messageBus.tunnelEnabled = "FALSE"
uuid.bios = "42 32 61 b0 c9 c9 d4 07-0d 4b a5 4f 5f 74 7c 4b"
vc.uuid = "50 32 75 c6 e8 40 0e 8e-75 f3 85 c8 e7 2d ef d8"
pciBridge0.present = "TRUE"
svga.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
hpet0.present = "true"
nvram = "<vm_name_replaced>.nvram"
virtualhw.productcompatibility = "hosted"
migrate.hostlog = "<vm_name_replaced>-1387a4da.hlog"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.minSize = "0"
sched.mem.shares = "normal"
ethernet0.virtualDev = "vmxnet3"
ethernet0.networkName = "Admin Network"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:b2:5a:d8"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
ethernet1.networkName = "Admin Network"
ethernet1.addressType = "vpx"
ethernet1.generatedAddress = "00:50:56:b2:38:b1"
ethernet1.uptCompatibility = "TRUE"
ethernet1.present = "TRUE"
ethernet2.virtualDev = "vmxnet3"
ethernet2.networkName = "Admin Network"
ethernet2.addressType = "vpx"
ethernet2.generatedAddress = "00:50:56:b2:4c:57"
ethernet2.uptCompatibility = "TRUE"
ethernet2.present = "TRUE"
ethernet3.virtualDev = "vmxnet3"
ethernet3.networkName = "Admin Network"
ethernet3.addressType = "vpx"
ethernet3.generatedAddress = "00:50:56:b2:0f:18"
ethernet3.uptCompatibility = "TRUE"
ethernet3.present = "TRUE"
ethernet4.virtualDev = "vmxnet3"
ethernet4.networkName = "Admin Network"
ethernet4.addressType = "vpx"
ethernet4.generatedAddress = "00:50:56:b2:09:c5"
ethernet4.uptCompatibility = "TRUE"
ethernet4.present = "TRUE"
sched.swap.derivedName = "/vmfs/volumes/542b6697-e3f022ab-b393-9cb6549b20b8/<vm_name_replaced>/<vm_name_replaced>-2cb3d608.vswp"
replay.supported = "FALSE"
replay.filename = ""
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
pciBridge4.pciSlotNumber = "21"
pciBridge5.pciSlotNumber = "22"
pciBridge6.pciSlotNumber = "23"
pciBridge7.pciSlotNumber = "24"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "160"
ethernet1.pciSlotNumber = "192"
ethernet2.pciSlotNumber = "224"
ethernet3.pciSlotNumber = "256"
ethernet4.pciSlotNumber = "1184"
vmci0.pciSlotNumber = "32"
vmci0.id = "1601469515"
monitor.phys_bits_used = "40"
vmotion.checkpointFBSize = "4194304"
cleanShutdown = "TRUE"
softPowerOff = "FALSE"
extendedConfigFile = "<vm_name_replaced>.vmxf"
scsi1.pciSlotNumber = "1216"
scsi1.virtualDev = "pvscsi"
scsi1.present = "TRUE"
config.readOnly = "FALSE"
uuid.location = "56 4d f2 67 9d db 93 b4-51 5d 88 b0 d4 a5 45 fa"
scsi1.sasWWID = "50 05 05 60 c9 c9 d5 00"
andrewsav-bt commented 6 years ago

Not sure if this is relevant: https://kb.vmware.com/s/article/2047927?language=en_US

andrewsav-bt commented 6 years ago

Okay, forget what I said about it happening intermittently. I just rebuilt a VM from scratch and it's happening on the new VM. My suspicion is that this is because the machine has 5 network adapters, so the SCSI controller gets pushed to a secondary PCI bus, and the plugin does not seem to be able to cope with that.

govint commented 6 years ago

@AndrewSav can you say what OS and OS version you're using in the VM? Let me try a local repro and get back to you.

AndrewSav commented 6 years ago

@govint, thank you for this; the OS is RancherOS 1.1.0.

govint commented 6 years ago

Ok, the ESX host service that's attaching the volumes returns the PCI slot number as-is, and the guest uses it to figure out which device to mount. In this case the device is on a secondary PCI bus, and this needs to be fixed in the ESX service, since only it can figure out which bridge the device is attached to and return the slot number relative to that. I'll post a fix by tomorrow.

govint commented 6 years ago

@AndrewSav, on the VM that's showing the issue, can you please run ls /sys/bus/pci/slots and ls /sys/bus/pci/devices/0000:00:16.1 and post the results here?

FWIW, the entries under /sys/bus/pci/slots already represent devices on the secondary PCI buses, but it looks like the encoding done by vSphere isn't understood when a device sits under a non-zero function of the PCI bridge. In Ubuntu and Photon OS we have PCIe bridges and the PVSCSI adapter shows up on function 0. I just want to check how the PVSCSI controller shows up in RancherOS.
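
The slot numbers in the .vmx above are consistent with an encoding where the low bits carry the function-0 slot number and the higher bits select the bridge function (ethernet4 = 1184 = 1024 + 160, scsi1 = 1216 = 1024 + 192, while the plain devices sit at 160, 192, 224, 256). A hedged sketch of such a decode follows; it is inferred from the data posted in this thread and is not the project's actual fix:

```go
package main

import "fmt"

// decodeSlot splits a .vmx pciSlotNumber into the PCI bridge function it
// rides on and the base slot number for function 0 of that bridge. The bit
// layout is a guess inferred from the values in this thread.
func decodeSlot(slot int) (function, base int) {
	function = slot >> 10 // bits 10+ appear to select the bridge function
	base = slot & 0x3FF   // bits 0-9 appear to be the function-0 slot
	return
}

func main() {
	for _, s := range []int{160, 1184, 1216} {
		f, b := decodeSlot(s)
		fmt.Printf("vmx slot %4d -> bridge function %d, base slot %d\n", s, f, b)
	}
}
```

Under this reading, scsi1 (1216) lands on function 1 of the bridge whose function-0 slot is 192, i.e. under 0000:00:16.1, which matches the sysfs listing posted later in this thread.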

govint commented 6 years ago

I've made the fix in the ESX service to handle a PVSCSI controller on the secondary PCI buses, but it also needs a change in how the client plugin locates the device under the appropriate PCI device (a bridge, in this case). I will post a PR once the changes are complete.

AndrewSav commented 6 years ago

@govint

[rancher@abigar ~]$  ls /sys/bus/pci/slots
160  162  164  166  192  194  196  198  224  226  228  230  256  258  260  262  32   34   36   38   40   42   44   46   48   50   52   54   56   58   60   62
161  163  165  167  193  195  197  199  225  227  229  231  257  259  261  263  33   35   37   39   41   43   45   47   49   51   53   55   57   59   61   63
[rancher@abigar ~]$ ls /sys/bus/pci/devices/0000:00:16.1
0000:00:16.1:pcie001      class                     device                    enable                    local_cpus                numa_node                 rescan                    subsystem_device
0000:00:16.1:pcie004      config                    dma_mask_bits             firmware_node             modalias                  pci_bus                   reset                     subsystem_vendor
0000:0c:00.0              consistent_dma_mask_bits  driver                    irq                       msi_bus                   power                     resource                  uevent
broken_parity_status      d3cold_allowed            driver_override           local_cpulist             msi_irqs                  remove                    subsystem                 vendor
[rancher@abigar ~]$
govint commented 6 years ago

@AndrewSav, thanks for posting the requested data. I've made a PR with the fixes that allow the client to figure out the position of the device. It should get reviewed and merged soon.

govint commented 6 years ago

Closed via #1978

AndrewSav commented 6 years ago

@govint thank you. When is it available via docker hub?

shuklanirdesh82 commented 6 years ago

@AndrewSav It will be part of the November release; we are shooting for a release date of 11/30 for vDVS 0.19.

I shall update this issue when the vDVS 0.19 release is available for you to try out.

shuklanirdesh82 commented 6 years ago

@AndrewSav The vDVS 0.19 release is out; please try out the deliverables from https://github.com/vmware/docker-volume-vsphere/releases/tag/0.19

andrewsav-bt commented 6 years ago

@shuklanirdesh82 @govint Guys, I just tested this on the VM that used to have the problem, and it's now working fine. Thank you for fixing this!

govint commented 6 years ago

@AndrewSav Thanks for confirming the fix. This was indeed something that needed to be addressed in the way we detect the device on the Docker host.