vmware-archive / vsphere-flocker-driver

vSphere Flocker driver

Not able to deploy Docker containers #7

Closed jm66 closed 8 years ago

jm66 commented 8 years ago

Hi @pratikvgupta, https://github.com/vmware/vsphere-flocker-driver/pull/6 was just an error I caught last week, but unfortunately I still haven't been able to properly deploy a container. There seems to be a big lag between the vSphere plugin creating the disk in the datastore, presenting it to the VM, and responding to the Docker engine, so the request times out. We have many VMs (1.4k+) and I see constant scans looking for Flocker disks; I'm wondering whether that behaviour is related to what I'm seeing.

Let me describe our environment a bit:

vCenter

Full name    : VMware vCenter Server 5.5.0 build-2442329
Version      : 5.5.0
OS           : win32-x64

Datastore

Type                  : NFS, NexentaStor
Capacity              : 26865.8909149 GiB
Free Space            : 25369.5726967 GiB
Uncommitted           : 953.775393963 GiB
Provisioned           : 2450.09361219 GiB
Hosts                 : 3
Virtual Machines      : 4

Flocker VM

Guest OS              : Ubuntu Linux (64-bit) 14.04.3 LTS
Version               : vmx-08 (No upgrade scheduled)
State                 : poweredOn
VMware Tools          : guestToolsUnmanaged
CPU/Mem HotAdd        : Enabled/Enabled
CPU & Memory          : 2, 3072 MB
Provisioned Storage   : 268.00 GiB on 4 Disk(s)
Disk(s)               : Hard disk 1 | 40.0 GiB | [NFS-DS] 1601T-flocker-manager/1601T-flocker-manager.vmdk | Thin provisioned
                        Hard disk 2 | 76.0 GiB | [NFS-DS] FLOCKER/b0a29ade-bce8-4db0-b6e9-b03b6c2f3fda.vmdk | Thin provisioned
                        Hard disk 3 | 76.0 GiB | [NFS-DS] FLOCKER/1dd15ae6-995f-436d-885e-c59a642dfb1b.vmdk | Thin provisioned
                        Hard disk 4 | 76.0 GiB | [NFS-DS] FLOCKER/4470c983-7d4a-4f55-9ba6-09f9f50aefd2.vmdk | Thin provisioned

OS packages

dpkg -l | egrep " open-vm-tools|scsitools|sg3-utils"
ii  open-vm-tools                       2:9.4.0-1280544-5ubuntu6.2       amd64        Open VMware Tools for virtual machines hosted on VMware (CLI)
ii  scsitools                           0.12-2.2ubuntu1                  amd64        Collection of tools for SCSI hardware management
ii  sg3-utils                           1.36-1ubuntu1                    amd64        utilities for devices using the SCSI command set

OS services

service flocker-dataset-agent status
flocker-dataset-agent start/running, process 1129

service flocker-docker-plugin status
flocker-docker-plugin start/running, process 905

service flocker-control status
flocker-control start/running, process 1101

service flocker-container-agent status
flocker-container-agent start/running, process 1496

service flocker-control status
flocker-control start/running, process 1101

agent.yml

"version": 1
"control-service":
   "hostname": "<hostname>"
   "port": 4524

dataset:
  backend: "vsphere_flocker_plugin"
  vc_ip: "<IP>"
  username: "flocker"       # Admin privileges
  password: "<PWD>"
  datacenter_name: "<DC>"           # VC datacenter name
  datastore_name: "<NFS-DS>"         # VC datastore name as above

The problem here is that when I run docker run -v apples:/data --volume-driver flocker busybox sh -c "echo hello > /data/file.txt" I get timeouts from the dataset agent. I'm not quite sure yet how the vsphere-flocker-driver works but I'd appreciate any guidance from your end.

docker run -v apples:/data --volume-driver flocker busybox sh -c "echo hello > /data/file.txt"
Error response from daemon: Cannot start container 804860625a45be50a58ba76fe75e003fd0546f462596cc2ff1e7b1c8972ab133: Timed out waiting for dataset to mount.
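
For reference, the volume step of this flow can be exercised on its own with docker volume create (available in Docker 1.9+), which may help tell whether the latency sits in the driver's create/attach path or in the mount itself, depending on how much work the plugin defers to mount time:

docker volume create -d flocker --name apples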

Looking at the flocker-docker-plugin log, I get messages like these:

{"task_uuid": "7cc42fc1-eabd-43a8-9d35-f3d824959006", "error": false, "timestamp": 1454000404.950405, "message": "Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7fbf893883f8>", "message_type": "twisted:log", "task_level": [1]}
{"request_body": null, "url": "https://f-control:4523/v1/state/datasets", "timestamp": 1454000405.949854, "action_status": "started", "task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "action_type": "flocker:apiclient:http_request", "method": "GET", "task_level": [3, 4, 116, 1]}
{"task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "error": false, "timestamp": 1454000405.954233, "message": "Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7fbf88102dd0>", "message_type": "twisted:log", "task_level": [3, 4, 116, 3]}
{"task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "task_level": [3, 4, 116, 4], "action_type": "flocker:apiclient:http_request", "response_body": [], "timestamp": 1454000405.995377, "response_code": 200, "action_status": "succeeded"}
{"timestamp": 1454000405.995816, "task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "message_type": "flocker:common:loop_until:iteration", "result": "None", "task_level": [3, 4, 117]}
{"task_uuid": "7609d3f5-f5bb-4e24-a992-e6e9f25b2767", "error": false, "timestamp": 1454000405.996877, "message": "Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7fbf88102dd0>", "message_type": "twisted:log", "task_level": [1]}

The flocker-control log looks like:

{"timestamp": 1454000464.610519, "task_uuid": "0772c67c-910b-4f68-82e4-f9e4e0fffb3a", "message_type": "flocker-control:persistence:unchanged-deployment-not-saved", "task_level": [1]}
{"timestamp": 1454000464.813629, "task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "action_type": "eliot:remote_task", "action_status": "started", "task_level": [3, 4, 228, 2, 1]}
{"task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "task_level": [3, 4, 228, 2, 2, 1], "action_type": "api:request", "timestamp": 1454000464.814184, "method": "GET", "action_status": "started", "request_path": "/v1/state/datasets"}
{"task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "task_level": [3, 4, 228, 2, 2, 3, 1], "action_type": "api:json_request", "timestamp": 1454000464.81474, "json": {}, "action_status": "started"}
{"task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "code": 200, "task_level": [3, 4, 228, 2, 2, 3, 2], "action_type": "api:json_request", "timestamp": 1454000464.815296, "json": [], "action_status": "succeeded"}
{"timestamp": 1454000464.815901, "task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "action_type": "api:request", "action_status": "succeeded", "task_level": [3, 4, 228, 2, 2, 4]}
{"timestamp": 1454000464.816279, "task_uuid": "20ce1981-9eff-4e9f-8d73-6bd339f33e15", "action_type": "eliot:remote_task", "action_status": "succeeded", "task_level": [3, 4, 228, 2, 3]}
{"task_uuid": "8ab9373b-e0b4-4718-bc3b-7f31b5560966", "error": false, "timestamp": 1454000464.817009, "message": "\"142.150.128.227\" - - [28/Jan/2016:17:01:04 +0000] \"GET /v1/state/datasets HTTP/1.1\" 200 2 \"-\" \"-\"", "message_type": "twisted:log", "task_level": [1]}
{"timestamp": 1454000465.610146, "task_uuid": "19dea4c3-3691-42cf-aa7f-2154f7c4ae2b", "message_type": "flocker-control:persistence:unchanged-deployment-not-saved", "task_level": [1]}

Querying datasets via Flocker API:

[{"deleted": true, "dataset_id": "7a7bb613-dd76-4309-98bd-df2dd464b281", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "apples"}}, {"deleted": true, "dataset_id": "2ae3bb0b-66f5-4ee7-a2cf-b73e020783ec", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "apples"}}, {"deleted": false, "dataset_id": "4470c983-7d4a-4f55-9ba6-09f9f50aefd2", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "volumename"}}, {"deleted": true, "dataset_id": "4d95228e-f1dd-4d15-9929-1783fb1f908c", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "apples"}}, {"deleted": false, "dataset_id": "b0a29ade-bce8-4db0-b6e9-b03b6c2f3fda", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "apples"}}, {"deleted": false, "dataset_id": "1dd15ae6-995f-436d-885e-c59a642dfb1b", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "fastvol", "clusterhq:flocker:profile": "gold"}}, {"deleted": true, "dataset_id": "bc114529-737b-413d-adb4-f9a41f6c170d", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "volumename"}}, {"deleted": true, "dataset_id": "31116cd1-16f4-434f-953c-13b492fd6273", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "f842ebfa84081d3ed44b59470422cf5f02563d5e3bd7386942632d4d810e1723"}}, {"deleted": true, "dataset_id": "c72c1dc1-44a3-46e4-849f-124d868568b2", "maximum_size": 80530636800, "primary": "b97b1c43-649e-4029-aafe-46b7acaba629", "metadata": {"name": "apples"}}]
pratikvgupta commented 8 years ago

Can you please share the "/var/log/flocker/vsphere.log"?

jm66 commented 8 years ago

Will do; I need to massage the log first since it stores passwords in plain text.

jm66 commented 8 years ago

Hi @pratikvgupta, I found something interesting here:

2016-02-02 12:27:37,646 - vsphere_blockdevice - DEBUG - _list_vsphere_volumes: {u'6000c29a4c05d81eac0de98b3483a568': <vsphere_flocker_plugin.vsphere_blockdevice.VsphereBlockDeviceVolume instance at 0x7f9195843c68>}
2016-02-02 12:27:37,658 - vsphere_blockdevice - DEBUG - _find_all_disk_devices: KNAME TYPE
fd0   disk
sda   disk
sdb   disk
sr0   rom

2016-02-02 12:27:37,658 - vsphere_blockdevice - DEBUG - _find_all_disk_devices: ['KNAME TYPE', 'fd0   disk', 'sda   disk', 'sdb   disk', 'sr0   rom', '']
2016-02-02 12:27:37,658 - vsphere_blockdevice - DEBUG - _find_all_disk_devices: ['fd0   disk', 'sda   disk', 'sdb   disk', 'sr0   rom']
2016-02-02 12:27:37,658 - vsphere_blockdevice - DEBUG - _find_all_disk_devices: ['/dev/fd0', '/dev/sda', '/dev/sdb']
2016-02-02 12:27:37,664 - vsphere_blockdevice - ERROR - get_device_path: Error occured for scsiinfo -s /dev/fd0
2016-02-02 12:27:37,669 - vsphere_blockdevice - DEBUG - get_device_path: 
Serial Number '6000c294f6b1c9592661c1f1e4ef7e47'

2016-02-02 12:27:37,669 - vsphere_blockdevice - DEBUG - get_device_path: '6000c294f6b1c9592661c1f1e4ef7e47'

2016-02-02 12:27:37,669 - vsphere_blockdevice - DEBUG - get_device_path: 6000c294f6b1c9592661c1f1e4ef7e47
2016-02-02 12:27:37,669 - vsphere_blockdevice - DEBUG - get_device_path: 6000c29a4c05d81eac0de98b3483a568
2016-02-02 12:27:37,674 - vsphere_blockdevice - DEBUG - get_device_path: 
Serial Number '6000c29a4c05d81eac0de98b3483a568'

2016-02-02 12:27:37,674 - vsphere_blockdevice - DEBUG - get_device_path: '6000c29a4c05d81eac0de98b3483a568'

2016-02-02 12:27:37,674 - vsphere_blockdevice - DEBUG - get_device_path: 6000c29a4c05d81eac0de98b3483a568
2016-02-02 12:27:37,675 - vsphere_blockdevice - DEBUG - get_device_path: 6000c29a4c05d81eac0de98b3483a568
2016-02-02 12:27:37,675 - vsphere_blockdevice - DEBUG - get_device_path: Found device path : /dev/sdb
2016-02-02 12:27:40,809 - vsphere_blockdevice - DEBUG - list_volumes: starting list volumes
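
For reference, the lookup in that excerpt can be reproduced by hand with the same tools the driver shells out to (lsblk, and scsiinfo from the scsitools package listed above); the serial reported for /dev/sdb should match the vSphere disk UUID 6000c29a4c05d81eac0de98b3483a568:

lsblk -d -o KNAME,TYPE
sudo scsiinfo -s /dev/sdb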

Also, listing volumes is taking well over 50 seconds (67s in the run below):

2016-02-02 12:28:48,080 - vsphere_blockdevice - DEBUG - _list_vsphere_volumes: {u'6000c29a4c05d81eac0de98b3483a568': <vsphere_flocker_plugin.vsphere_blockdevice.VsphereBlockDeviceVolume instance at 0x7f91958460e0>}
2016-02-02 12:28:48,080 - vsphere_blockdevice - DEBUG - list_volumes: vsphere list_volumes: [BlockDeviceVolume(blockdevice_id=u'6000c29a4c05d81eac0de98b3483a568', dataset_id=UUID('27addb92-572c-401f-8722-8f08b3122505'), attached_to=u'vm-6332', size=81604378624)]
2016-02-02 12:28:48,081 - vsphere_blockdevice - DEBUG - list_volumes: 1454434060.81
2016-02-02 12:28:48,081 - vsphere_blockdevice - DEBUG - list_volumes: Took 67.2716050148 seconds
2016-02-02 12:28:48,111 - vsphere_blockdevice - DEBUG - list_volumes: starting list volumes

How can I share the full log with you?

pratikvgupta commented 8 years ago

Can you upload the logs to some shared location like Dropbox? I am not sure if GitHub has a way to share logs.

I suspect the latency in listing volumes is due to the huge inventory of VMs, and this may be the reason for the timeouts while deploying containers. I am thinking of a solution that restricts the search for attached volumes to only the VMs provisioned on the datastore provided in the config; currently, it searches all VMs in the VC.
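
A minimal pyVmomi sketch of that idea, assuming an already-authenticated ServiceInstance si and the datacenter/datastore names from agent.yml (the function names here are illustrative, not part of the driver):

from pyVmomi import vim

def find_datastore(si, datacenter_name, datastore_name):
    # Walk the inventory root for the configured datacenter, then pick
    # the configured datastore by name. Assumes datacenters sit directly
    # under the root folder, which is the common case.
    content = si.RetrieveContent()
    for dc in content.rootFolder.childEntity:
        if isinstance(dc, vim.Datacenter) and dc.name == datacenter_name:
            for ds in dc.datastore:
                if ds.name == datastore_name:
                    return ds
    return None

def vms_on_datastore(si, datacenter_name, datastore_name):
    ds = find_datastore(si, datacenter_name, datastore_name)
    # Datastore.vm lists only the VMs with files on this datastore, so the
    # attached-volume scan no longer has to walk every VM in the inventory.
    return list(ds.vm) if ds is not None else []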

jm66 commented 8 years ago

Hi @pratikvgupta, just shared the log.

That makes sense; we have nearly 1,500 VMs in our environment. Yes, limiting the scope of the search sounds pretty good. Maybe using a ContainerView, or something more sophisticated with a PropertyCollector?
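
A rough pyVmomi sketch of that direction, assuming an already-authenticated ServiceInstance si: it builds a ContainerView over all VMs and uses the PropertyCollector to fetch a single property per VM in one round trip, instead of touching ~1,500 full VirtualMachine objects one at a time (names here are illustrative, not from the driver):

from pyVmomi import vim, vmodl

def collect_vm_property(si, prop="name"):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        # Traverse the view and pull only the requested property per VM.
        traversal = vmodl.query.PropertyCollector.TraversalSpec(
            name="traverseView", path="view", skip=False,
            type=vim.view.ContainerView)
        obj_spec = vmodl.query.PropertyCollector.ObjectSpec(
            obj=view, skip=True, selectSet=[traversal])
        prop_spec = vmodl.query.PropertyCollector.PropertySpec(
            type=vim.VirtualMachine, pathSet=[prop])
        filter_spec = vmodl.query.PropertyCollector.FilterSpec(
            objectSet=[obj_spec], propSet=[prop_spec])
        results = content.propertyCollector.RetrieveContents([filter_spec])
        return {oc.obj: oc.propSet[0].val for oc in results if oc.propSet}
    finally:
        view.DestroyView()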

Please let me know how I can help.