vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0

vFile volume driver is not fully initialized yet. #1960

Closed bartlenaerts closed 6 years ago

bartlenaerts commented 6 years ago

Hi,

I've installed vDVS as described in the documentation. This is working correctly.

docker volume create --driver=vsphere --name=SharedVol -o size=10gb
docker volume ls
DRIVER              VOLUME NAME
vsphere:latest      SharedVol@MSA_DS1

When I try to create a volume with vFile instead of vSphere I get an error:

docker volume create --driver=vfile --name=SharedVol1 -o size=10gb
Error response from daemon: create SharedVol1: VolumeDriver.Create: vFile volume driver is not fully initialized yet.

Both plugins are installed and enabled:

docker plugin ls
ID                  NAME                DESCRIPTION                           ENABLED
821d4601d65a        vsphere:latest      VMWare vSphere Docker Volume plugin   true
7aae40b225e5        vfile:latest        VMWare vFile Docker Volume plugin     true

Looks like I forgot to initialize the vFile driver. I've searched the documentation but can't find anything about this error or how to correctly configure the plugin. Can anybody help?

Thanks!

Bart

thiagolealassis commented 6 years ago

Is your docker in swarm mode? This is an issue for me; how can I use it without swarm?

luomiao commented 6 years ago

@thiagolealassis Hi, and yes, swarm mode is a necessary requirement for the vFile plugin. Is your environment restricted to a particular container orchestrator?

luomiao commented 6 years ago

@bartlenaerts Hi Bart, thanks for reporting this issue. As suggested by @thiagolealassis, could you please verify that your docker is in swarm mode? If so, can you paste the log at /var/log/vfile.log?

bartlenaerts commented 6 years ago

@luomiao Hi,

thanks for looking into this.

I'm running in swarm mode. We have a setup of 3 managers and 6 workers. All hosts have the vSphere and vFile plugins installed.

manager1

2017-10-30 15:49:43.46669919 +0000 UTC [INFO] Starting plugin driver=vfile log_level=info config="/etc/vfile.conf"
2017-10-30 15:49:43.468224536 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-10-30 15:49:45.468959877 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-10-30 15:49:45.469025026 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded).
2017-10-30 15:49:45.469078075 +0000 UTC [INFO] vFile plugin started version="vFile Volume Driver v0.2"
2017-10-30 15:49:45.469342679 +0000 UTC [INFO] Scheduling again after 2 seconds
2017-10-30 15:49:45.469470521 +0000 UTC [INFO] Started loading file server image
2017-10-30 15:49:45.470365338 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vfile.sock"
2017-10-30 15:49:47.469658453 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-10-30 15:49:49.470912421 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-10-30 15:49:49.471002649 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded). Attempts left: 20
2017-10-30 15:49:49.471024175 +0000 UTC [INFO] Scheduling again after 4 seconds
2017-10-30 15:49:53.472447961 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-10-30 15:49:53.567183475 +0000 UTC [INFO] Found 0 running or paused containers
2017-10-30 15:49:53.569472649 +0000 UTC [INFO] Discovered 0 volumes that may be in use.
2017-10-30 15:49:53.569532447 +0000 UTC [INFO] Refcounting successfully completed
2017-10-30 15:53:44.06466798 +0000 UTC [INFO] VolumeDriver Get: 83b6a0bcae897689e588998ba7cd25c86a698c0069b7efa8298efa874cd9ef42
2017-10-30 15:54:11.935853374 +0000 UTC [INFO] VolumeDriver Get: ee23d46de20bf7030d702207f2563f5b9c806867914c5751b0e605684046e1b5
2017-10-30 15:58:12.629964145 +0000 UTC [INFO] VolumeDriver Get: e7c3c19045d0a71a0695ae42f227e84ac1b1fcaba6da18aabfbdc5986c72eca6
2017-10-31 09:06:24.626061738 +0000 UTC [INFO] VolumeDriver Get: ed48485793f1f5621dbff92463b99ea5ad822c26fafbf205f3ff173bf3157e4d

manager2

2017-10-31 09:00:16.15312973 +0000 UTC [INFO] VolumeDriver Get: c6b9821c5e02c00de7885427fa5593cff5f94fcad41f80a4aaf8a4dd0c00630b
2017-10-31 09:00:43.88508694 +0000 UTC [INFO] VolumeDriver Get: 2583f6f93875845030f20b1a6fe2612e5800352d4d736ff5abef8cdb65d06678
2017-10-31 09:01:45.457576671 +0000 UTC [INFO] VolumeDriver Get: ae0c26458fcfdc4de3ed18f73f39fc3c293b347ea5d2b9e24dab14b18a34f306
2017-10-31 09:02:47.892570486 +0000 UTC [INFO] VolumeDriver Get: 1b0a3b97349b9dcfaf028e32cab8da2e27897c7a09f7d60dee2803492210236d
2017-10-31 09:03:48.60524559 +0000 UTC [INFO] VolumeDriver Get: 2caea1b5c8c31874e8d8826d6905fd2494f4b729e12bae662abe618be05da27c
2017-10-31 09:05:20.956716195 +0000 UTC [INFO] VolumeDriver Get: 41f10a305a34d2ba55e97ae2ae36f8860135bf1da28baca9702589529dab219e
2017-10-31 09:05:48.397540093 +0000 UTC [INFO] VolumeDriver Get: 88dde33159dd7ea6eb978fea03a03cbaae41583934e74fd3838d547f8a70d5b1
2017-10-31 09:06:15.854202428 +0000 UTC [INFO] VolumeDriver Get: bd8a08dfa8014cbb0b51a1d238d29d03f0251b33344677f97cc04c25f3476f7e

manager3

2017-10-31 09:11:54.329242199 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:11:54.329416049 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 09:11:59.581617309 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:05.624456251 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:10.750564573 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:16.217355943 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:23.70115703 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:28.826429604 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:12:28.826588979 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 09:12:39.142116964 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:44.275405478 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:49.725257798 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:12:54.97203895 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:12:54.972169615 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 09:13:03.851710096 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:13:03.851870749 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 09:13:09.537374433 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 09:13:14.752017914 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:13:14.752153911 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 09:13:20.134451492 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 09:13:20.134620496 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"

At the moment, manager3 is the Leader of the swarm.

Regards, Bart

luomiao commented 6 years ago

Hi @bartlenaerts Thanks for posting the logs.

Can you post a bit more of the beginning of the log on manager3? I'd like to check when ETCD started going wrong.

Also, from the log of manager1, it seems the docker service just got restarted. Has the node been restarted since the plugin was installed?

We are currently working on the beta release of vFile, which will have more stability and flexibility around the install order of the plugin. But for now, you may want to strictly follow the steps below (see the command sketch after the list):

  1. Please make sure the Docker version is newer than 17.06
  2. Please make sure the ETCD ports (2379, 2380) are open on your manager VMs
  3. Install the vsphere plugin on all the nodes first, before installing the vFile plugin
  4. Disable and remove all previously existing vFile plugins from your VMs
  5. Install the vFile plugin on the swarm leader node first, then on all the other nodes (this should not be a requirement, but let's follow this order to debug)
  6. After the vFile plugin installation on each node, please check the log for the following line:
    2017-10-25 16:54:24.987122121 -0700 PDT [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vfile.sock"
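
For convenience, steps 2, 3, and 5 translate roughly into the shell commands below. This is a sketch only: the firewall-cmd lines assume your managers run firewalld, and the vsphere plugin image name is an assumption that may differ in your environment; the vfile install command is the same one given later in this thread.

    # On each manager VM: open the ETCD client/peer ports (assumes firewalld)
    firewall-cmd --permanent --add-port=2379-2380/tcp
    firewall-cmd --reload

    # On every node: install the vsphere plugin first (image name is an assumption)
    docker plugin install --grant-all-permissions --alias vsphere vmware/vsphere-storage-for-docker:latest

    # Then install vFile, starting with the swarm leader node
    docker plugin install --grant-all-permissions --alias vfile vmware/vfile:latest

    # On each node, verify the plugin is serving before moving to the next
    grep "Going into ServeUnix" /var/log/vfile.log
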
bartlenaerts commented 6 years ago

Hi @luomiao,

I followed your steps and now it works! So the steps I did were:

  1. Checked Docker version: 17.09
  2. Opened ETCD ports on all 3 managers.
  3. Removed the vSphere and vFile plugins on all nodes.
  4. Restarted Docker on all nodes (just to be sure).
  5. Installed vSphere on all nodes first.
  6. Installed vFile on leader manager only.
  7. Checked the log (see below).
  8. Installed vFile on all other nodes.
  9. I was then able to create a vFile volume: docker volume create --driver=vfile --name=SharedVol -o size=10gb (see the quick check below)
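
As a sanity check after step 9, the volume should be visible from any node with standard docker commands:

    docker volume ls                  # should list SharedVol under the vfile driver
    docker volume inspect SharedVol   # shows the volume's metadata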

vfile.log

2017-10-31 10:02:58.34807531 +0000 UTC [INFO] Starting plugin driver=vfile log_level=info config="/etc/vfile.conf"
2017-10-31 10:02:58.348457389 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-10-31 10:02:58.381721215 +0000 UTC [INFO] Found 0 running or paused containers
2017-10-31 10:02:58.382469762 +0000 UTC [INFO] Discovered 0 volumes that may be in use.
2017-10-31 10:02:58.382499895 +0000 UTC [INFO] Refcounting successfully completed
2017-10-31 10:02:58.382542227 +0000 UTC [INFO] vFile plugin started version="vFile Volume Driver v0.2"
2017-10-31 10:02:58.382620002 +0000 UTC [INFO] Started loading file server image
2017-10-31 10:02:58.384038204 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vfile.sock"
2017-10-31 10:02:58.41946116 +0000 UTC [INFO] Swarm node role: leader, start ETCD cluster nodeID=i5c6sievyy4rggvpzqecsmu7c
2017-10-31 10:02:59.420076507 +0000 UTC [INFO] Checking ETCD client is started
2017-10-31 10:03:09.452428503 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:14.901041414 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:19.971432827 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:03:19.971590919 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:03:25.005591279 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:30.079088397 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:03:30.07923584 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:03:35.110973839 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:40.143142735 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:45.531294864 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:50.581942072 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:03:55.975909884 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:01.008991757 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:06.04353414 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:04:06.043698527 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:04:11.148425895 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix prefix="SVOLS_stat_" error="context deadline exceeded"
2017-10-31 10:04:11.148575098 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:04:16.182878652 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:21.217242235 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:26.252000231 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:04:26.252117244 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:04:31.291695968 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:36.328717021 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:41.366399264 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:04:41.366510315 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:04:46.400956082 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:51.827967533 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:04:56.868578217 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:05:02.26486748 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:05:07.303951319 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:05:12.340111352 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix error="context deadline exceeded" prefix="SVOLS_stat_"
2017-10-31 10:05:12.34027219 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:05:17.429504872 +0000 UTC [WARNING] Failed to get volume states from ETCD due to error context deadline exceeded.
2017-10-31 10:05:22.467149029 +0000 UTC [ERROR] Failed to call ETCD Get for listing all keys with prefix prefix="SVOLS_stat_" error="context deadline exceeded"
2017-10-31 10:05:22.467267836 +0000 UTC [ERROR] Failed to get volume list error="context deadline exceeded"
2017-10-31 10:22:04.283225357 +0000 UTC [INFO] VolumeDriver Get: SharedVol
2017-10-31 10:22:04.318059744 +0000 UTC [INFO] Volume not found: SharedVol
2017-10-31 10:22:04.318226541 +0000 UTC [ERROR] Failed to get volume meta-data name=SharedVol error="No such volume"
2017-10-31 10:22:04.319159223 +0000 UTC [INFO] VolumeDriver Create: SharedVol
2017-10-31 10:22:04.319339815 +0000 UTC [INFO] Attempting to write initial metadata entry for SharedVol
2017-10-31 10:22:04.354541407 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-10-31 10:22:04.3557077 +0000 UTC [INFO] Attempting to create internal volume for SharedVol
2017-10-31 10:22:10.707840759 +0000 UTC [INFO] Attempting to update volume state to ready for volume: SharedVol
2017-10-31 10:22:10.744852291 +0000 UTC [INFO] Successfully created volume: SharedVol

The problem could have been one of two things:

  1. The ETCD ports were not added to firewalld on the managers.
  2. Because the hosts were created with Terraform, the vSphere and vFile plugins were automatically installed on the host before other hosts were available. So the leader already had those 2 plugins before any extra managers and workers existed.

Thanks for your help!

Regards, Bart

luomiao commented 6 years ago

@bartlenaerts Great to know it works! The first one might be the reason; we will update the user guide to highlight this requirement. The second one should not be a problem though. And from the log on the leader, it seems the leader might not be able to talk to its own ETCD port? It may be able to talk to the other managers though.
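
If you want to verify that, a quick connectivity check from the leader VM itself (a sketch; assumes nc is available and ETCD uses the default client port):

    nc -zv 127.0.0.1 2379            # the leader's own ETCD client port
    nc -zv <other-manager-ip> 2379   # repeat for each of the other managers

If the loopback check fails while the checks against the other managers succeed, that would match what the log suggests.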

thiagolealassis commented 6 years ago

@luomiao Swarm is a restriction for my company. Are you planning to deliver an orchestration-technology-independent version?

luomiao commented 6 years ago

@thiagolealassis We don't have an orchestrator-independent plan, but we are planning a vFile plugin/provider for Kubernetes in Q1 2018.

bartlenaerts commented 6 years ago

Hi @luomiao,

we're not quite there yet, sorry... I can create a vFile volume, but I still have problems mounting it to a Docker container:

VolumeDriver.Mount: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete.

Any ideas?

Regards, Bart

luomiao commented 6 years ago

@bartlenaerts There was a recent change in the SMB server utilized by vFile, which increased the start time of a vFile volume to around 1 minute. This could result in the timeout error you saw... Would you please try increasing VFILE_TIMEOUT_IN_SECOND when you install the plugin:

    docker plugin install --grant-all-permissions --alias vfile vmware/vfile:latest VFILE_TIMEOUT_IN_SECOND=90
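
If you want to confirm the value took effect, or change it without a full reinstall, something like this may work (a sketch; the Settings.Env template path is an assumption about the docker plugin inspect output format):

    docker plugin inspect -f '{{.Settings.Env}}' vfile:latest   # show the plugin's current env settings
    docker plugin disable vfile:latest                          # a plugin must be disabled before changing settings
    docker plugin set vfile:latest VFILE_TIMEOUT_IN_SECOND=90
    docker plugin enable vfile:latest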

We are currently working on solving this long start time: https://github.com/vmware/docker-volume-vsphere/issues/1954

bartlenaerts commented 6 years ago

Hi @luomiao, sorry to keep bothering you, but I've got a new error now... I created the vFile volume and mounted it to my container. I see a vFileServerassets container running, but it comes and goes, and its container ID changes continuously. Therefore the referencing container can't start and gives these errors:

"Post http://%2Frun%2Fdocker%2Fplugins%2F24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8%2Fvfile.sock/VolumeDriver.Mount: dial unix /run/docker/plugins/24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8/vfile.sock: connect: no such file or directory"
"task: non-zero exit (1)"

The first error is probably because the vFile server changed its container ID; the second occurs while the vFile server is not available at all.

vfile.log

2017-11-02 13:35:51.92403881 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-02 13:35:52.031943042 +0000 UTC [INFO] Volume not found: jobwalkr_assets
2017-11-02 13:35:52.031986049 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="No such volume"
2017-11-02 13:35:52.036334682 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-02 13:35:52.093771792 +0000 UTC [INFO] Volume not found: jobwalkr_assets
2017-11-02 13:35:52.093815283 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="No such volume"
2017-11-02 13:35:52.094233375 +0000 UTC [INFO] VolumeDriver Create: jobwalkr_assets
2017-11-02 13:35:52.094356322 +0000 UTC [INFO] Attempting to write initial metadata entry for jobwalkr_assets
2017-11-02 13:35:52.145466086 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:35:52.145727574 +0000 UTC [INFO] Attempting to create internal volume for jobwalkr_assets
2017-11-02 13:36:18.400656062 +0000 UTC [INFO] Attempting to update volume state to ready for volume: jobwalkr_assets
2017-11-02 13:36:18.470265516 +0000 UTC [INFO] Successfully created volume: jobwalkr_assets
2017-11-02 13:37:42.50291006 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:37:43.503513485 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:37:43.592737218 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:38:59.099365506 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:39:00.102141058 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:39:01.099632248 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:39:02.099705788 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:39:03.099571491 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:39:04.100088114 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:39:04.168884487 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:39:04.16893153 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:39:05.169707581 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:39:05.293969382 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:39:17.59621115 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-02 13:39:17.807093978 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-02 13:39:17.808360544 +0000 UTC [INFO] Directory doesn't exist, creating it path="/mnt/vfile/jobwalkr_assets/"
2017-11-02 13:39:17.808515762 +0000 UTC [INFO] Before AtomicIncr
2017-11-02 13:39:19.025947836 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:39:19.114016528 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-02 13:40:14.163994607 +0000 UTC [INFO] Add VM[10.11.0.10] with ID[i5c6sievyy4rggvpzqecsmu7c] to ClientList for volume[jobwalkr_assets]
2017-11-02 13:40:14.228384113 +0000 UTC [INFO] Volume state mounted, prepare to mounting locally
2017-11-02 13:40:14.228451221 +0000 UTC [INFO] Get info for mounting name="jobwalkr_assets" Port=30000 ServiceName="vFileServerjobwalkr_assets"
2017-11-02 13:40:14.228483431 +0000 UTC [INFO] Mounting volume with options volume name="jobwalkr_assets" arguments=[-t cifs -o username=root,password=badpass,port=30000,vers=3.0 //127.0.0.1/share1 /mnt/vfile/jobwalkr_assets/]
2017-11-02 13:40:14.228517356 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-02 13:40:15.458150885 +0000 UTC [INFO] Unmounting Volume name="jobwalkr_assets"
2017-11-02 13:40:15.465774951 +0000 UTC [INFO] Before AtomicDecr
2017-11-02 13:40:16.658644751 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:40:16.738970698 +0000 UTC [INFO] Remove VM[10.11.0.10] with ID[i5c6sievyy4rggvpzqecsmu7c] from ClientList for volume[jobwalkr_assets]
2017-11-02 13:40:17.007109858 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-02 13:40:17.008335355 +0000 UTC [INFO] Before AtomicIncr
2017-11-02 13:40:18.140162498 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:40:18.199360455 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-02 13:40:19.252764519 +0000 UTC [INFO] Add VM[10.11.0.10] with ID[i5c6sievyy4rggvpzqecsmu7c] to ClientList for volume[jobwalkr_assets]
2017-11-02 13:40:19.331676249 +0000 UTC [INFO] Volume state mounted, prepare to mounting locally
2017-11-02 13:40:19.331735085 +0000 UTC [INFO] Get info for mounting name="jobwalkr_assets" Port=30000 ServiceName="vFileServerjobwalkr_assets"
2017-11-02 13:40:19.331763975 +0000 UTC [INFO] Mounting volume with options volume name="jobwalkr_assets" arguments=[-t cifs -o username=root,password=badpass,port=30000,vers=3.0 //127.0.0.1/share1 /mnt/vfile/jobwalkr_assets/]
2017-11-02 13:40:19.331802896 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-02 13:40:20.021045254 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:40:24.789026722 +0000 UTC [INFO] Unmounting Volume name="jobwalkr_assets"
2017-11-02 13:40:24.811283565 +0000 UTC [INFO] Before AtomicDecr
2017-11-02 13:40:26.075377759 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:40:26.15533851 +0000 UTC [INFO] Remove VM[10.11.0.10] with ID[i5c6sievyy4rggvpzqecsmu7c] from ClientList for volume[jobwalkr_assets]
2017-11-02 13:40:27.075666681 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:40:28.07568919 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:40:29.075833466 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:40:30.075741729 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:40:31.075648386 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:40:31.145977948 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:41:07.123652691 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:41:08.12391323 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:41:08.339085714 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:41:18.921287077 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:42:20.43542045 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:42:20.72238295 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:42:21.722691798 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:42:22.722697724 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:42:23.722805505 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:42:24.722806464 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:42:25.722770251 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:42:25.822887719 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:42:25.822963757 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:42:26.823220615 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:42:26.920156237 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:42:26.920252071 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:43:41.450977382 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:43:43.176891405 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:43:44.177813775 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:43:45.177211784 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:43:46.177277774 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:43:47.177215746 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:43:48.177288179 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:43:48.267026903 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:43:48.267096818 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:43:49.26734785 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:43:49.358526825 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:45:00.17391791 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:45:01.174255888 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:45:02.174297397 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:45:03.174409292 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:45:04.174746174 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:45:05.174284417 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:45:05.264105386 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:45:05.264175806 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:45:06.264463974 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:45:06.424592 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:45:10.788458674 +0000 UTC [INFO] VolumeDriver Get: 8c0991d45fcb7901dd012529f0cfa3e132195a80643197ca29dfac30fde40231
2017-11-02 13:45:10.83618306 +0000 UTC [INFO] Volume not found: 8c0991d45fcb7901dd012529f0cfa3e132195a80643197ca29dfac30fde40231
2017-11-02 13:45:10.836231719 +0000 UTC [ERROR] Failed to get volume meta-data name=8c0991d45fcb7901dd012529f0cfa3e132195a80643197ca29dfac30fde40231 error="No such volume"
2017-11-02 13:45:53.789808896 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:21.267028877 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:22.763232793 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:24.618890025 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:30.30886352 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:31.309104588 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:46:32.309188834 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:46:33.309231832 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:46:34.309162756 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:46:35.309209949 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:46:35.397882227 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:46:35.397952838 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:46:36.399738385 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:46:36.507795265 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:46:41.700434194 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:47:48.100001694 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:47:48.426759456 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:47:49.427097766 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:47:50.427111969 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:47:51.427104071 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:47:52.427167677 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:47:53.4280003 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:47:53.487290112 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:47:53.48733546 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:47:54.487671041 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:47:54.630392502 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:47:54.630519018 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:49:06.247108545 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:49:08.032145704 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:49:09.032471138 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:49:10.03245913 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:49:11.032728282 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:49:12.032522635 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:49:13.032516805 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-02 13:49:13.137063916 +0000 UTC [INFO] Volume not in proper state for the operation: Ready
2017-11-02 13:49:13.137161852 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:49:14.137418039 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-02 13:49:14.226225668 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-02 13:49:22.188268228 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-02 13:49:22.366973981 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-02 13:49:22.370727446 +0000 UTC [INFO] Before AtomicIncr
2017-11-02 13:49:23.550579102 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-02 13:49:23.630558523 +0000 UTC [INFO] Service start timeoutvalue=90

Any help would be greatly appreciated!

Regards, Bart

luomiao commented 6 years ago

@bartlenaerts vFile is still an experimental feature and we are working to make it stable, so you're very welcome to submit the problems you meet during deployment, so we can solve them for the beta release :)

Regarding this error, I think the order of events might be different from what you thought. From the log, it looks like the reason the vFile server keeps changing its container ID is that the user container somehow keeps restarting. The vFile plugin is able to mount/unmount successfully. However, the user container hit the error that you posted: "Post http://%2Frun%2Fdocker%2Fplugins%2F24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8%2Fvfile.sock/VolumeDriver.Mount: dial unix /run/docker/plugins/24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8/vfile.sock: connect: no such file or directory"

May I ask how you started your container? Are you starting it with docker service, or docker run? Another question: are you using any special permissions for your container, i.e., some user permission other than root?

Update: can you please run docker plugin ls on the node where you saw this error and paste the output here? I think I saw this problem when I tried to re-install the plugin after a volume was created. For example, if I created a volume A with the vFile plugin on node 1 first, then removed the vFile plugin and re-installed it on node 1 again, this error can appear when I try to use volume A on node 1. The container will try to check the old plugin path (24ff4b45da04... in your case) instead of the new plugin path. I am looking into the reason now. But if you can use docker plugin ls to find out the current vFile plugin ID, that would be very helpful.

Thanks!

luomiao commented 6 years ago

I did some tests with other docker plugins and found out this is an issue not only for us, but for all plugins installed via "docker plugin install". That is, if the plugin is re-installed after a volume was created with it, the volume won't be able to function correctly: the volume stays bound to the specific plugin instance that was removed. @bartlenaerts, when the error happens, would you please check whether the ID inside the /run/docker/plugins/[ID] error message matches the plugin ID you see via the docker plugin ls command? More importantly, can you verify whether the vFile plugin got re-installed after the volume was created?
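
A quick way to compare the two IDs on the affected node (a sketch using standard docker commands):

    docker plugin inspect -f '{{.Id}}' vfile:latest   # full ID of the currently installed vFile plugin
    ls /run/docker/plugins/                           # socket directories that actually exist

If the ID in the error message appears in neither output, the volume is referencing a plugin instance that no longer exists.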

victormakmo commented 6 years ago

@luomiao I followed your suggestion and increased the timeout to 90s, but I still got the same error: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete.

Below is my test lab environment: ESXi 6.0.0 build 2494585, Docker 17.06-CE

2017-11-06 09:41:27.244955838 +0000 UTC [INFO] VolumeDriver Get: SharedVol
2017-11-06 09:41:27.569263652 +0000 UTC [INFO] Mounting volume name=SharedVol
2017-11-06 09:41:27.570079456 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 09:41:28.922490942 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-06 09:41:29.105063978 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 09:41:29.922852498 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-06 09:41:30.133822207 +0000 UTC [INFO] Volume not in proper state for the operation: Error
2017-11-06 09:43:10.370235779 +0000 UTC [ERROR] error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete." name=SharedVol
2017-11-06 09:43:10.370442619 +0000 UTC [ERROR] Failed to mount error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete." name=SharedVol
2017-11-06 09:43:10.37044535 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-06 09:43:10.370503621 +0000 UTC [INFO] Detaching SharedVol - it is not used anymore
2017-11-06 09:43:11.370916302 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-06 09:43:11.612027185 +0000 UTC [INFO] Volume not in proper state for the operation: Error

bartlenaerts commented 6 years ago

Hi @luomiao,

because of the weekend I left it running and retrying for the last 4 days, and now it is up and running by itself. Quite strange; I don't know why. Patience is really a virtue.

But to answer your questions:

  1. I installed the service with docker service.
  2. The output of docker plugin ls and the /run/docker/plugins folder is:
    
    drwx------. 2 root root 60 Nov  1 13:20 195cedec9fbc6aac7b80b5453a8db62de26e3d76ee71ef127396556ae9fe97bc
    drwx------. 2 root root 60 Nov  2 09:59 327267222b5332e5ca49eff2ee0d8d06949cb3220aee5b9c8d7cd7607d2e7f1e

ID                  NAME                DESCRIPTION                           ENABLED
327267222b53        vfile:latest        VMWare vFile Docker Volume plugin     true
195cedec9fbc        vsphere:latest      VMWare vSphere Docker Volume plugin   true

Looks like both IDs match. I've checked all 9 workers and managers, and they all have the same IDs. So nothing strange there.

  3. I'm sure that the plugin wasn't re-installed after the volume was created.

To sum up, it is working now, but it looks like 10 hours ago there still was an error:

[root@manager1 plugins]# docker service ps k4w --no-trunc
ID                          NAME                            IMAGE           NODE                    DESIRED STATE   CURRENT STATE           ERROR                                                                                                                                                                                                                                                                                       PORTS
mjw4593oejsiifqk3jm2vykva   vFileServerjobwalkrassets.1     dperson/samba   worker6.jobwalkr.com    Running         Running 10 hours ago
fh3hompok1jrgnqkk4g114fyf    \ vFileServerjobwalkr_assets.1   dperson/samba   worker6.jobwalkr.com    Shutdown        Failed 10 hours ago     "starting container failed: error while mounting volume '': VolumeDriver.Mount: Failed to add disk 'scsi0:1'. disk /vmfs/volumes/VSA_DS1/dockvols/_DEFAULT/_vF_jobwalkrassets.vmdk already attached to VM=5005cd4c-cf85-1dee-8b3b-f8f05268dc7c"
5qjdva8p09zk97tdc8h5jdrkn    \ vFileServerjobwalkr_assets.1   dperson/samba   worker5.jobwalkr.com    Shutdown        Failed 10 hours ago     "starting container failed: error while mounting volume '/mnt/vmdk/_vF_jobwalkr_assets/': VolumeDriver.Mount: Failed to add disk 'scsi0:1'. disk /vmfs/volumes/VSA_DS1/dockvols/_DEFAULT/_vF_jobwalkrassets.vmdk already attached to VM=5005cd4c-cf85-1dee-8b3b-f8f05268dc7c"
fwiy4ly630iz6zae1z8c30r9q    \ vFileServerjobwalkr_assets.1   dperson/samba   manager2.jobwalkr.com   Shutdown        Shutdown 10 hours ago

[root@manager1 plugins]# docker service ps 7qp --no-trunc
ID                          NAME                  IMAGE                                                                                                                                NODE                    DESIRED STATE   CURRENT STATE           ERROR
p4gi0o2flhmp1oqfq6o72lslm   jobwalkrassets.1      registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   worker2.jobwalkr.com    Running    Running 10 hours ago
znllkdlcbbti3gypza9qozw7f    \ jobwalkrassets.1   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   manager3.jobwalkr.com   Shutdown   Rejected 11 hours ago   "Post http://%2Frun%2Fdocker%2Fplugins%2F24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8%2Fvfile.sock/VolumeDriver.Mount: dial unix /run/docker/plugins/24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8/vfile.sock: connect: no such file or directory"
zzzj874dkegyb4ivvt6d6icgq    \ jobwalkrassets.1   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   worker3.jobwalkr.com    Shutdown   Rejected 2 days ago     "VolumeDriver.Mount: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
zz30oqswdppow73ki4ptlb5sj    \ jobwalkrassets.1   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   worker3.jobwalkr.com    Shutdown   Rejected 2 days ago     "VolumeDriver.Mount: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
zysowrqx7mzo44pypikgzpqut    \ jobwalkr_assets.1   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   manager3.jobwalkr.com   Shutdown   Rejected 3 days ago     "Post http://%2Frun%2Fdocker%2Fplugins%2F24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8%2Fvfile.sock/VolumeDriver.Mount: dial unix /run/docker/plugins/24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8/vfile.sock: connect: no such file or directory"
nocvvb9fvm0n1yub3fopoc5cj   jobwalkrassets.2      registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   worker3.jobwalkr.com    Running    Running 10 hours ago
5aze09yanlqf1nlomoq7t9hh7    \ jobwalkrassets.2   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   manager3.jobwalkr.com   Shutdown   Shutdown 10 hours ago
svw8qh4g6lbgmcpc0702p4a4h    \ jobwalkrassets.2   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   manager3.jobwalkr.com   Shutdown   Shutdown 10 hours ago
0yj8jilv7e8dqq0gsm4s96bkk    \ jobwalkrassets.2   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   manager3.jobwalkr.com   Shutdown   Shutdown 10 hours ago
zzt4b8avuvsx7g8wguxdz30qj    \ jobwalkr_assets.2   registry.gitlab.com/bamboo3.jobwalkr/bamboo3.jobwalkr.assets:latest@sha256:db2504795b99ba451106d1e03a1853027bd5d708e59af92c4d9f513876069dcf   worker6.jobwalkr.com    Shutdown   Rejected 2 days ago     "VolumeDriver.Mount: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."


This is the vfile.log on **worker2.jobwalkr.com**:

2017-11-03 17:25:04.186734565 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-03 17:25:04.698565343 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-03 17:25:04.699166541 +0000 UTC [INFO] Directory doesn't exist, creating it path="/mnt/vfile/jobwalkr_assets/"
2017-11-03 17:25:04.699274603 +0000 UTC [INFO] Before AtomicIncr
2017-11-03 17:25:06.058027556 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-03 17:26:19.196644801 +0000 UTC [INFO] Add VM[10.11.0.14] with ID[0m4raijxcehioq6yov27fauu7] to ClientList for volume[jobwalkr_assets]
2017-11-03 17:26:19.331361115 +0000 UTC [INFO] Volume state mounted, prepare to mounting locally
2017-11-03 17:26:19.331470891 +0000 UTC [INFO] Get info for mounting name="jobwalkr_assets" Port=30000 ServiceName="vFileServerjobwalkr_assets"
2017-11-03 17:26:19.331506565 +0000 UTC [INFO] Mounting volume with options volume name="jobwalkr_assets" arguments=[-t cifs -o username=root,password=badpass,port=30000,vers=3.0 //127.0.0.1/share1 /mnt/vfile/jobwalkr_assets/]
2017-11-03 17:26:19.331550771 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-03 17:26:20.651574113 +0000 UTC [INFO] Unmounting Volume name="jobwalkr_assets"
2017-11-03 17:26:20.667908388 +0000 UTC [INFO] Before AtomicDecr
2017-11-03 17:26:22.089856726 +0000 UTC [INFO] Remove VM[10.11.0.14] with ID[0m4raijxcehioq6yov27fauu7] from ClientList for volume[jobwalkr_assets]
2017-11-03 17:26:22.509861826 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-03 17:26:22.510411346 +0000 UTC [INFO] Before AtomicIncr
2017-11-03 17:26:23.874220336 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-03 17:26:24.9488618 +0000 UTC [INFO] Add VM[10.11.0.14] with ID[0m4raijxcehioq6yov27fauu7] to ClientList for volume[jobwalkr_assets]
2017-11-03 17:26:25.030445419 +0000 UTC [INFO] Volume state mounted, prepare to mounting locally
2017-11-03 17:26:25.030502655 +0000 UTC [INFO] Get info for mounting name="jobwalkr_assets" Port=30000 ServiceName="vFileServerjobwalkr_assets"
2017-11-03 17:26:25.030545702 +0000 UTC [INFO] Mounting volume with options arguments=[-t cifs -o username=root,password=badpass,port=30000,vers=3.0 //127.0.0.1/share1 /mnt/vfile/jobwalkr_assets/] volume name="jobwalkr_assets"
2017-11-03 17:26:25.030569989 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-03 17:26:36.673009232 +0000 UTC [INFO] Unmounting Volume name="jobwalkr_assets"
2017-11-03 17:26:36.690330221 +0000 UTC [INFO] Before AtomicDecr
2017-11-03 17:26:38.0675721 +0000 UTC [INFO] Remove VM[10.11.0.14] with ID[0m4raijxcehioq6yov27fauu7] from ClientList for volume[jobwalkr_assets]
2017-11-06 05:15:17.629489491 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:15:19.869340218 +0000 UTC [WARNING] Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:15:19.869553049 +0000 UTC [WARNING] Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:15:19.8695957 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swar$
2017-11-06 05:15:20.636203793 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:15:22.86066109 +0000 UTC [WARNING] Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:15:22.860831554 +0000 UTC [WARNING] Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:15:22.860883742 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Sw$


vfile.log **worker3.jobwalkr.com**:

2017-11-05 15:18:13.717954638 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 04:22:01.103111451 +0000 UTC [INFO] VolumeDriver Get: 8256f80592eb00512ee30c7dff0755e8ac0c42f4b7eb24dc8983a4154f66297e
2017-11-06 04:22:01.534985875 +0000 UTC [INFO] Volume not found: 8256f80592eb00512ee30c7dff0755e8ac0c42f4b7eb24dc8983a4154f66297e
2017-11-06 04:22:01.535043188 +0000 UTC [ERROR] Failed to get volume meta-data name=8256f80592eb00512ee30c7dff0755e8ac0c42f4b7eb24dc8983a4154f66297e error="No such volume"
2017-11-06 04:35:56.088914427 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 04:35:56.586154029 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 04:35:56.64786963 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 04:35:56.6518307 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 04:35:58.201107861 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 04:37:39.884242468 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 04:37:39.884348016 +0000 UTC [ERROR] Failed to mount error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete." name="jobwalkr_assets"
2017-11-06 04:37:39.884463786 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:06:30.846500667 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:06:31.353015513 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:06:31.365055099 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:06:31.366050955 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:06:33.246069777 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:08:14.998695306 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:08:14.998858585 +0000 UTC [ERROR] Failed to mount error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete." name="jobwalkr_assets"
2017-11-06 05:08:14.998925399 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:08:15.413549092 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:08:15.794082745 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:08:15.822464104 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:08:15.823282247 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:08:17.530279536 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:09:59.093448824 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:09:59.093635368 +0000 UTC [ERROR] Failed to mount name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:09:59.093698402 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:10:40.977900957 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:10:41.583115473 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:10:41.620409472 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:10:41.620855147 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:10:43.373549271 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:12:25.011539027 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:12:25.011684317 +0000 UTC [ERROR] Failed to mount name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:12:25.011731325 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:12:25.462510444 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:12:25.933857928 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:12:25.946157591 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:12:25.946933716 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:12:27.648751252 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:14:09.149162863 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:14:09.149223041 +0000 UTC [ERROR] Failed to mount name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:14:09.14924629 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:14:09.651124988 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:14:10.206183491 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:14:10.218752236 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:14:10.219576335 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:14:11.974259547 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:14:24.974870418 +0000 UTC [ERROR] Failed to compare and get from ETCD error="context deadline exceeded" key="SVOLS_stat_jobwalkr_assets" value=Mounted new key="SVOLS_info_jobwalkr_assets"
2017-11-06 05:14:28.201925421 +0000 UTC [ERROR] Failed to Get key-value from ETCD key="SVOLS_gref_jobwalkr_assets" error="context deadline exceeded"
2017-11-06 05:14:28.202092152 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: context deadline exceeded. Also failed to decrease global refcount. Error: context deadline exceeded."
2017-11-06 05:14:28.20213657 +0000 UTC [ERROR] Failed to mount name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: context deadline exceeded. Also failed to decrease global refcount. Error: context deadline exce$
2017-11-06 05:14:28.202170672 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:14:46.861775301 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:14:49.067818778 +0000 UTC [WARNING] Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:14:49.067976012 +0000 UTC [WARNING] Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:14:49.068008503 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Sw$
2017-11-06 05:14:49.339188432 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:14:51.525956223 +0000 UTC [WARNING] Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:14:51.526097009 +0000 UTC [WARNING] Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy
2017-11-06 05:14:51.52616641 +0000 UTC [ERROR] Failed to get volume meta-data name="jobwalkr_assets" error="Failed to read metadata for volume jobwalkr_assets from KV store. Transactional metadata read failed: context deadline exceeded.Swarm cluster maybe unhealthy"


vfile.log **worker6.jobwalkr.com**:

2017-11-06 05:03:55.907482864 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:03:56.419162813 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:03:56.42897166 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:03:56.42943645 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:03:58.080514435 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:05:39.667608081 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:05:39.667821596 +0000 UTC [ERROR] Failed to mount name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:05:39.668076077 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:07:23.210689729 +0000 UTC [INFO] VolumeDriver Get: jobwalkr_assets
2017-11-06 05:07:23.629129868 +0000 UTC [INFO] GetClientList key=SVOLS_client_jobwalkr_assets_i5c6sievyy4rggvpzqecsmu7c IP=10.11.0.10
2017-11-06 05:07:23.639076698 +0000 UTC [INFO] Mounting volume name="jobwalkr_assets"
2017-11-06 05:07:23.639430395 +0000 UTC [INFO] Before AtomicIncr
2017-11-06 05:07:25.102044469 +0000 UTC [INFO] Service start timeoutvalue=90
2017-11-06 05:09:06.693201812 +0000 UTC [ERROR] name="jobwalkr_assets" error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-06 05:09:06.693279418 +0000 UTC [ERROR] Failed to mount error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete." name="jobwalkr_assets"
2017-11-06 05:09:06.693305351 +0000 UTC [INFO] Detaching jobwalkr_assets - it is not used anymore
2017-11-06 05:45:14.916418219 +0000 UTC [INFO] VolumeDriver Get: 1f8d27a94f176efadcabf29c5dd1ff723149da46a5bc7046ac32fef06bfa470b
2017-11-06 05:45:17.457772465 +0000 UTC [INFO] Volume not found: 1f8d27a94f176efadcabf29c5dd1ff723149da46a5bc7046ac32fef06bfa470b
2017-11-06 05:45:17.457871058 +0000 UTC [ERROR] Failed to get volume meta-data error="No such volume" name=1f8d27a94f176efadcabf29c5dd1ff723149da46a5bc7046ac32fef06bfa470b
2017-11-06 05:46:08.659363845 +0000 UTC [INFO] VolumeDriver Get: b111c63cace7db055ece80c2d2d1537d0b5002317f929d5b6a62ff45b6ce0ec0
2017-11-06 05:46:08.993500778 +0000 UTC [INFO] Volume not found: b111c63cace7db055ece80c2d2d1537d0b5002317f929d5b6a62ff45b6ce0ec0
2017-11-06 05:46:08.993570843 +0000 UTC [ERROR] Failed to get volume meta-data name=b111c63cace7db055ece80c2d2d1537d0b5002317f929d5b6a62ff45b6ce0ec0 error="No such volume"



I'll follow up on it over the next few days and let you know if anything changes. If I can provide any further input, just let me know and I'll post it.

Regards,
Bart
luomiao commented 6 years ago

@victormakmo Would you please check whether the ETCD ports (2379, 2380) are open on your manager VMs? Also, may I ask whether your VMs all reside on the same ESX host, or whether multiple ESX hosts are involved? The third thing you may want to try is increasing the timeout value to 3 minutes. I know this is a very large timeout, but due to the known slow-starting issue of the SMB server, we need it to tell whether the failure is caused by slowness or by a problem in the code. Thanks!
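
For the port check, something like the following should work; this is a rough sketch, where <manager-ip> is a placeholder for each manager's address and it assumes a netcat variant that supports -z:

# Probe the ETCD client and peer ports on each swarm manager, from any other node
nc -zv <manager-ip> 2379   # ETCD client port
nc -zv <manager-ip> 2380   # ETCD peer port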

luomiao commented 6 years ago

@bartlenaerts

Thank you for the log! In fact, the directory names under /var/run/docker/plugins will always match the current output of docker plugin ls. However, the error you saw previously referenced a different ID: "Post http://%2Frun%2Fdocker%2Fplugins%2F24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8%2Fvfile.sock/VolumeDriver.Mount: dial unix /run/docker/plugins/24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8/vfile.sock: connect: no such file or directory"

vs.

drwx------. 2 root root 60 Nov 2 09:59 327267222b5332e5ca49eff2ee0d8d06949cb3220aee5b9c8d7cd7607d2e7f1e

The error message indicates that there was a vFile plugin with ID 24ff4b45da04d67a56fd3c3e09ec19cae8657692b37675c129f398e2fe1ea7b8 installed on this node before, and the volume to be mounted was created with that plugin. But now the plugin has a new ID 327267222b5332e5ca49eff2ee0d8d06949cb3220aee5b9c8d7cd7607d2e7f1e.

So just to double-check: is this docker plugin ls output from your manager3? If it is indeed from manager3, then the volume you are trying to use with this docker service must have been created with the plugin whose ID was 24ff4b45da0, not with the current one whose ID is 327267222b.
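
To compare the two IDs directly on that node, something like this works (a small sketch; docker plugin ls accepts --no-trunc to show full IDs):

docker plugin ls --no-trunc   # full plugin IDs as Docker sees them
ls -l /run/docker/plugins     # socket directories; names should match the IDs above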

victormakmo commented 6 years ago

@luomiao In my test lab, the firewall is disabled everywhere and all swarm nodes run on the same ESX host. After increasing the timeout to 3 minutes, I got the same error. Please find the logs below:

2017-11-07 13:53:30.989347932 +0000 UTC [INFO] No config file found. Using defaults.
2017-11-07 13:53:30.9895395 +0000 UTC [INFO] Starting plugin log_level=info config="/etc/vfile.conf" group=root driver=vfile
2017-11-07 13:53:30.989589229 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-11-07 13:53:31.124138382 +0000 UTC [INFO] Found 5 running or paused containers
2017-11-07 13:53:31.147626922 +0000 UTC [INFO] Discovered 0 volumes that may be in use.
2017-11-07 13:53:31.14775396 +0000 UTC [INFO] Refcounting successfully completed
2017-11-07 13:53:31.14788615 +0000 UTC [INFO] vFile plugin started version="vFile Volume Driver v0.2"
2017-11-07 13:53:31.148270247 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vfile.sock"
2017-11-07 13:53:31.149153576 +0000 UTC [INFO] Started loading file server image
2017-11-07 13:53:31.338950601 +0000 UTC [INFO] Swarm node role: leader, start ETCD cluster nodeID=8e59vsif5iwrl6nvqfnsrg7zg
2017-11-07 13:53:31.339032261 +0000 UTC [INFO] startEtcdCluster on node with nodeID 8e59vsif5iwrl6nvqfnsrg7zg and nodeAddr 192.168.29.211
2017-11-07 13:53:32.345708809 +0000 UTC [INFO] Checking ETCD client is started
2017-11-07 13:53:32.349742967 +0000 UTC [INFO] Local ETCD client is up successfully, start watcher
2017-11-07 13:56:21.756146376 +0000 UTC [INFO] VolumeDriver Get: SharedVol
2017-11-07 13:56:21.886487575 +0000 UTC [INFO] Volume not found: SharedVol
2017-11-07 13:56:21.886612773 +0000 UTC [ERROR] Failed to get volume meta-data error="No such volume" name=SharedVol
2017-11-07 13:56:21.888746219 +0000 UTC [INFO] VolumeDriver Create: SharedVol
2017-11-07 13:56:21.889040571 +0000 UTC [INFO] Attempting to write initial metadata entry for SharedVol
2017-11-07 13:56:22.01324786 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-07 13:56:22.019989643 +0000 UTC [INFO] Attempting to create internal volume for SharedVol
2017-11-07 13:56:22.167972059 +0000 UTC [INFO] Attempting to update volume state to ready for volume: SharedVol
2017-11-07 13:56:22.295935845 +0000 UTC [INFO] Successfully created volume: SharedVol
2017-11-07 13:56:52.623585188 +0000 UTC [INFO] VolumeDriver Get: SharedVol
2017-11-07 13:56:52.954791068 +0000 UTC [INFO] Mounting volume name=SharedVol
2017-11-07 13:56:52.955906016 +0000 UTC [INFO] Directory doesn't exist, creating it path="/mnt/vfile/SharedVol/"
2017-11-07 13:56:52.956073745 +0000 UTC [INFO] Before AtomicIncr
2017-11-07 13:56:54.219516751 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-07 13:56:54.339057191 +0000 UTC [INFO] Service start timeoutvalue=180
2017-11-07 13:56:55.237721777 +0000 UTC [INFO] Attempting to change volume state to Mounting
2017-11-07 13:56:55.551156275 +0000 UTC [INFO] Volume not in proper state for the operation: Mounting
2017-11-07 14:00:05.581419353 +0000 UTC [INFO] Watcher on global refcount returns event type=PUT
2017-11-07 14:00:05.584248142 +0000 UTC [ERROR] name=SharedVol error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-07 14:00:05.584385601 +0000 UTC [ERROR] Failed to mount name=SharedVol error="Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete."
2017-11-07 14:00:05.584418224 +0000 UTC [INFO] Detaching SharedVol - it is not used anymore
2017-11-07 14:00:06.581843797 +0000 UTC [INFO] Attempting to change volume state to Unmounting
2017-11-07 14:00:06.752591527 +0000 UTC [INFO] Volume not in proper state for the operation: Error

luomiao commented 6 years ago

@victormakmo Can you please do the following: after you start the container, within the 3-minute timeout window, run docker service ls from one of your swarm manager nodes and find the ID of the service whose name starts with vFileServer. Then run docker service ps [ID of the vFileServer service] --no-trunc and post the output here.
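
Put together, the two steps look like this; [ID of the vFileServer service] stays a placeholder for the ID found in the first step:

docker service ls | grep vFileServer                           # find the file-server service created for the volume
docker service ps [ID of the vFileServer service] --no-trunc   # full task history with untruncated errors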

Thanks.

victormakmo commented 6 years ago

@luomiao

I start up a container using the command below:

docker run --rm -it -v SharedVol:/mnt/myvol --name busybox-on-node1 busybox

service ps:

ID                          NAME                       IMAGE           NODE                          DESIRED STATE   CURRENT STATE              ERROR   PORTS
nfh6ls8fazqy6g90xehwq6s9m   vFileServerSharedVol.1     dperson/samba   poc-docker-app02.test.local   Running         Starting 53 seconds ago
lsq0ks6d8z2a5y5por1s62lyg    _ vFileServerSharedVol.1  dperson/samba   poc-docker-app02.test.local   Shutdown        Complete 58 seconds ago
vxbz5t9jsoy5voxo5gn8oeon4    _ vFileServerSharedVol.1  dperson/samba   poc-docker-app02.test.local   Shutdown        Complete 4 minutes ago
twz78bi1jfb7f5q187r3o86b4    _ vFileServerSharedVol.1  dperson/samba   poc-docker-app03.test.local   Shutdown        Complete 8 minutes ago
x1hiso21qex28vgy6uxe1stjt    _ vFileServerSharedVol.1  dperson/samba   poc-docker-app02.test.local   Shutdown        Complete 12 minutes ago

service log:

vFileServerSharedVol.1.twz78bi1jfb7@poc-docker-app03.test.local | adduser: user 'root' in use
vFileServerSharedVol.1.twz78bi1jfb7@poc-docker-app03.test.local | Added user root.
vFileServerSharedVol.1.nfh6ls8fazqy@poc-docker-app02.test.local | adduser: user 'root' in use
vFileServerSharedVol.1.vxbz5t9jsoy5@poc-docker-app02.test.local | adduser: user 'root' in use
vFileServerSharedVol.1.x1hiso21qex2@poc-docker-app02.test.local | adduser: user 'root' in use
vFileServerSharedVol.1.twz78bi1jfb7@poc-docker-app03.test.local | smbd version 4.6.8 started.
vFileServerSharedVol.1.twz78bi1jfb7@poc-docker-app03.test.local | Copyright Andrew Tridgell and the Samba Team 1992-2017
vFileServerSharedVol.1.twz78bi1jfb7@poc-docker-app03.test.local | STATUS=daemon 'smbd' finished starting up and ready to serve connections
vFileServerSharedVol.1.lsq0ks6d8z2a@poc-docker-app02.test.local | adduser: user 'root' in use
vFileServerSharedVol.1.x1hiso21qex2@poc-docker-app02.test.local | Added user root.
vFileServerSharedVol.1.nfh6ls8fazqy@poc-docker-app02.test.local | Added user root.
vFileServerSharedVol.1.vxbz5t9jsoy5@poc-docker-app02.test.local | Added user root.
vFileServerSharedVol.1.lsq0ks6d8z2a@poc-docker-app02.test.local | Added user root.
vFileServerSharedVol.1.vxbz5t9jsoy5@poc-docker-app02.test.local | smbd version 4.6.8 started.
vFileServerSharedVol.1.lsq0ks6d8z2a@poc-docker-app02.test.local | smbd version 4.6.8 started.
vFileServerSharedVol.1.x1hiso21qex2@poc-docker-app02.test.local | smbd version 4.6.8 started.
vFileServerSharedVol.1.nfh6ls8fazqy@poc-docker-app02.test.local | smbd version 4.6.8 started.
vFileServerSharedVol.1.lsq0ks6d8z2a@poc-docker-app02.test.local | Copyright Andrew Tridgell and the Samba Team 1992-2017
vFileServerSharedVol.1.vxbz5t9jsoy5@poc-docker-app02.test.local | Copyright Andrew Tridgell and the Samba Team 1992-2017
vFileServerSharedVol.1.x1hiso21qex2@poc-docker-app02.test.local | Copyright Andrew Tridgell and the Samba Team 1992-2017
vFileServerSharedVol.1.nfh6ls8fazqy@poc-docker-app02.test.local | Copyright Andrew Tridgell and the Samba Team 1992-2017
vFileServerSharedVol.1.nfh6ls8fazqy@poc-docker-app02.test.local | STATUS=daemon 'smbd' finished starting up and ready to serve connections
vFileServerSharedVol.1.x1hiso21qex2@poc-docker-app02.test.local | STATUS=daemon 'smbd' finished starting up and ready to serve connections
vFileServerSharedVol.1.vxbz5t9jsoy5@poc-docker-app02.test.local | STATUS=daemon 'smbd' finished starting up and ready to serve connections
vFileServerSharedVol.1.lsq0ks6d8z2a@poc-docker-app02.test.local | STATUS=daemon 'smbd' finished starting up and ready to serve connections

bartlenaerts commented 6 years ago

Hi @luomiao,

This is the output of docker plugin ls from manager3:

ID                  NAME                DESCRIPTION                           ENABLED
ac7712db2e71        vfile:latest        VMWare vFile Docker Volume plugin     true
fc58f64a4ff6        vsphere:latest      VMWare vSphere Docker Volume plugin   true

and the list in /run/docker/plugins

drwx------. 2 root root 60 Nov  7 10:18 ac7712db2e71d01515fbec64a750c577ce4999b7243d18b8c3c204aaa30ced64
drwx------. 2 root root 60 Nov  7 10:18 fc58f64a4ff69408c721965dabc6ee4f6d69630f129ef500ac82338b64b40e76

Something probably got into a bad state because of all the testing I did: removing plugins, re-installing them, and so on.
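
If I have to clean up again, I'll probably do something like this (a rough sketch, assuming the plugin is published as vmware/vfile as in the project docs, and that no containers still use its volumes):

docker plugin disable vfile:latest                                                # stop the plugin; fails if volumes are in use
docker plugin rm vfile:latest                                                     # remove the stale instance and its socket directory
docker plugin install --grant-all-permissions --alias vfile vmware/vfile:latest   # reinstall with a fresh plugin ID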

I'll try again and let you know how it went!

Regards, Bart

luomiao commented 6 years ago

@victormakmo From the log, it seems the servers come up correctly, but somehow the node where the container runs cannot detect that the server is up. This is usually caused by one of two things:

  1. The ETCD ports are not open. Even when the firewall is disabled, that does not always mean the ports are open, especially if the VM runs Photon OS...
  2. The Swarm routing mesh is not working. To verify this, run docker service create --replicas 1 -p 8080:80 --name web nginx on one of the manager nodes, then run curl 127.0.0.1:8080 on every node to verify that each node gets a response (a runnable version follows this list).
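
Putting point 2 together as a runnable sequence (the web service name and nginx image are just the example values from above):

docker service create --replicas 1 -p 8080:80 --name web nginx   # on one manager node
curl 127.0.0.1:8080                                              # on EVERY node; each should return the nginx welcome page
docker service rm web                                            # clean up the test service afterwards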

If you still see the problem after these two checks, we can set up a conference call to help debug it :)

bartlenaerts commented 6 years ago

Hi @luomiao,

I cleaned everything up yesterday: removed all services, volumes, images, and containers. Today I reinstalled my first service again, and it worked like a charm right out of the box!

My conclusion: make sure the ETCD ports are open, install the vFile driver with the option VFILE_TIMEOUT_IN_SECOND=90, and of course start from a clean Docker swarm with no dangling issues.
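
For reference, the install command for that setup should look roughly like the following; it's based on the documented install syntax, so double-check against the current docs before copying it:

# Install the vFile plugin with a 90-second file-server startup timeout
docker plugin install --grant-all-permissions --alias vfile vmware/vfile:latest VFILE_TIMEOUT_IN_SECOND=90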

Thanks for all your help!

Regards, Bart

victormakmo commented 6 years ago

hi @luomiao,

  1. For the ETCD connection, I checked all managers with netstat; all managers have established connections on ports 2379 and 2380, so I assume ETCD is up and running (a sketch of this check follows the list).

  2. Swarm routing is also fine: all managers get a response from curl.
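
The netstat check from point 1 looks roughly like this (a sketch, assuming net-tools is installed on the managers):

netstat -tln | grep -E ':(2379|2380)'   # ETCD ports listening locally
netstat -tn  | grep -E ':(2379|2380)'   # established connections to/from the other managers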

If you have time, we may set up a conference call. :-)

luomiao commented 6 years ago

Hi @victormakmo, sorry, I couldn't find your email address on GitHub. Would you please send an email to me at miaol@vmware.com so we can set up the conference call?

@bartlenaerts Glad to hear things are working! You are welcome to try the next release, which will be available at the beginning of December. If everything is good for now, would you please close this issue? We will open another one for @victormakmo, since his seems to be a different issue. Thank you!