Closed by mkboudreau 6 years ago
The file descriptors being owned by the Docker daemon would make this a Docker-side issue. The vSphere volume driver is a separate process altogether; as far as I know, the plugin only opens the VMCI socket to make calls into ESX, and that path has been in use all along.
Can we try an lsof -p on the process to see which file descriptors are open?
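As a reference, here is one hypothetical way to check a process's open descriptors; the use of the current shell's PID and the awk summary are illustrative sketches, not commands from the thread:

```shell
# Count a process's open file descriptors via procfs (no lsof required).
# $$ (this shell) stands in for the plugin's PID in this sketch.
pid=$$
fd_count=$(ls /proc/$pid/fd | wc -l)
echo "PID $pid has $fd_count open file descriptors"

# With lsof installed, a per-type summary can hint at what is leaking:
#   lsof -p $pid | awk '{print $5}' | sort | uniq -c | sort -rn
```

The procfs variant is handy on minimal hosts where lsof is not installed.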
For logs, please set "debug" in the plugin config file and restart the plugin.
Hey @govint
https://github.com/vmware/vsphere-storage-for-docker/issues/2073#issuecomment-367782421 For logs, please set "debug" in the plugin config file and restart the plugin.
Is there any regression that you know of? On restart, the plugin is not able to parse the plugin config file.
Regarding logs... we've been struggling to get it working from the config file, even though we've been told to do it that way. The config.json-template file seems to set the VDVS_LOG_LEVEL env var, and it appears that if that var is set, the log level from the config file is never considered (see config.go). Am I understanding this correctly?
config.json-template: https://github.com/vmware/vsphere-storage-for-docker/blob/master/plugin_dockerbuild/config.json-template
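Our reading of the precedence can be sketched in shell; note that the override behavior is an assumption based on config.go, and the values below are hypothetical:

```shell
# Assumed precedence: if VDVS_LOG_LEVEL is set in the environment,
# it wins and the log level from the config file is never consulted.
config_file_level="info"      # hypothetical value read from the config file
VDVS_LOG_LEVEL="debug"        # env var, as set via config.json-template
effective="${VDVS_LOG_LEVEL:-$config_file_level}"
echo "effective log level: $effective"
```

If this is right, setting the level on the plugin (docker plugin set) is the only way to change it while the env var is defined.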
Enabled debug logging with the following commands:
docker plugin disable -f vsphere
docker plugin set vsphere VDVS_LOG_LEVEL="debug"
docker plugin enable vsphere
@mkboudreau, @bteichner, can you confirm what the issue is with the Docker daemon, given the observed behavior? Do you wish to keep this issue open here, as we presently don't see any bug in the volume plugin to fix?
Since we turned debug logging on, we have not encountered the issue. It has always taken between 2 and 15 days between incidents. Please keep the issue open a little longer. I would really like to see this happen while we have debug logging turned on.
@mkboudreau, sure no problem.
@mkboudreau, can you give an update on this issue? We can close it if there aren't any more updates.
Thank you for following up. Go ahead and close and I can always reopen it if needed.
Closing.
Hi @govint
Today we started having issues where our vSphere volume driver began timing out on every docker volume operation. We did not get a file descriptor issue in our Docker daemon like we had in the past, only timeouts. I'm guessing this might be something that was recently fixed in Docker EE, as mentioned in the latest release notes.
We took a look at the processes using the most file descriptors on the worker nodes, and sure enough, the vsphere and ucp-agent processes were consuming a lot of them.
[root@ourhost ~]# for d in `ls -d /proc/[0-9]*`; do echo "`ls $d/fd | wc -l` $d"; done | sort -n | tail -10
ls: cannot access /proc/29810/fd: No such file or directory
35 /proc/24989
36 /proc/1
36 /proc/16187
38 /proc/25383
54 /proc/1624
90 /proc/1479
114 /proc/24192
269 /proc/747
22963 /proc/16204
30319 /proc/24232
[root@ourhost ~]# ps -ef | grep 24232
root 24232 24215 0 Apr02 ? 00:29:04 /usr/bin/vsphere-storage-for-docker --config /etc/vsphere-storage-for-docker.conf
root 31275 29249 0 15:06 pts/0 00:00:00 grep --color=auto 24232
[root@ourhost ~]# ps -ef | grep 16204
root 16204 16187 0 Apr02 ? 03:16:50 /bin/ucp-agent proxy --disk-usage-interval 2h --metrics-scrape-interval 1m
root 31408 29249 0 15:06 pts/0 00:00:00 grep --color=auto 16204
Environment Details:
Docker UCP with vSphere Storage for Docker volume driver.
Steps to Reproduce: Intermittent... Still don't know how to reproduce it. :(
Expected Result:
Services to relocate from node to node as designed without bringing down the node
Actual Result:
Every 2-10 days, when one of our containers is being rebuilt and having its service updated, file descriptors start increasing at a consistent rate of around 200 per hour.
This problem actually does not occur most of the time. I've even tried to make it occur, without success. The underlying trigger is not known yet, but we strongly suspect the bug is in the VMware vSphere volume driver.
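Since the climb is roughly 200 descriptors per hour, a simple sampling loop makes the trend visible over time. This is only a sketch: the PID (here the current shell) and the sampling interval are placeholders:

```shell
# Sample a process's fd count repeatedly; a steady climb indicates a leak.
pid=$$        # substitute the vsphere-storage-for-docker PID here
samples=""
for i in 1 2 3; do
    samples="$samples $(ls /proc/$pid/fd | wc -l)"
    # sleep 60  # uncomment to sample once a minute in practice
done
echo "fd count samples:$samples"
```

Logging timestamped counts this way would let us correlate the start of the leak with a specific service update.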
Triage:
Here is what we have observed after some underlying issue occurs: a docker service update or docker stack deploy causes a new container to be brought up to replace an out-of-date container. @bteichner has been the point of contact for these issues with VMware.