vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0

File descriptors increasing - requires docker daemon restart and sometimes a node restart #2073

Closed mkboudreau closed 6 years ago

mkboudreau commented 6 years ago

Environment Details:

Docker UCP with vSphere Storage for Docker volume driver.

Steps to Reproduce: Intermittent... Still don't know how to reproduce it. :(

Expected Result:

Services relocate from node to node as designed, without bringing down the node.

Actual Result:

Every 2-10 days, when one of our containers is rebuilt and its service is updated, file descriptors start increasing at a consistent rate of around 200 per hour.

This problem does not occur most of the time; I've even tried to make it occur, without success. The underlying trigger is not known yet, but we strongly suspect the bug is in the VMware vSphere volume driver.

Triage:

Here is what we have observed after some underlying issue occurs:

@bteichner has been the point of contact for these issues with VMware.
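Since the growth is steady (roughly 200 descriptors per hour), one thing we can do while waiting for the next occurrence is log the daemon's descriptor count over time. A rough sketch of what we run on each worker node (assumes pidof is available and the daemon process is named dockerd):

# log the docker daemon's open file descriptor count every 5 minutes
while true; do
  echo "$(date -Is)  $(ls /proc/$(pidof dockerd)/fd | wc -l) fds"
  sleep 300
done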

govint commented 6 years ago

The file descriptors being owned by the Docker daemon would make this a Docker-side issue. The vSphere volume driver is a separate process altogether; as far as I know, the only file the plugin keeps open is the VMCI socket it uses to make calls into ESX, and that path has been in use all along.

Can we try lsof -p <pid> or ls -l /proc/<pid>/fd?

For logs, please set "debug" in the plugin config file and restart the plugin.
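For example, something like the following should show what the plugin process is holding open (a rough sketch; <pid> is a placeholder, and the process name vsphere-storage-for-docker is an assumption about how the plugin shows up on the host):

# find the plugin process and inspect its open descriptors
pgrep -f vsphere-storage-for-docker      # note the PID, substitute it for <pid> below
lsof -p <pid>                            # full listing of open files and sockets
ls -l /proc/<pid>/fd | wc -l             # quick count of open descriptors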

shuklanirdesh82 commented 6 years ago

Hey @govint

https://github.com/vmware/vsphere-storage-for-docker/issues/2073#issuecomment-367782421 For logs, please set "debug" in the plugin config file and restart the plugin.

Is there any regression that you know of? The plugin is not able to parse the plugin config file after a restart.

mkboudreau commented 6 years ago

Regarding logs... we've been struggling to get it working from the config file, even though we've been told to do it that way. The config.json-template file seems to set the VDVS_LOG_LEVEL env var, and it appears that if that var is set, the log level from the config file is never considered (see config.go). Am I understanding this correctly?

config.go: https://github.com/vmware/vsphere-storage-for-docker/blob/master/client_plugin/utils/config/config.go

config.json-template: https://github.com/vmware/vsphere-storage-for-docker/blob/master/plugin_dockerbuild/config.json-template
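If that reading is right, a quick way to see which value is actually winning would be to check the env vars currently set on the managed plugin (a sketch using the standard docker plugin inspect command; the plugin name vsphere is an assumption):

# show the env vars the managed plugin is running with, including VDVS_LOG_LEVEL if set
docker plugin inspect vsphere --format '{{.Settings.Env}}'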

bteichner commented 6 years ago

Enabled debug logging with the following commands:

docker plugin disable -f vsphere
docker plugin set vsphere VDVS_LOG_LEVEL="debug"
docker plugin enable vsphere

govint commented 6 years ago

@mkboudreau, @bteichner, can you confirm what the issue is with the Docker daemon, given the observed behavior? Do you wish to keep this issue open here, as we presently don't see any bug in the volume plugin to fix?

mkboudreau commented 6 years ago

Since we turned debug logging on, we have not encountered the issue. It has always taken between 2 and 15 days between incidents. Please keep the issue open a little longer. I would really like to see this happen while we have debug logging turned on.

govint commented 6 years ago

@mkboudreau, sure no problem.

govint commented 6 years ago

@mkboudreau, can you give an update on this issue? Can we close it if there aren't any more updates?

mkboudreau commented 6 years ago

Thank you for following up. Go ahead and close and I can always reopen it if needed.

govint commented 6 years ago

Closing.

mkboudreau commented 6 years ago

Hi @govint

Today we started having some issues where our vSphere volume driver started timing out on every Docker volume operation. We did not get a file descriptor issue in our Docker daemon like we had in the past, only timeouts. I'm guessing this might be something that was recently fixed in Docker EE, as mentioned in the latest release notes.

We took a look at the processes using the most file descriptors on the worker nodes, and sure enough, the vsphere and ucp-agent processes were consuming a lot of file descriptors.

[root@ourhost ~]# for d in `ls -d /proc/[0-9]*`; do   echo "`ls $d/fd | wc -l`          $d"; done | sort -n | tail -10
ls: cannot access /proc/29810/fd: No such file or directory
35          /proc/24989
36          /proc/1
36          /proc/16187
38          /proc/25383
54          /proc/1624
90          /proc/1479
114          /proc/24192
269          /proc/747
22963          /proc/16204
30319          /proc/24232
[root@ourhost ~]# ps -ef | grep 24232
root     24232 24215  0 Apr02 ?        00:29:04 /usr/bin/vsphere-storage-for-docker --config /etc/vsphere-storage-for-docker.conf
root     31275 29249  0 15:06 pts/0    00:00:00 grep --color=auto 24232
[root@ourhost ~]# ps -ef | grep 16204
root     16204 16187  0 Apr02 ?        03:16:50 /bin/ucp-agent proxy --disk-usage-interval 2h --metrics-scrape-interval 1m
root     31408 29249  0 15:06 pts/0    00:00:00 grep --color=auto 16204
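As a next step we'll try to break down what those descriptors actually point at. Something along these lines (a sketch using the plugin PID 24232 from the output above) should show whether they are sockets, pipes, or regular files:

# group the plugin's open descriptors by target (inode numbers stripped so identical targets group together)
ls -l /proc/24232/fd | tail -n +2 | awk '{print $NF}' | sed 's/[0-9]*//g' | sort | uniq -c | sort -rn | head
# or group by descriptor type (IPv4, unix, FIFO, REG, ...) with lsof
lsof -p 24232 | awk '{print $5}' | sort | uniq -c | sort -rn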