kmott opened this issue 6 years ago
Here are the logs from one of the VMs that eventually hard-locked:
I also have the VIB log files from each of our ESXi hosts, but there's a lot of other information in them unrelated to the problem at hand, so I'd prefer to send the log files via direct message rather than posting them here. Is that possible?
@kmott, are EQL1-DevOpsVol-01 and CMP1-WorkstationVol-01 datastores on the ESX host (check with ls /vmfs/volumes)? The extra volumes appear to be ones Docker is creating on its own. Could you check whether Minio creates these volumes via Docker?
Can you also post the output of cat /proc/mounts? If these volumes are all mounted and hence in use, that may explain why the VM slows down over time (assuming that many volumes really are in use at once).
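For example, something like this on the ESXi host and inside the guest should show what exists where (a minimal sketch; the dockvols path follows the layout that shows up in the plugin log further down and may differ per datastore):

```
# On the ESXi host: confirm which datastores are actually visible
ls /vmfs/volumes

# Optionally, list the VMDKs the plugin has created under a datastore's dockvols directory
ls -lh /vmfs/volumes/EQL1-DevOpsVol01/dockvols/_DEFAULT/

# Inside the guest VM: show everything currently mounted
cat /proc/mounts
```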
@govint, the datastores are legit on each of our 4 ESXi hosts, but the volumes with the long hash names that Docker is creating are not.
Minio doesn't create the volumes (as far as I know), it just uses whatever path(s) you specify when starting the container (and if the path(s) you specify point to docker volumes that are not initialized, minio will initialize them for you).
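Roughly, the Minio containers get their storage like this (a simplified sketch, not the actual Nomad job definition; whether the -v name@datastore form matches exactly what the Nomad docker driver emits is an assumption):

```
# Sketch: create a named volume on the vsphere driver, then hand it to Minio by name.
# Minio initializes /export itself if the volume is empty.
docker volume create --driver=vsphere minio1@CMP1-DevOps-Docker1
docker run -d --name minio1 -p 9000:9000 \
  -v minio1@CMP1-DevOps-Docker1:/export \
  minio/minio server /export
```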
Here's the cat /proc/mounts output:
root@cluster1-docker1:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=4125124,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,relatime,size=6603564k,mode=755 0 0
/dev/sda1 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=23,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
/dev/sdc1 /data/consul ext4 rw,relatime,data=ordered 0 0
/dev/sdd1 /var/lib/docker/volumes/default ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /data/nomad ext4 rw,relatime,data=ordered 0 0
rpc_pipefs /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda1 /var/lib/docker/plugins ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/sda1 /var/lib/docker/aufs ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
tmpfs /data/nomad/alloc/4a421d6e-d5aa-cc43-9518-88157e9dadbd/traefik-loadbalancer/secrets tmpfs rw,noexec,relatime,size=1024k 0 0
proc /run/docker/netns/default proc rw,nosuid,nodev,noexec,relatime 0 0
none /var/lib/docker/aufs/mnt/528aebed5fbed47d4b86667f30c55bfeb6556964071874706f5450bb835f3796 aufs rw,relatime,si=70c8e26d17d1fc77,dio,dirperm1 0 0
/dev/sda1 /var/lib/docker/containers/70db665016a27a1e6bae159946270f55dff629ade79f2e4c312d1bebb5c908e6/mounts ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
shm /var/lib/docker/containers/70db665016a27a1e6bae159946270f55dff629ade79f2e4c312d1bebb5c908e6/mounts/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
/dev/sda1 /var/lib/docker/plugins/d78002c4bba0460f35eaa97c2b6df4c0cb4f9d42c90b3b2eef781e2d28cc7bf7/propagated-mount ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/sda1 /var/lib/docker/plugins/d78002c4bba0460f35eaa97c2b6df4c0cb4f9d42c90b3b2eef781e2d28cc7bf7/rootfs/mnt/vmdk ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
Interestingly, it seems like if I do not restart the Minio job in Nomad (i.e., just leave it DEAD and reboot both Docker nodes), I don't have any additional stability issues. Note that docker volume ls still shows the defunct volumes in this case, but none of the minio volumes are mounted by a container or currently in use.
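For what it's worth, the unused ones can be picked out with standard docker filters (the hash below is one of the long names that also appears in the plugin log further down):

```
# Volumes not referenced by any container
docker volume ls --filter dangling=true

# Inspect one of the long-hash volumes to see which driver owns it
docker volume inspect 634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7
```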
From the vsphere plugin logs, these volumes with long names are actually anonymous volumes that Docker creates when it's asked to create a volume on the fly. One example is below; I'll need to check the Docker logs to see why these volumes are being created.
2018-03-09 15:36:08.835927686 -0800 PST [ERROR] Failed to get volume meta-data error="Volume 634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7 not found (file: /vmfs/volumes/EQL1-DevOpsVol01/dockvols/_DEFAULT/634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7.vmdk)" name=634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7
2018-03-09 15:36:10.095907092 -0800 PST [INFO] Attaching volume and creating filesystem name=940b0c940e4fbbed53898247d69b6274fd6782aead68d12338fd2ec1402dd9a8 fstype=ext4
2018-03-09 15:36:11.116030901 -0800 PST [INFO] Device file found. device="/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:5:0"
2018-03-09 15:36:11.951960645 -0800 PST [INFO] Attaching volume and creating filesystem name=634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7 fstype=ext4
2018-03-09 15:36:15.32395768 -0800 PST [INFO] Volume and filesystem created fstype=ext4 name=634bc83d5c9346d11d58b9d9c493a5402f1b18d7b1fe1cc322c0383b8e801fa7
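For context: whenever a container requests a mount point without naming a volume (an image's VOLUME instruction, or a bare -v /path), Docker creates an anonymous volume with a random 64-character hash name, and it does so through whichever volume driver is in effect for that container. A hypothetical invocation that would produce volumes like the ones above is sketched below; whether Nomad's docker driver is doing something equivalent is exactly what the docker logs should confirm:

```
# Hypothetical: the vsphere driver is set as the container's volume driver, but
# /export is not bound to a named volume, so Docker creates an anonymous vsphere
# volume whose name is a random 64-hex-character hash.
docker run -d --volume-driver=vsphere -v /export minio/minio server /export
```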
Doesn't seem to be an issue with the plugin as far as these volumes go. For the lock-up, please post the dmesg Linux kernel logs when it happens. Again, you could try increasing memory for the VM in case that's the issue for the workload generated by Minio.
@govint, the VMs have 32GB RAM each. I will get the dmesg and docker output from the lock-up; I just restarted all of the services this morning, so I should be able to post something tomorrow.
Note that I am not able to get the dmesg output: the system is hard-locked, and nothing is dumped to kern.log. The console is also unresponsive during this time; I have to hard-reset the VM to get anything back.
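Next time it happens I may try streaming kernel messages to another box over UDP so that something survives the hang (a sketch only; the IPs, ports, interface, and MAC below are placeholders for our network):

```
# On a second machine: listen for kernel messages over UDP (port 6666 here)
nc -u -l 6666            # some netcat variants want: nc -lu -p 6666

# On the affected VM: mirror kernel messages over the network with netconsole
# format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.10.21/eth0,6666@192.168.10.5/00:11:22:33:44:55
```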
Attached some docker logs during the time of the lock-up (system locked up just after Apr 9 @ 20:42:19 PST, local time on the box).
@kmott thanks I'll check these and get back.
@govint Did you have a chance to check things out?
Description:
I am trying a test run of the vDVS plugin and am running into a few problems. The main issue is that our cluster VMs running the Docker daemon seem to hard-lock after a little while. When I reboot them and re-add the Docker vSphere volumes, a bunch of extra volumes show up in the docker volume ls output (note that the only four I added via the docker volume create --driver=vsphere volname@datastore command are the minio volumes at the bottom).

Environment Details:
Steps to Reproduce:
docker volume create --driver=vsphere minio1@CMP1-DevOps-Docker1
docker volume create --driver=vsphere minio1@CMP1-DevOps-Docker2
docker volume create --driver=vsphere minio2@CMP1-DevOps-Docker1
docker volume create --driver=vsphere minio2@CMP1-DevOps-Docker2
Then check the docker volume ls output.

Expected Result:
The VM should not lock up, and docker volume ls should only list the volumes added from vSphere via the docker volume create ... syntax.

Actual Result:
The VM hard-locks, and docker volume ls shows unwanted Docker volumes.

Triage:
I am working on getting the VIB log files from each of the hosts, as well as the VM log file vsphere-storage-for-docker.log. Here's a snippet from docker volume ls: