rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

Extremely slow when deploying new stacks, service upgrades and restarts #8810

Closed: baillieny closed this issue 6 years ago

baillieny commented 7 years ago

Rancher versions:
rancher/server: 1.5.5
rancher/agent: 1.2.2

Infrastructure Stack versions:
healthcheck: 0.2.3
ipsec: 0.11.2
network-services: 0.6.6
scheduler: 0.7.5
kubernetes (if applicable): n/a

Docker version: (docker version, docker info preferred)

Containers: 12
 Running: 12
 Paused: 0
 Stopped: 0
Images: 15
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.21-rancher
Operating System: RancherOS v1.0.0
OSType: linux
Architecture: x86_64
CPUs: 10
Total Memory: 19.61 GiB
Name: ITEDC1MMAX02
ID: BJAD:JZTL:WXKZ:6SJH:AMOL:4457:NKS7:QOJL:MF3W:K4BY:FYEB:I2QM
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://docker-mirror.docker.tattsgroup.com/
Live Restore Enabled: false

The same problem also occurs on RHEL Docker hosts.

Operating system and kernel: (cat /etc/os-release, uname -r preferred) RancherOS v1.0.0, kernel 4.9.21-rancher

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-Metal on vSphere

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) single node, external MySQL DB

Environment Template: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce: deploy a new stack, upgrade a service, or restart a service. The slowness occurs on all hosts in all environments.
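A minimal sketch of how the slow path can be timed from the CLI, assuming rancher-compose is pointed at this environment's API endpoint and keys (the stack name and compose file here are placeholders):

# Time a stack deployment that goes through Rancher's scheduler
time rancher-compose --url http://<rancher-server>:8080 \
  --access-key <ACCESS_KEY> --secret-key <SECRET_KEY> \
  --project-name slow-deploy-test --file docker-compose.yml up -d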

Results:
- It takes up to 15 minutes to deploy and start a 12 MB container.
- The servers running Rancher Server, the database, and the Cattle Docker hosts all show low CPU/memory/disk utilization.
- When the image is pulled and run manually on the Cattle Docker host, the container starts in under 1 minute.
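For comparison, a minimal sketch of the manual test described above, run directly on the Cattle Docker host and bypassing Rancher scheduling (the image name is a placeholder):

# Pull and start the same image directly on the host, bypassing Rancher
time docker pull <registry>/<image>:<tag>
time docker run -d --name manual-test <registry>/<image>:<tag>
docker logs -f manual-test    # confirm the container actually comes up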

baillieny commented 7 years ago

Rancher server log snippet:

reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:34,868 ERROR [757c1026-7380-4479-995b-6b8154405918:5092772] [instance:113084] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed [3/5] to Starting for instance [113084]
2017-05-18 23:24:34,935 ERROR [757c1026-7380-4479-995b-6b8154405918:5092772] [instance:113084->instanceHostMap:86704] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:34,935 ERROR [757c1026-7380-4479-995b-6b8154405918:5092772] [instance:113084] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed [4/5] to Starting for instance [113084]
2017-05-18 23:24:35,176 ERROR [473d3c1a-142f-42d3-9f22-cbe26de036d9:4962677] [service:4320] [service.update] [] [cutorService-10] [c.p.e.p.i.DefaultProcessInstanceImpl] Expected state running but got stopped: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\"rootfs_linux.go:54: mounting \\\\"/var/lib/docker/containers/1291aff9c55c5104d76001e79de867655b3fc7bc43dde13ea8051c1256432639/secrets\\\\" to rootfs \\\\"/var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged\\\\" at \\\\"/var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged/run/secrets\\\\" caused \\\\"mkdir /var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged/run/secrets: read-only file system\\\\"\\"\"\n"
2017-05-18 23:24:35,186 ERROR [:] [] [] [] [cutorService-10] [.e.s.i.ProcessInstanceDispatcherImpl] Expected state running but got stopped: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\"rootfs_linux.go:54: mounting \\\\"/var/lib/docker/containers/1291aff9c55c5104d76001e79de867655b3fc7bc43dde13ea8051c1256432639/secrets\\\\" to rootfs \\\\"/var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged\\\\" at \\\\"/var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged/run/secrets\\\\" caused \\\\"mkdir /var/lib/docker/overlay/ffb1f07daec099428de951c929dade0e4ff20cf7dda0a519fa121fb0c7eccb08/merged/run/secrets: read-only file system\\\\"\\"\"\n"
2017-05-18 23:24:35,544 ERROR [e9cdf5a7-cfed-4957-a817-2da2d5476814:5092466] [instance:113073->instanceHostMap:86693] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-4] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:35,544 ERROR [e9cdf5a7-cfed-4957-a817-2da2d5476814:5092466] [instance:113073] [instance.start->(InstanceStart)] [] [ecutorService-4] [i.c.p.process.instance.InstanceStart] Failed [5/5] to Starting for instance [113073]
2017-05-18 23:24:35,567 ERROR [e9cdf5a7-cfed-4957-a817-2da2d5476814:5092466] [instance:113073] [instance.start] [] [ecutorService-4] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:35,587 ERROR [:] [] [] [] [ecutorService-4] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:37,494 ERROR [29ca1062-c6d1-4de8-b6c1-278b5ac6a2c9:5092785] [instance:113085->instanceHostMap:86705] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-33] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:37,494 ERROR [29ca1062-c6d1-4de8-b6c1-278b5ac6a2c9:5092785] [instance:113085] [instance.start->(InstanceStart)] [] [cutorService-33] [i.c.p.process.instance.InstanceStart] Failed [3/5] to Starting for instance [113085]
2017-05-18 23:24:37,554 ERROR [29ca1062-c6d1-4de8-b6c1-278b5ac6a2c9:5092785] [instance:113085->instanceHostMap:86705] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-33] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:37,554 ERROR [29ca1062-c6d1-4de8-b6c1-278b5ac6a2c9:5092785] [instance:113085] [instance.start->(InstanceStart)] [] [cutorService-33] [i.c.p.process.instance.InstanceStart] Failed [4/5] to Starting for instance [113085]
2017-05-18 23:24:38,630 ERROR [8972e3c5-40da-48e7-9bb6-4770a9c037bf:5092806] [instance:113086->instanceHostMap:86706] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-8] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:38,630 ERROR [8972e3c5-40da-48e7-9bb6-4770a9c037bf:5092806] [instance:113086] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed [3/5] to Starting for instance [113086]
2017-05-18 23:24:38,689 ERROR [8972e3c5-40da-48e7-9bb6-4770a9c037bf:5092806] [instance:113086->instanceHostMap:86706] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-8] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:38,689 ERROR [8972e3c5-40da-48e7-9bb6-4770a9c037bf:5092806] [instance:113086] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed [4/5] to Starting for instance [113086]
2017-05-18 23:24:40,930 ERROR [703f36ce-f263-45da-88a1-c2cfe305f5e6:5093097] [instance:113095->instanceHostMap:86714] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-34] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:40,930 ERROR [703f36ce-f263-45da-88a1-c2cfe305f5e6:5093097] [instance:113095] [instance.start->(InstanceStart)] [] [cutorService-34] [i.c.p.process.instance.InstanceStart] Failed [1/5] to Starting for instance [113095]
2017-05-18 23:24:40,994 ERROR [703f36ce-f263-45da-88a1-c2cfe305f5e6:5093097] [instance:113095->instanceHostMap:86714] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-34] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
2017-05-18 23:24:40,994 ERROR [703f36ce-f263-45da-88a1-c2cfe305f5e6:5093097] [instance:113095] [instance.start->(InstanceStart)] [] [cutorService-34] [i.c.p.process.instance.InstanceStart] Failed [2/5] to Starting for instance [113095]
2017-05-18 23:24:41,676 ERROR [6adfd60a-39e5-4a4a-a10a-ac68717ee103:5092812] [instance:113087->instanceHostMap:86707] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-16] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=16741]: Timeout getting IP address
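The recurring error here is the agent timing out while getting an IP address for each instance. For reference, a sketch of host-side checks that can be run on the affected Cattle host to see whether the agent and the networking infrastructure containers are healthy (container names assume the default Rancher 1.x naming and may differ):

# On the affected Cattle host: confirm the agent is up and look at its recent output
docker ps --filter name=rancher-agent
docker logs --tail 100 rancher-agent
# Check that the networking/healthcheck infrastructure containers are running
docker ps | grep -E 'ipsec|network-services|healthcheck'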

baillieny commented 7 years ago

I also tried running rancher/cleanup-1-1:v0.1.2, which was suggested in one of the forums; 0 rows were deleted. The database server and Rancher server have both been restarted several times.
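For anyone repeating this, a rough sketch of how the cleanup image can be pointed at an external MySQL database. This assumes it reads the same CATTLE_DB_CATTLE_MYSQL_* variables that rancher/server uses for an external DB, so double-check the image's README for the exact variable names:

# Variable names below are assumed to match rancher/server's external-DB settings; verify against the image docs
docker run --rm \
  -e CATTLE_DB_CATTLE_MYSQL_HOST=<db-host> \
  -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
  -e CATTLE_DB_CATTLE_MYSQL_NAME=cattle \
  -e CATTLE_DB_CATTLE_USERNAME=<db-user> \
  -e CATTLE_DB_CATTLE_PASSWORD=<db-password> \
  rancher/cleanup-1-1:v0.1.2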

superseb commented 7 years ago

It looks like you are hitting https://github.com/rancher/rancher/issues/8582, but are you sure you are seeing the same behavior on RancherOS? That would not make sense given the linked ticket.

baillieny commented 7 years ago

I found that there were a large number of processes in a delayed state; this was caused by a host in one of the environments being offline. It seems this backlog impacts Rancher's ability to schedule new tasks.
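For the record, a sketch of how the process backlog can be checked directly against the external Cattle database. This assumes the standard Cattle schema, where processes that have not finished have a NULL end_time in process_instance (verify the table and column names against your schema); the same queue is also visible in the UI under Admin > Processes.

# Count outstanding Cattle processes and group them by process name
# (table/column names assume the standard Cattle schema)
mysql -h <db-host> -u <db-user> -p cattle -e \
  "SELECT process_name, COUNT(*) AS pending FROM process_instance WHERE end_time IS NULL GROUP BY process_name ORDER BY pending DESC;"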

superseb commented 6 years ago

Let us know if you need anything else on this.