moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Swarm Nodes Become Unavailable When Swarm is Under Heavy Load #32841

Closed afrazkhan closed 6 years ago

afrazkhan commented 7 years ago

BUG REPORT INFORMATION

Description

When any of the services in the swarm are under heavy load, one or more services (possibly all) become unavailable due to the nodes they're running on becoming unavailable. Log info below should clarify what this means.

Steps to reproduce the issue:

  1. Create swarm and some services in usual way
  2. Put one or more of those services under very heavy load (e.g. 99% CPU); one way to do this is sketched below
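
For illustration, a throwaway CPU burner can be created as a swarm service; the image, service name, and replica count here are arbitrary, and each replica pins roughly one core:

# remove the load generator afterwards with: docker service rm cpu-burn
docker service create --name cpu-burn --replicas 3 alpine sh -c "yes > /dev/null"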

Describe the results you received: Services become unreachable. For example, if you try to connect from one service to a service running MySQL, you will get "ERROR 2005 (HY000): Unknown MySQL server host 'db-master' (0)" -- where "db-master" is the name of the service running MySQL.

Docker Engine logs show:

... "memberlist: Marking ip-10-0-27-201-3b736f1d4651 as failed, suspect timeout reached"
... "memberlist: Failed TCP fallback ping: read tcp 10.0.25.121:52898->10.0.27.201:7946: i/o timeout"
... "memberlist: Suspect ip-10-0-27-201-3b736f1d4651 has failed, no acks received"
... "memberlist: Marking ip-10-0-27-201-3b736f1d4651 as failed, suspect timeout reached"
...
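
For reference, that resolution failure can be reproduced by hand from another container attached to the same overlay network; the container name below is a placeholder and nslookup is assumed to be present in the image:

docker exec -it <app-container> cat /etc/resolv.conf   # should point at the embedded DNS server 127.0.0.11
docker exec -it <app-container> nslookup db-master     # a timeout or NXDOMAIN here reproduces the ERROR 2005 symptom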

Describe the results you expected: Swarm services to stay up and discoverable even under heavy load. It looks like there aren't enough resources left for the nodes to effectively exchange information about the state of the swarm?

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 8
 Running: 4
 Paused: 0
 Stopped: 4
Images: 9
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 176
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: cfoixqufmzsoe58np7go5ftwj
 Is Manager: true
 ClusterID: ez0wwtl7pywsplpnygul8y2v8
 Managers: 4
 Nodes: 9
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.27.240
 Manager Addresses:
  10.0.25.121:2377
  10.0.25.171:2377
  10.0.27.240:2377
  10.0.27.59:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-45-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.952 GiB
Name: ip-10-0-27-240
ID: 6LVV:2FUY:E2TD:DKIR:LFAX:3HF6:6ZZI:65CM:F7YY:CQ2X:IVJO:F23E
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 type=web
 environment=production
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

cpuguy83 commented 7 years ago

Are you running services on your managers?

afrazkhan commented 7 years ago

Yes. Originally just a fairly lightweight Sinatra app I wrote to manage the swarm, plus Nginx to serve it. Though I've since also promoted a couple of former workers to managers.

Are you maybe thinking that manager nodes under stress are the only thing that matters? If so, I could try going back to dedicated managers to see if it helps.
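
For reference, dedicated managers are usually achieved by draining them so that no tasks get scheduled there; the node name is a placeholder:

docker node update --availability drain <manager-node>
docker node ls    # the AVAILABILITY column should now read Drain for that node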

cpuguy83 commented 7 years ago

Seems like you are getting timeouts for DNS resolution, which is not unreasonable if nodes are under high stress.

I'm not 100% certain on how DNS resolution happens wrt the swarm. Ping @sanimej

afrazkhan commented 7 years ago

Any tips on performance profiling would be appreciated too. At the moment, a few requests come in to the web services and our latency goes to pot. Would be great to see why it hits the swarm so hard.

afrazkhan commented 7 years ago

Interestingly (read: terrifyingly), this now appears to be happening even when the load on all services and nodes is reasonable.

afrazkhan commented 7 years ago

Reading around, this seems related more to the use of an encrypted overlay network: https://github.com/moby/moby/issues/26523

I'll probably close this issue and open a new one with a more accurate description. Feel free to comment before that's done though!
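
For context, data-plane encryption is an opt-in flag at overlay network creation time, so the two variants are easy to compare side by side; the network names are placeholders:

docker network create -d overlay app-net                      # unencrypted data plane
docker network create -d overlay --opt encrypted app-net-enc  # adds IPsec (ESP, IP protocol 50) tunnels between nodes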

Nossnevs commented 7 years ago

We also have this problem. We do not connect it to high load from a service but rather to high load or high IO on the host (higher than average load). Ubuntu runs its daily cron at 06:25, and this causes our Docker swarm cluster to lose connectivity between the nodes. We think the main reason is that apt gzips something during that job. We also see this issue at random load spikes.

We have tested to monitor the connectivity on the host level by all nodes pinging all other nodes. That showed that there was no problem on the host level.

It seems to be Docker Swarm's way of handling the network between the nodes that is very sensitive.

These issues may be related: https://github.com/moby/moby/issues/32195 https://github.com/moby/moby/issues/32738
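
A minimal sketch of that kind of host-level check, assuming it is run from cron on every node; the node IPs are examples:

#!/bin/sh
for ip in 10.0.25.121 10.0.25.171 10.0.27.240 10.0.27.59; do
  ping -c 1 -W 1 "$ip" > /dev/null 2>&1 || echo "$(date -u) unreachable: $ip" >> /var/log/swarm-ping.log
done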

afrazkhan commented 7 years ago

I can now also confirm that this has little if anything to do with CPU load, as I had previously thought. It is also happening at random.

cpuguy83 commented 7 years ago

This may just be the gossip convergence issue.

afrazkhan commented 7 years ago

@cpuguy83 As in: there is a problem in the swarm implementation where gossip convergence doesn't happen? If so, is it being worked on, or is there anything I can do about it?

sanimej commented 7 years ago

@afrazkhan

#26523 may not be relevant here because in that issue the problem was that ESP (IP protocol 50) wasn't reachable between hosts. To identify the root cause clearly, can you help with these questions..

afrazkhan commented 7 years ago

Yeah ESP is allowed. Interestingly though, I did hear of another issue somewhere (sorry, can't find it now) where this problem went away after disabling encryption on the overlay network they were using.

"try using IP directly in the client and see if that works"

Tricky, because I only know the IP from the resolution I get from the swarm. If the IPs don't change, then I can add it to the debugging I've already got in place.

"high CPU running on the manager"

Originally, yes. At one point I decided to make all nodes managers. I've since gone back to dedicated managers, but it hasn't affected the rate at which this problem occurs. Digging through the logs, I've found that this problem was also happening on our test platform way back, but we just never noticed.

"Does the problem start when the CPU load was very high"

That was my initial understanding, but since then I've come to realise that it likely has little to do with load, and I was just mixing up the same symptom from two different problems: High CPU meaning the node genuinely isn't available to participate properly in the swarm, and whatever is causing this overlay network issue.

"are they being scaled up/down or deleted and recreated"

Nope.

I'll get you the results from the other two things you asked when I can. Need to add it to the debugging code I have running first, then wait for the problem again :P

afrazkhan commented 7 years ago

After moving traffic to our DBs off the overlay network and on to the normal network, we no longer get the issue. I think this points to either an encrypted overlay network issue, or a general overlay network issue. More likely to do with encryption though, since I remember someone saying that they had the problem only when using encryption on their overlay network.

Happy to provide debug logs, but not sure what would be useful.
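
For what it's worth, daemon debug logging can be turned on without restarting containers; a sketch, assuming the default daemon.json location:

# /etc/docker/daemon.json
{
  "debug": true
}

# reload the running daemon's configuration in place
kill -HUP $(pidof dockerd)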

afrazkhan commented 7 years ago

Spoke too soon, we still have the problem, even with normal networking. It happens seemingly at random, but I'm sure that can't be. Any suggestions on which logs I should get, and from which types of nodes?
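
For collecting them, the memberlist/heartbeat messages end up in the engine journal on every node, managers and workers alike; a sketch assuming a systemd-based install:

journalctl -u docker.service --since "2 hours ago" | grep -Ei "memberlist|heartbeat|node join|node leave"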

jgranstrom commented 7 years ago

@afrazkhan @cpuguy83 I have a very similar problem running 3 nodes with one manager. I am getting the same error message and symptoms as you describe. This happens with or without load. I have tested multiple docker versions, such as 17.03, 17.04 and 17.05 (community edition), and am currently running 17.05. All machines are running Ubuntu 16.04 on 4.4.0-78-generic at DigitalOcean. I am using the private network feature of DO and thus have two interfaces on each node, eth0 and eth1. I am using the IP of eth1 (the private network) as the advertise-addr for the swarm.

Firewalld is not used, but UFW is enabled and allowing all the ports required for docker, gossip and swarm. Docker is not running in experimental mode, and a single overlay network is used that is not internal and not attachable. All services are using this network, and we are running 78 services, most of them a single replica.
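
For reference, the swarm-related ports a rule set like that usually has to cover are listed below; the ESP rule only matters when overlay encryption is enabled:

ufw allow 2377/tcp                  # cluster management traffic to managers
ufw allow 7946/tcp                  # gossip / memberlist
ufw allow 7946/udp                  # gossip / memberlist
ufw allow 4789/udp                  # VXLAN overlay data plane
iptables -I INPUT -p esp -j ACCEPT  # IPsec ESP, only for --opt encrypted overlays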

This problem happens across all nodes, and with plenty of margin in CPU, memory and disk resources. It can happen once a day or multiple times a day with an hour or two in between. When this happens it sometimes fixes itself by restarting the affected containers that are getting errors. Other times I have had to restart either the docker daemon or the host machine(s). This also happens when "nothing else" is happening on any of the machines, no heavy processing or traffic, on pretty much idle machines.

Sometimes the problem manifests itself by the DNS not resolving the service IP, running dig in the affected container does not give any service IP for the service name. Other containers can resolve the IP correctly during this, and using the service IP directly still works. Other times the container is able to resolve the IP but cannot connect to it, it simply times out. This happens consistently and keeps returning after a while even after rebooting one or all nodes, and fully purging docker and reinstalling it from scratch, recreating the swarm and services.
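
For reference, the split between the service VIP and the individual task IPs can be checked from inside an affected container; a sketch, assuming dig is available in the image and my-service is a placeholder name:

dig +short my-service          # the service VIP, answered by the embedded DNS at 127.0.0.11
dig +short tasks.my-service    # DNS round-robin record listing the individual task IPs
curl -sv http://<task-ip>/     # reachability test that bypasses name resolution entirely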

The only thing I have found to be a likely correlation with this is flaky network connectivity on DO's networks. I am currently working on finding out if there are general connectivity issues at the times these problems occur.

Digging through a lot of logs I have found some additional details at the time of these issues.

Sometimes, but not always, I get a lot of these warnings on all nodes directly after one node is marked as failed, including on the failing node:

May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.316897169Z" level=warning msg="Neighbor entry already present for IP 10.255.0.7, mac 02:42:0a:ff:00:07"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.316956129Z" level=warning msg="Neighbor entry already present for IP 10.132.80.107, mac 02:42:0a:ff:00:07"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.317109927Z" level=warning msg="Neighbor entry already present for IP 10.255.0.2, mac 02:42:0a:ff:00:02"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.317145978Z" level=warning msg="Neighbor entry already present for IP 10.132.80.107, mac 02:42:0a:ff:00:02"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.317170163Z" level=warning msg="Neighbor entry already present for IP 10.255.0.13, mac 02:42:0a:ff:00:0d"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.317182730Z" level=warning msg="Neighbor entry already present for IP 10.132.80.107, mac 02:42:0a:ff:00:0d"
May 25 04:09:13 marvin dockerd[1500]: time="2017-05-25T04:09:13.318987318Z" level=warning msg="Neighbor entry already present for IP 10.0.0.220, mac 02:42:0a:00:00:dc"

Those are not seen otherwise when services are restarted or created.

When one node is lost, the other nodes sometimes get a lot of these errors when services from the failing node are being moved over to the other nodes:

May 24 04:02:40 alan dockerd[1493]: time="2017-05-24T04:02:40.312429804Z" level=error msg="Failed to delete real server 10.0.0.57 for vip 10.0.0.56 fwmark 673 in sbox bc575c2 (6c481dc): no such process"
May 24 04:02:40 alan dockerd[1493]: time="2017-05-24T04:02:40.312495946Z" level=error msg="Failed to delete service for vip 10.0.0.56 fwmark 673 in sbox bc575c2 (6c481dc): no such process"
May 24 04:02:40 alan dockerd[1493]: time="2017-05-24T04:02:40Z" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.56/32 -j MARK --set-mark 673]:  (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.56/32 -j MARK --set-mark 673: iptables: No chain/target/match by that name.\n (exit status 1))"
May 24 04:02:40 alan dockerd[1493]: time="2017-05-24T04:02:40.365630123Z" level=error msg="Failed to delete firewall mark rule in sbox bc575c2 (6c481dc): reexec failed: exit status 5"

I often get a bunch of these when services are moved or restarted:

May 24 04:02:38 alan systemd-udevd[1840]: Could not generate persistent MAC address for vethcade3f4: No such file or directory
May 24 04:02:38 alan systemd-udevd[1834]: Could not generate persistent MAC address for veth8a676b9: No such file or directory
May 24 04:02:38 alan systemd-udevd[1852]: Could not generate persistent MAC address for vethdca61b8: No such file or directory
May 24 04:02:38 alan systemd-udevd[1856]: Could not generate persistent MAC address for veth66ea25c: No such file or directory

On one of these occasions I got a nasty kernel panic on the failing node right after the other nodes marked it as failing, indicating some kind of deadlock? At this point the machine went completely dark:

May 24 04:03:21 marvin kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 28s! [kworker/3:1:2854]
May 24 04:03:22 marvin kernel: Modules linked in: sysdigcloud_probe(OE) xt_REDIRECT nf_nat_redirect ip_vs_rr xt_ipvs ip_vs xt_nat veth vxlan ip6_udp_tunnel udp_tunnel iptable_mangle xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc aufs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ppdev kvm_intel pvpanic input_leds joydev kvm parport_pc irqbypass parport i2c_piix4 8250_fintek serio_raw mac_hid ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 xt_comment nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_recent xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast
May 24 04:03:22 marvin kernel:  ib_iser nf_nat_ftp rdma_cm nf_nat iw_cm nf_conntrack_ftp ib_cm nf_conntrack ib_sa ib_mad iptable_filter ib_core ip_tables ib_addr x_tables iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel qxl ttm aesni_intel drm_kms_helper aes_x86_64 syscopyarea lrw sysfillrect gf128mul sysimgblt glue_helper fb_sys_fops ablk_helper cryptd drm psmouse virtio_scsi pata_acpi floppy [last unloaded: sysdigcloud_probe]
May 24 04:03:22 marvin kernel: CPU: 3 PID: 2854 Comm: kworker/3:1 Tainted: G           OE   4.4.0-78-generic #99-Ubuntu
May 24 04:03:22 marvin kernel: Hardware name: DigitalOcean Droplet, BIOS 20161103 11/03/2016
May 24 04:03:22 marvin kernel: Workqueue: events qxl_fb_work [qxl]
May 24 04:03:22 marvin kernel: task: ffff88030f390000 ti: ffff880438150000 task.ti: ffff880438150000
May 24 04:03:22 marvin kernel: RIP: 0010:[<ffffffff81840465>]  [<ffffffff81840465>] _raw_spin_unlock_irqrestore+0x15/0x20
May 24 04:03:22 marvin kernel: RSP: 0018:ffff880438153bc8  EFLAGS: 00000246
May 24 04:03:22 marvin kernel: RAX: 0000000000000001 RBX: ffff8807f41e8480 RCX: 0000000000000000
May 24 04:03:22 marvin kernel: RDX: 000000000000c0a0 RSI: 0000000000000246 RDI: 0000000000000246
May 24 04:03:22 marvin kernel: RBP: ffff880438153bc8 R08: 0000000000000400 R09: 0000000000000000
May 24 04:03:22 marvin kernel: R10: 000000000000f000 R11: 0000000000000000 R12: ffffc900035dd00c
May 24 04:03:22 marvin kernel: R13: ffff8807f41e84a0 R14: 0000000000000246 R15: ffff880438153c48
May 24 04:03:22 marvin kernel: FS:  0000000000000000(0000) GS:ffff8807ff2c0000(0000) knlGS:0000000000000000
May 24 04:03:22 marvin kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 24 04:03:22 marvin kernel: CR2: 00007f548c0ed000 CR3: 0000000001e0a000 CR4: 00000000001406e0
May 24 04:03:22 marvin kernel: Stack:
May 24 04:03:22 marvin kernel:  ffff880438153c38 ffffffffc03d8ca9 ffff880400000000 ffffffffc02b5a91
May 24 04:03:22 marvin kernel:  ffffc90008000000 ffff880208372000 0000000000000000 ffff8807f4414218
May 24 04:03:22 marvin kernel:  000000004215b860 0000000000001000 ffff8807f4414000 0000000000000020
May 24 04:03:22 marvin kernel: Call Trace:
May 24 04:03:22 marvin kernel:  [<ffffffffc03d8ca9>] qxl_ring_push+0x99/0x250 [qxl]
May 24 04:03:22 marvin kernel:  [<ffffffffc02b5a91>] ? ttm_mem_io_reserve+0x41/0xc0 [ttm]
May 24 04:03:22 marvin kernel:  [<ffffffffc03d8ece>] qxl_push_command_ring_release+0x6e/0x90 [qxl]
May 24 04:03:22 marvin kernel:  [<ffffffffc03da78e>] qxl_draw_opaque_fb+0x27e/0x3a0 [qxl]
May 24 04:03:22 marvin kernel:  [<ffffffffc03d6f01>] qxl_fb_dirty_flush+0x1a1/0x250 [qxl]
May 24 04:03:22 marvin kernel:  [<ffffffffc03d6fc9>] qxl_fb_work+0x19/0x20 [qxl]
May 24 04:03:22 marvin kernel:  [<ffffffff8109a565>] process_one_work+0x165/0x480
May 24 04:03:22 marvin kernel:  [<ffffffff8109a8cb>] worker_thread+0x4b/0x4c0
May 24 04:03:22 marvin kernel:  [<ffffffff8109a880>] ? process_one_work+0x480/0x480
May 24 04:03:22 marvin kernel:  [<ffffffff8109a880>] ? process_one_work+0x480/0x480
May 24 04:03:22 marvin kernel:  [<ffffffff810a0bf8>] kthread+0xd8/0xf0
May 24 04:03:22 marvin kernel:  [<ffffffff810a0b20>] ? kthread_create_on_node+0x1e0/0x1e0
May 24 04:03:22 marvin kernel:  [<ffffffff81840dcf>] ret_from_fork+0x3f/0x70
May 24 04:03:22 marvin kernel:  [<ffffffff810a0b20>] ? kthread_create_on_node+0x1e0/0x1e0
May 24 04:03:22 marvin kernel: Code: 1f 44 00 00 eb c6 31 c0 eb ca e8 97 0c 84 ff 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 ff 14 25 48 c3 e2 81 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 

Note that this specifically only happened once out of multiple occasions.

Sorry for it being a bit all over the place, but that is kind of how the situation is at this point and I'm trying to piece together all the intel I can find in order for something to maybe stand out. Thanks!

DBLaci commented 7 years ago

I had similar problems to those stated above (with Ubuntu 16.04.2 LTS, 4.4.0-78-generic kernel), but I never got a kernel panic and the VPS is not at DigitalOcean.

Installing linux-generic-hwe-16.04 (apt-get install linux-generic-hwe-16.04) brought in the 4.8 kernel (4.8.0-52-generic) and the problem seems to be gone since.

jgranstrom commented 7 years ago

What is weird is that after I get this error on the manager

May 25 17:26:29 alan dockerd[1493]: time="2017-05-25T17:26:29.334172883Z" level=info msg="memberlist: Marking marvin-2363cd64d55c as failed, suspect timeout reached"

The failing node is still marked as active

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
dpylaeisxrvly9ldb89jnmi0o *   alan                Ready               Active              Leader
sl6kdp1c5cwya9fetsvi1g2xp     marvin              Ready               Active
val9dvjji0pgjo3bvy1do3mtn     ada                 Ready               Active

And services running on the failed node are still "running" according to docker service ps. (Which they in reality are, but on a failed node).

This means that the services are not automatically moved to another host and become unavailable to all other containers. In this case the service name cannot be resolved, and the service/task IP cannot be reached due to no route to host.

However, after stopping the containers on the failed node and letting Docker start them automatically on the same host, they became available again. So I guess in this case the failed node was failing for a short while, and then came back as active. The problem seems to be that the services that were running on the failed node were lost in the routing when it was marked as failed, and were not restored properly when the node came back as active.
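
A slightly less manual recovery step that is sometimes enough in this situation is to force a redeploy of the affected service, which recreates its tasks and re-registers their addresses; the service name is a placeholder:

docker service update --force <service-name>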

afrazkhan commented 7 years ago

Anyone looking into this? We've migrated to swarm fully, and are now slowly rolling back piece by piece. We can't run our platform like this, since as a result of the above symptoms we're serving back 500s every so often.

cpuguy83 commented 7 years ago

@afrazkhan Do you have some more details? What's serving the 500 and why?

cpuguy83 commented 7 years ago

Is it just the backends not resolving and the frontends returning the 500?

vovimayhem commented 7 years ago

@sanimej: Looks like @Nossnevs found a big lead on what's going on at https://github.com/moby/moby/issues/32195#issuecomment-305457491

afrazkhan commented 7 years ago

Thanks @vovimayhem, that sounds exactly like it, though I haven't checked in the same way he has. I'll make my voice heard there too.

sanimej commented 7 years ago

@Nossnevs Earlier you mentioned that

Ubuntu runs its daily cron at 06:25, and this causes our Docker swarm cluster to lose connectivity between the nodes.

Is it possible to stop this cron service and observe if the intermittent connection loss still happens?

When there is high CPU or heavy IO load in the host it can lead to temporary issues with the gossip communication between nodes.
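
One mitigation that is sometimes suggested for that situation is to reserve and cap resources per service so the engine and its gossip traffic keep some headroom; a sketch with illustrative values and a placeholder image:

docker service create --name db-master \
  --reserve-cpu 0.5 --limit-cpu 0.8 \
  --reserve-memory 512m --limit-memory 1g \
  <mysql-image>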

jgranstrom commented 7 years ago

@sanimej the same symptoms for us seem to be related to flaky network connectivity between the nodes. We are running on DigitalOcean and some of our tests are showing one-off timeouts and refused connections between the nodes themselves at random times of the day.

How much is this expected to affect the gossip communication? Currently it seems that when we are "unlucky" and get some dropped packets while Docker is doing some sensitive communication, one or more nodes are marked as failed, and it takes a long time to recover even though the network outage does not last for more than one request or so.

Mainly when this happens it seems that the routing / ARP entries get mixed up or lost and are not recovered. It can in some cases get back in business after an hour or more, but usually it requires restarting one or, often, all of the docker daemons. The containers are still running and the nodes can communicate, but some containers fail to communicate with other containers.

sanimej commented 7 years ago

@jgranstrom Thanks for all the details you provided earlier. It does look like there are different symptoms.

Are there a lot of service creates/updates happening while the DO network has intermittent connectivity problems? Or does the service discovery and connectivity issue happen even without many service create/update operations?

The service discovery issue you noticed could be the same as https://github.com/moby/moby/issues/30134. We are working on a fix for this.

jgranstrom commented 7 years ago

@sanimej this happens even when there are no create/update operations going on at all; I can't see any correlation between management operations and the issues themselves. However, it might be a chain reaction of different events. When we get connectivity problems, one or more nodes are marked as failed due to timeouts or refuted messages reported in the memberlist logs by other nodes. When this happens we usually get some really unpredictable behavior. Some tasks on the failing node (it can be none or just a few of them) start to move to other nodes, and as that goes on we start to get service discovery and connectivity problems between containers. So the actual service discovery and container connectivity issues might be related to the tasks being redistributed.

As I mentioned earlier, however, this seems to mostly be because some tasks stay on the failing node, since the node comes back as active pretty much right away (it is in active state when checking with docker node ls), probably because the connectivity problems between nodes only last for a very short time. The tasks on the recently failed node are all running and happy according to Docker; the manager sees them as running and it looks fine when checking them with docker service ps. As this happens, however, it seems like the routing becomes inconsistent for some services and never really recovers. It might be a timing-related issue when a node goes down and comes back up within a very short amount of time, so that it comes back up while Docker is already trying to redistribute its tasks onto other nodes. I guess this could be tested by disabling the network interface on a worker node and re-enabling it as soon as the manager notices that the node has failed.

I totally understand that this stuff is probably challenging to debug and get to the root of, especially when there seem to be a lot of semi-related symptoms mentioned across a lot of different issues, and no one has really been able to reproduce them fully on demand. However, there seem to be one or more fundamental problems behind all of this that put swarm in a state where it can't be used with confidence for production environments without expecting some downtime and resource overhead to keep it up and running, even forcing people to move off swarm, as mentioned in some issues. If there is anything I can help you with, such as getting more logs or detailed information about these symptoms when they occur, I am very happy to help get this sorted out. If you can give me some pointers on what kind of details or information would be helpful in addition to what I have found so far, I will report back everything I can manage to get hold of.

Nossnevs commented 7 years ago

@sanimej we have moved it so that it runs at different times on different nodes. This cron job is standard in Ubuntu server and will run everything in /etc/cron.daily at 06:25.
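
For reference, the stock Ubuntu /etc/crontab entry being referred to looks roughly like this (the exact line can differ per release):

25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )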

When we moved the script we did not see any improvement. We have also spread backups and other scheduled jobs evenly over the night. We do not have that many containers now, because we moved the production ones outside swarm, so our environment is not the same as before. Will the fix in https://github.com/moby/moby/issues/32195 be relevant here too?

afrazkhan commented 7 years ago

Seems like Docker devs see this as resolved in a recent commit:

Other thread I've been following (basically, exactly this problem): https://github.com/moby/moby/issues/32195#event-1114831955

PR which supposedly fixes these issues: https://github.com/moby/moby/pull/33555

jgranstrom commented 7 years ago

@sanimej, @fcrisciani I kept digging and I have some additional details about what happens during the connectivity issues. This is an actual scenario in which at least one service goes into the weird unstable connectivity mode similar to what many open issues mention, such as #32195.

Again I have three nodes. Many services are running; here I am only inspecting one affected target service. There is a global load balancer running on each of the three machines. The target service is running on the same node as load balancer 2. The majority of the services experience no problems during this, but 5-10% of the running services are hit by the same problem.

Here are some results from digging around during the connectivity issues.

Target service

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2094: eth0@if2095: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:1b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.27/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.174/32 scope global eth0
       valid_lft forever preferred_lft forever
2096: eth1@if2097: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:12:00:10 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.16/16 scope global eth1
       valid_lft forever preferred_lft forever

Note the additional M-DOWN status; I'm not sure about that one really, but I have found it to be correlated with the target containers that are involved in connectivity issues.

From all load balancers, DNS successfully resolves the target service to the IP 10.0.0.174.

arp table extract

IP address       HW type     Flags       HW address            Mask     Device
...
10.0.0.9         0x1         0x2         02:42:0a:00:00:09     *        eth0 <- load balancer 2
10.0.0.43        0x1         0x2         02:42:0a:00:00:2b     *        eth0 <- load balancer 3
10.0.0.115       0x1         0x2         02:42:0a:00:00:73     *        eth0 <- load balancer 1
...

The target service can reach all load balancers on both overlay IPs.

Load balancer 1:

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
183: eth2@if184: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:08 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 10.255.0.8/16 scope global eth2
       valid_lft forever preferred_lft forever
    inet 10.255.0.9/32 scope global eth2
       valid_lft forever preferred_lft forever
185: eth1@if186: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.5/16 scope global eth1
       valid_lft forever preferred_lft forever
187: eth0@if188: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:73 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.115/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.32/32 scope global eth0
       valid_lft forever preferred_lft forever

arp table extract

...
10.0.0.27        0x1         0x2         02:42:0a:00:00:1b     *        eth0 <- target service
...

Only one of the target service IPs is in the table.

LB1 can reach (ping and curl) the target service on both 10.0.0.174 and 10.0.0.27, and by service name.

ARPING 10.0.0.27
42 bytes from 02:42:0a:00:00:1b (10.0.0.27): index=0 time=10.432 msec

ARPING 10.0.0.174
Timeout

LB1 can successfully arping one of the IPs of the target service.

Load balancer 2

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
62: eth2@if63: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:06 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 10.255.0.6/16 scope global eth2
       valid_lft forever preferred_lft forever
    inet 10.255.0.9/32 scope global eth2
       valid_lft forever preferred_lft forever
64: eth1@if65: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.5/16 scope global eth1
       valid_lft forever preferred_lft forever
66: eth0@if67: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:09 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.9/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.32/32 scope global eth0
       valid_lft forever preferred_lft forever

arp table extract

IP address       HW type     Flags       HW address            Mask     Device
...
10.0.0.27        0x1         0x2         02:42:0a:00:00:1b     *        eth0 <- target service
...

Only one of the target service IPs is in the table.

LB2 can reach (ping and curl) the target service on both 10.0.0.174 and 10.0.0.27, and by service name.

ARPING 10.0.0.27
42 bytes from 02:42:0a:00:00:1b (10.0.0.27): index=0 time=14.801 msec

ARPING 10.0.0.174
42 bytes from 02:42:0a:00:00:1b (10.0.0.174): index=0 time=9.971 msec

LB2 can successfully arping both of the target service IPs.

Load balancer 3

This one is taking the measurable hit, with connectivity issues towards the target service: it fails to reach it completely for a certain time (20-60 minutes or more), together with flaky connectivity (it reaches the service a few times, then fails for a while again, etc.).

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
169: eth2@if170: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:0a brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 10.255.0.10/16 scope global eth2
       valid_lft forever preferred_lft forever
    inet 10.255.0.9/32 scope global eth2
       valid_lft forever preferred_lft forever
171: eth1@if172: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:08 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.8/16 scope global eth1
       valid_lft forever preferred_lft forever
173: eth0@if174: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:2b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.43/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.32/32 scope global eth0
       valid_lft forever preferred_lft forever

arp table extract

...
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.27        0x1         0x2         02:42:0a:00:00:1b     *        eth0 <-- target service
10.0.0.174       0x1         0x0         00:00:00:00:00:00     *        eth0 <-- target service
...

Both of the target service IPs are in the table; one of them has failed to resolve.

LB3 sometimes fails to ping or curl the target service by service name and IP.

ARPING 10.0.0.27
Timeout
...
ARPING 10.0.0.27
42 bytes from 02:42:0a:00:00:1b (10.0.0.27): index=0 time=9.592 msec

ARPING 10.0.0.174
Timeout

LB3 can sometimes successfully arping one of the IPs of the target service.

tcpdump during failed arping attempt

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
ARP, Request who-has 10.0.0.27 tell 10.0.0.43, length 28

No arp response is logged for that request from anywhere.

LB3 arp table when it first started to not reach the target service

IP address       HW type     Flags       HW address            Mask     Device
10.0.0.115       0x1         0x2         02:42:0a:00:00:73     *        eth0
10.0.0.43        0x1         0x2         02:42:0a:00:00:2b     *        eth0
10.0.0.9         0x1         0x2         02:42:0a:00:00:09     *        eth0

No trace of the target service.

After that, trying to reach it by IP directly with curl 10.0.0.174

Results in error

curl: (7) Failed to connect to 10.0.0.174 port 80: No route to host

arp table after curl attempt

IP address       HW type     Flags       HW address            Mask     Device
10.0.0.115       0x1         0x2         02:42:0a:00:00:73     *        eth0
10.0.0.43        0x1         0x2         02:42:0a:00:00:2b     *        eth0
10.0.0.27        0x1         0x0         00:00:00:00:00:00     *        eth0 <-- target service added with its other ip but failed to resolve mac
10.0.0.9         0x1         0x2         02:42:0a:00:00:09     *        eth0

In addition, the ARP response times seem to be a bit flaky; example:

42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=0 time=13.015 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=1 time=3.783 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=2 time=10.842 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=3 time=13.786 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=4 time=8.350 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=5 time=10.978 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=6 time=1.555 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=7 time=8.190 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=8 time=6.216 msec
42 bytes from 02:42:0a:00:00:73 (10.0.0.115): index=9 time=12.878 msec

I am wondering if this can be related to something like #33789? I am running on DigitalOcean, and network congestion, especially for UDP, is definitely a thing, but there has to be a way around that since it is the reality of many production environments.

jgranstrom commented 7 years ago

@sanimej I upgraded to 17.06 and I'm seeing a lot of node join events being repeated for the same nodes in the logs.

Jul  4 10:59:51 dockerd[1699]: time="2017-07-04T10:59:51.610620616Z" level=info msg="Node join event for marvin"
Jul  4 11:00:51 dockerd[1699]: time="2017-07-04T11:00:51.615033028Z" level=info msg="Node join event for marvin"
Jul  4 11:01:01 dockerd[1699]: time="2017-07-04T11:01:01.333260909Z" level=info msg="Node join event for marvin"
Jul  4 11:01:21 dockerd[1699]: time="2017-07-04T11:01:21.616786835Z" level=info msg="Node join event for marvin"
Jul  4 11:01:31 dockerd[1699]: time="2017-07-04T11:01:31.335323308Z" level=info msg="Node join event for marvin"
Jul  4 11:02:21 dockerd[1699]: time="2017-07-04T11:02:21.621289285Z" level=info msg="Node join event for marvin"
Jul  4 11:03:01 dockerd[1699]: time="2017-07-04T11:03:01.341312801Z" level=info msg="Node join event for marvin"

Is this expected, or a sign of some connectivity issues?

sanimej commented 7 years ago

@jgranstrom This doesn't indicate any connectivity issue; it should have been a debug-level message. An issue has been filed for this: https://github.com/moby/moby/issues/33962#issuecomment-313163859

mo22 commented 7 years ago

Similar problem here; it occurs under heavy IO load, both with integrated and external storage.

2017-07-12T12:48:49.941714411Z memberlist: Failed fallback ping: read tcp: i/o timeout
2017-07-12T12:48:49.941893525Z memberlist: Suspect xxx-79599d82f216 has failed, no acks received
2017-07-12T12:48:52.471544290Z Node join event for srv067171.nue2.bigpoint.net-79599d82f216/10.72.67.171
2017-07-12T12:48:52.847381111Z memberlist: Was able to connect to xxx but other probes failed, network may be miscon

This occurred on a 2-node setup with lots of CPU (80 cores / 256 GB RAM / 10 Gbit network interconnect) while only disk IO was at 100%. I haven't checked whether consul waits for disk IO and what the timeout is; it would be nice to be able to increase the timeout to avoid this.

We are now starting the containers manually on each node, still using swarm networking / macvlan, to avoid having the containers killed on timeout+rejoin events.

thaJeztah commented 6 years ago

With https://github.com/docker/libnetwork/pull/1944, https://github.com/docker/libnetwork/pull/2013, and https://github.com/docker/libnetwork/pull/2047, these problems should be resolved.

If you're still having issues on a current version of Docker, please open a new ticket, as they will probably have a different cause.