UDP memberlist checks always timing out

ghost commented 8 years ago

`consul version` for both Client and Server

Client: Consul v0.6.4 Consul Protocol: 3 (Understands back to: 1) Server: Consul v0.6.4 Consul Protocol: 3 (Understands back to: 1)

`consul info` for both Client and Server

Client:

agent:
    check_monitors = 0
    check_ttls = 1
    checks = 1
    services = 1
build:
    prerelease =
    revision = 26a0ef8c
    version = 0.6.4
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 35
    max_procs = 1
    os = linux
    version = go1.6
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 178
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 2845
    members = 18
    query_queue = 0
    query_time = 1

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease =
    revision = 26a0ef8c
    version = 0.6.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    server = true
raft:
    applied_index = 178538
    commit_index = 178538
    fsm_pending = 0
    last_contact = never
    last_log_index = 178538
    last_log_term = 1
    last_snapshot_index = 172464
    last_snapshot_term = 1
    num_peers = 2
    state = Leader
    term = 1
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 121
    max_procs = 2
    os = linux
    version = go1.6
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 178
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 2845
    members = 18
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Ubuntu 14.04 3 node cluster in an AWS VPC across 3 availability zones

Description of the Issue (and unexpected/desired result)

Consul constantly reports warnings of the following form for all nodes (clients and servers):

2016/07/20 22:07:04 [DEBUG] memberlist: Failed UDP ping: placeholder-node-name (timeout reached)
2016/07/20 22:07:05 [WARN] memberlist: Was able to reach placeholder-node-name via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP

Here placeholder-node-name ends up being every node but itself in the logs.

There are around 10 member nodes, so it doesn't seem clients is the issue, and in two other VPC's we are running nearly identical consul clusters with no warnings. Hard networking issues (connectivity, etc.) were investigated in conjunction with AWS and returned no obvious issues, so the thought here is the issue is on the Consul side of things.

I have come across these issues, and realize this is somewhat of a duplicate: https://github.com/hashicorp/consul/issues/916 https://github.com/hashicorp/consul/issues/2152

We are running some t2 instances, but removing those from the Consul registry did not fix the problem. I know 0.7.0 will include additional logging around node probing, but as this cluster is used for discovery of some of our core infrastructure, I'd rather Consul not be logging constant warnings. Do you have any other recommendations for addressing/investigating this issue?

There have been no observed issues in terms of performance.

Reproduction steps

The entire Consul cluster (all server machines) have been recycled yet still show the same behavior.

slackpad commented 8 years ago

Hi @mjcdiggity it sounds like you probably checked this, but are there any firewalls or network ACLs that could be at play here?

ghost commented 8 years ago

@slackpad so we did some thorough network related testing and were able to confirm that all nodes are able to communicate over UDP on port 8301 (we are using default ports). We made sure to investigate those avenues before opening up an issue.

The one thing that did show up was that the consul servers did not seem to be probing each other (via UDP on 8301). This behavior differed from our other clusters in that, while not that often, that traffic was being sent. While that is not definitive (those packets might be tried at some point either before or after we were monitoring), no UDP send or receives showed up between the consul servers themselves (monitoring done via tcpdump).

slackpad commented 8 years ago

@mjcdiggity for normal operation you'd want TCP and UDP on 8301, but that doesn't look like it's the issue looking at the logs. The servers not talking might be normal depending on the size of your cluster because agents will randomly choose another node to probe.

If the network looks good, the other thing we've seen cause issues is CPU starvation, especially on low end AWS instances. These often start shedding packets when things get busy.

ghost commented 8 years ago

@slackpad thanks for the info, will look into this further.

SunSparc commented 8 years ago

@mjcdiggity, did you ever find out what the problem was?

We are experiencing the same thing with a lot of our nodes. Here is an extract from the logs:

Oct 31 13:44:22 consul-server-01 consul:  [WARN] memberlist: Was able to reach app-node-06 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 15:13:02 app-node-01 consul:  [WARN] memberlist: Was able to reach app-node-06 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 15:32:58 app-node-02 consul:  [WARN] memberlist: Was able to reach app-node-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 16:32:45 app-node-03 consul:  [WARN] memberlist: Was able to reach app-node-06 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 16:45:14 consul-server-02 consul:  [WARN] memberlist: Was able to reach app-node-08 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 17:54:51 app-node-05 consul:  [WARN] memberlist: Was able to reach app-node-02 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:08:10 app-node-02 consul:  [WARN] memberlist: Was able to reach app-node-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:21:27 app-node-02 consul:  [WARN] memberlist: Was able to reach app-node-01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:23:05 app-node-05 consul:  [WARN] memberlist: Was able to reach app-node-01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:39:42 app-node-06 consul:  [WARN] memberlist: Was able to reach app-node-01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:46:21 app-node-07 consul:  [WARN] memberlist: Was able to reach app-node-05 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 18:52:29 app-node-02 consul:  [WARN] memberlist: Was able to reach app-node-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 19:06:27 app-node-06 consul:  [WARN] memberlist: Was able to reach app-node-07 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 19:35:46 app-node-06 consul:  [WARN] memberlist: Was able to reach consul-server-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 20:05:02 app-node-01 consul:  [WARN] memberlist: Was able to reach app-node-05 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 20:17:32 app-node-07 consul:  [WARN] memberlist: Was able to reach consul-server-04 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 20:27:12 app-node-01 consul:  [WARN] memberlist: Was able to reach consul-server-01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 21:30:42 app-node-07 consul:  [WARN] memberlist: Was able to reach consul-server-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 21:40:57 app-node-05 consul:  [WARN] memberlist: Was able to reach app-node-02 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Oct 31 21:46:26 app-node-01 consul:  [WARN] memberlist: Was able to reach app-node-03 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP

It does not seem to be consistent. If it was a firewall issue I would expect each node having problems to have the same problems consistently. None of these nodes is under much load.

ghost commented 8 years ago

@SunSparc Do not have a solution yet. The hope was that Consul 0.7.x would fix things but I have not gotten around to that rollout yet. I did recently notice a configuration error on our end where all nodes are initially connecting to and querying a single consul server for dns, but I don't have any traffic numbers around that. All instances seem to be under very little load, so yeah still confused. Again, I'm hopeful consul 0.7.x fixes/exposes the issue, or that our eventual configuration update to use all nodes in our cluster will cause these errors to go away.

SunSparc commented 8 years ago

We are using 0.7.0-dev. We have consul currently implemented on about 100 nodes, across 4 datacenters. We have ufw (aka iptables) running on all the nodes with exceptions for the necessary ports and protocols:

Default: deny (incoming), allow (outgoing), disabled (routed)

To                         Action      From
--                         ------      ----
8301,8302/tcp (Consul Agent) on zt0 ALLOW IN    Anywhere
8301,8302/udp (Consul Agent) on zt0 ALLOW IN    Anywhere
8300/tcp (Consul Server) on zt0 ALLOW IN    Anywhere
8500/tcp (Consul HTTP) on zt0 ALLOW IN    Anywhere

However, as a test, we disabled ufw on the 12 servers and continued to watch the logs. The UDP warnings continued on memberlist communication between the servers.

For us, the warnings are not constant, but they are persistent. We have decided that they are likely occasional network hiccups. Transient timeouts, dropped packets, whatever. I open consul monitor -log-level=debug and see entries like this:

[DEBUG] memberlist: Failed UDP ping: app-node-01 (timeout reached)

I am not entirely sure that the problem is with consul. That is just where I see the message. It could be something network related. Or perhaps consul is just overly sensitive? These are just warning, after all.

SunSparc commented 8 years ago

In case this can be of any help, here are some log entries surrounding two events.

----
Source Log:
server1-node-002   172.28.96.30:8301    alive   server  0.7.0  2   datacenter2
----
Source Log Event:
2016/11/02 21:07:54 [DEBUG] memberlist: Failed UDP ping: app3-node-003 (timeout reached)
2016/11/02 21:07:54 [WARN] memberlist: Was able to reach app3-node-003 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
----
Destination Log (app3-node-003):
2016/11/02 21:07:17 [DEBUG] memberlist: Potential blocking operation. Last command took 13.016024ms
2016/11/02 21:07:19 [DEBUG] memberlist: TCP connection from=172.28.124.225:31811
2016/11/02 21:07:21 [DEBUG] memberlist: TCP connection from=172.28.211.169:13501
2016/11/02 21:07:28 [DEBUG] memberlist: Failed UDP ping: app3-node-001 (timeout reached)
2016/11/02 21:07:29 [DEBUG] memberlist: TCP connection from=172.28.71.63:55377
2016/11/02 21:07:31 [DEBUG] memberlist: Potential blocking operation. Last command took 10.157036ms
2016/11/02 21:07:36 [DEBUG] memberlist: Failed UDP ping: app5-node-002 (timeout reached)
2016/11/02 21:07:42 [DEBUG] memberlist: Failed UDP ping: app3-node-002 (timeout reached)
2016/11/02 21:07:43 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.100.237:8301
2016/11/02 21:07:43 [DEBUG] memberlist: TCP connection from=172.28.237.180:22077
2016/11/02 21:07:52 [DEBUG] manager: Rebalanced 3 servers, next active server is server1-node-003 (Addr: tcp/172.28.140.12:8300) (DC: datacenter2)
2016/11/02 21:07:54 [DEBUG] memberlist: TCP connection from=172.28.96.30:47710
2016/11/02 21:07:59 [DEBUG] memberlist: TCP connection from=172.28.120.37:31817
2016/11/02 21:08:06 [DEBUG] memberlist: Potential blocking operation. Last command took 13.744667ms
2016/11/02 21:08:08 [DEBUG] memberlist: Failed UDP ping: app5-node-002 (timeout reached)
2016/11/02 21:08:13 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.42.160:8301
2016/11/02 21:08:16 [DEBUG] memberlist: TCP connection from=172.28.178.145:33345
2016/11/02 21:08:22 [DEBUG] memberlist: TCP connection from=172.28.61.31:52617
2016/11/02 21:08:28 [DEBUG] memberlist: Failed UDP ping: app5-node-002 (timeout reached)
2016/11/02 21:08:31 [DEBUG] memberlist: Failed UDP ping: app4-node-002 (timeout reached)
2016/11/02 21:08:43 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.201.215:8301
2016/11/02 21:08:53 [DEBUG] memberlist: TCP connection from=172.28.49.181:43939

#######################

------
Source Log:
server1-node-003   172.28.140.12:8301   alive   server  0.7.0  2   datacenter2
------
Source Log Event:
2016/11/02 21:25:02 [DEBUG] memberlist: Failed UDP ping: app3-node-003 (timeout reached)
2016/11/02 21:25:02 [WARN] memberlist: Was able to reach app3-node-003 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
------
Destination Log (app3-node-003):
2016/11/02 21:24:06 [DEBUG] serf: forgoing reconnect for random throttling
2016/11/02 21:24:16 [DEBUG] memberlist: TCP connection from=172.28.151.3:65531
2016/11/02 21:24:18 [DEBUG] memberlist: Failed UDP ping: app2-node-005 (timeout reached)
2016/11/02 21:24:19 [DEBUG] memberlist: TCP connection from=172.28.70.139:57725
2016/11/02 21:24:23 [DEBUG] memberlist: Failed UDP ping: app2-node-002 (timeout reached)
2016/11/02 21:24:24 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.193.120:8301
2016/11/02 21:24:28 [DEBUG] memberlist: TCP connection from=172.28.196.61:16491
2016/11/02 21:24:30 [DEBUG] memberlist: TCP connection from=172.28.120.37:31925
2016/11/02 21:24:34 [DEBUG] memberlist: TCP connection from=172.28.61.31:52743
2016/11/02 21:24:35 [DEBUG] memberlist: Failed UDP ping: app3-node-002 (timeout reached)
2016/11/02 21:24:36 [DEBUG] serf: forgoing reconnect for random throttling
2016/11/02 21:24:37 [DEBUG] memberlist: TCP connection from=172.28.178.145:33453
2016/11/02 21:24:50 [DEBUG] memberlist: TCP connection from=172.28.11.232:57661
2016/11/02 21:24:51 [DEBUG] memberlist: Failed UDP ping: app2-node-002 (timeout reached)
2016/11/02 21:24:52 [DEBUG] memberlist: TCP connection from=172.28.151.3:1035
2016/11/02 21:24:53 [DEBUG] memberlist: TCP connection from=172.28.167.88:30977
2016/11/02 21:24:54 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.70.139:8301
2016/11/02 21:24:59 [DEBUG] memberlist: TCP connection from=172.28.237.180:22193
2016/11/02 21:25:02 [DEBUG] memberlist: TCP connection from=172.28.140.12:50220
2016/11/02 21:25:03 [DEBUG] memberlist: Failed UDP ping: app1-node-01 (timeout reached)
2016/11/02 21:25:06 [DEBUG] serf: forgoing reconnect for random throttling
2016/11/02 21:25:10 [DEBUG] memberlist: TCP connection from=172.28.3.166:55265
2016/11/02 21:25:11 [DEBUG] memberlist: Failed UDP ping: app2-node-008 (timeout reached)
2016/11/02 21:25:17 [DEBUG] memberlist: Failed UDP ping: server1-node-001 (timeout reached)
2016/11/02 21:25:18 [DEBUG] agent: Service 'app3-node-003-private' in sync
2016/11/02 21:25:18 [DEBUG] agent: Service 'app3-node-003-public' in sync
2016/11/02 21:25:18 [DEBUG] agent: Node info in sync
2016/11/02 21:25:20 [DEBUG] memberlist: TCP connection from=172.28.70.139:8149
2016/11/02 21:25:24 [DEBUG] memberlist: Failed UDP ping: app2-node-005 (timeout reached)
2016/11/02 21:25:24 [DEBUG] memberlist: TCP connection from=172.28.189.139:12731
2016/11/02 21:25:25 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.41.20:8301
2016/11/02 21:25:28 [DEBUG] memberlist: Failed UDP ping: app3-node-001 (timeout reached)
2016/11/02 21:25:28 [WARN] memberlist: Was able to reach app3-node-001 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
2016/11/02 21:25:29 [DEBUG] memberlist: Failed UDP ping: app4-node-002 (timeout reached)
2016/11/02 21:25:30 [DEBUG] memberlist: TCP connection from=172.28.144.253:26957
2016/11/02 21:25:31 [DEBUG] memberlist: TCP connection from=172.28.140.12:50232
2016/11/02 21:25:33 [DEBUG] memberlist: TCP connection from=172.28.61.31:52753
2016/11/02 21:25:34 [DEBUG] memberlist: TCP connection from=172.28.49.181:44061
2016/11/02 21:25:36 [DEBUG] serf: forgoing reconnect for random throttling
2016/11/02 21:25:41 [DEBUG] memberlist: Failed UDP ping: app5-node-002 (timeout reached)
2016/11/02 21:25:41 [INFO] serf: EventMemberJoin: app2-node-006 172.28.124.225
2016/11/02 21:25:43 [DEBUG] memberlist: Failed UDP ping: app5-node-001 (timeout reached)
2016/11/02 21:25:47 [DEBUG] memberlist: Failed UDP ping: app1-node-01 (timeout reached)
2016/11/02 21:25:55 [DEBUG] memberlist: Failed UDP ping: app5-node-003 (timeout reached)
2016/11/02 21:25:56 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.189.139:8301
2016/11/02 21:25:58 [DEBUG] memberlist: TCP connection from=172.28.178.145:33467
2016/11/02 21:26:10 [DEBUG] memberlist: Failed UDP ping: app6-node-002 (timeout reached)
2016/11/02 21:26:12 [DEBUG] serf: messageJoinType: app2-node-006
2016/11/02 21:26:12 [DEBUG] serf: messageJoinType: app2-node-006
2016/11/02 21:26:16 [DEBUG] memberlist: Failed UDP ping: app2-node-007 (timeout reached)
2016/11/02 21:26:22 [DEBUG] memberlist: Failed UDP ping: app2-node-005 (timeout reached)
2016/11/02 21:26:26 [DEBUG] memberlist: Initiating push/pull sync with: 172.28.3.166:8301
2016/11/02 21:26:29 [DEBUG] memberlist: Failed UDP ping: app6-node-002 (timeout reached)
2016/11/02 21:26:32 [DEBUG] memberlist: TCP connection from=172.28.201.215:50126
2016/11/02 21:26:34 [DEBUG] memberlist: TCP connection from=172.28.124.225:5257
2016/11/02 21:26:35 [DEBUG] memberlist: TCP connection from=172.28.178.145:33473
2016/11/02 21:26:41 [DEBUG] manager: Rebalanced 3 servers, next active server is server1-node-001 (Addr: tcp/172.28.240.49:8300) (DC: datacenter2)
2016/11/02 21:26:49 [DEBUG] memberlist: Failed UDP ping: server1-node-001 (timeout reached)
2016/11/02 21:26:53 [DEBUG] agent: Service 'app3-node-003-private' in sync
2016/11/02 21:26:53 [DEBUG] agent: Service 'app3-node-003-public' in sync
2016/11/02 21:26:53 [DEBUG] agent: Node info in sync

SunSparc commented 7 years ago

All my consul machines are on 0.7.1 and we are still seeing frequent UDP complaints.

rfay commented 7 years ago

Despite my note, upgrading to 0.7.1 doesn't fix this at all. In fact (on kubernetes) I've had to reboot the node that it's running on to solve. I'm quite certain this is https://github.com/docker/docker/issues/8795 - but it could be solved/patched/bandaided several places in the toolchain.

Jwpe commented 7 years ago

I am seeing this issue on my Consul cluster using v0.6.4 for servers and v0.7.0 for clients. I've used ncat to confirm that both server and client are configured to receive UDP messages.

For example, sending an UDP packet from server to client:

# Server
echo 'bla' | ncat -v -u  {client_ip} 8301
Ncat: Version 7.12 ( https://nmap.org/ncat )
Ncat: Connected to {client_ip}:8301.
Ncat: 4 bytes sent, 0 bytes received in 0.00 seconds.

# Client
consul_1           |     2017/01/20 12:55:57 [ERR] memberlist: UDP msg type (98) not supported from={server_ip}:50041

So UDP packets are able to get through from server to client. I performed the procedure in reverse to determine that packets can get from client to server as well. However, my server logs are still full of:

2017/01/20 12:59:51 [WARN] memberlist: Was able to reach {client_name} via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
2017/01/20 12:59:53 [DEBUG] memberlist: Failed UDP ping: {client_name} (timeout reached)

Happy to provide consul info outputs if needs be.

sergeycherepanov commented 7 years ago

Hi, issue still present in 0.7.4, on GKE

any ideas how to fix this?

breezzz commented 7 years ago

Hi! issue also in

Consul v0.7.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

rfay commented 7 years ago

We're still living with this on Kubernetes 1.5.4. There's apparently a workaround introduced in Kubernetes 1.6, and the underlying docker issue doesn't have a resolution yet.

However, the warning doesn't seem to cause any great trouble beyond filling the logs with UDP complaints, and the TCP backup technique works.

Does anybody have any suggestions on how to get consul to emit less complaints? Is there a way to tell it "just go ahead and use TCP"?

rfay commented 7 years ago

Docker claims to have solved this in docker/docker#32505 - hopefully that will work its way through the system before too long.

slackpad commented 7 years ago

Going to close this out as these issues are related to network configs / firewalls or Docker, which hopefully is fixed per the above. If folks are still seeing issues we can reopen this.

kaskavalci commented 7 years ago

Experienced this issue on Kubernetes after restarting pods. Even though https://github.com/moby/moby/pull/32505 is merged, docker release is not ready for Kubernetes: https://github.com/kubernetes/kubernetes/issues/40182

@slackpad can you recommend a workaround for this?

hashicorp / consul