Consul is a distributed, highly available, and datacenter-aware solution for connecting and configuring applications across dynamic, distributed infrastructure.
We have a fairly minimal Consul installation in an OpenShift 4.12 cluster with OpenShiftSDN, using Helm chart version 1.1.2 with the following values.yaml:
The goal was only to use Consul's KV store to back a Grafana installation.
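The actual values.yaml from this installation is not reproduced above. Purely as a hypothetical sketch (not the real file), a minimal server-only deployment of the hashicorp/consul chart on OpenShift might look like:

```yaml
# Hypothetical minimal values.yaml -- NOT the file from this installation.
# Field names follow the hashicorp/consul Helm chart conventions.
global:
  name: consul
  openshift:
    enabled: true       # enable the chart's OpenShift (SCC) support
server:
  replicas: 3           # three servers, as seen in the logs below
connectInject:
  enabled: false        # no service mesh; only the KV store is used
```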
The Consul UI shows the nodes as healthy, but all pods log warnings and errors about TCP timeouts. All pods run on different worker nodes. CPU and memory usage of all Consul pods is fine and nowhere near the limits. Communication seems fine when running consul rtt. All pods are in the same namespace and there are no NetworkPolicies.
Logs from the cluster:
consul-consul-server-0
2023-06-21T13:29:05.260Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60374->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:05.260Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:06.261Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:06.657Z [WARN] agent.server.raft: failed to contact: server-id=f5cac7ba-9d7e-4f76-623c-a8b7df5a1698 time=2.500768279s
2023-06-21T13:29:07.259Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60388->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:07.259Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
consul-consul-server-1
2023-06-21T13:29:07.385Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: consul-consul-server-0)
2023-06-21T13:29:29.585Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-2 has failed, no acks received
consul-consul-server-1
2023-06-21T13:29:06.154Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.140.2.16:60382->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:06.154Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:07.154Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:29.155Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:29.785Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: consul-consul-server-1)
Running consul rtt on consul-consul-server-0:
~ $ consul rtt consul-consul-server-0
Estimated consul-consul-server-0 <-> consul-consul-server-0 rtt: 0.020 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-1
Estimated consul-consul-server-1 <-> consul-consul-server-0 rtt: 0.257 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-2
Estimated consul-consul-server-2 <-> consul-consul-server-0 rtt: 0.618 ms (using LAN coordinates)
Running consul rtt on consul-consul-server-1:
~ $ consul rtt consul-consul-server-1
Estimated consul-consul-server-1 <-> consul-consul-server-1 rtt: 1.078 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-0
Estimated consul-consul-server-0 <-> consul-consul-server-1 rtt: 1.884 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-2
Estimated consul-consul-server-2 <-> consul-consul-server-1 rtt: 1.444 ms (using LAN coordinates)
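Note that consul rtt only reports network-coordinate estimates derived from gossip, while Serf's LAN gossip on port 8301 uses both UDP and TCP, so rtt can look fine while individual probes are being dropped. A quick sanity check is a raw TCP connect between pods on 8301; a sketch in Python (the local listener below only demonstrates the function; in the cluster you would run the probe from inside one server pod against another pod's IP, e.g. 10.141.4.19):

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstration against a throwaway local listener; in the cluster you would
# call e.g. tcp_probe("10.141.4.19", 8301) from consul-consul-server-0.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # bind to any free port
listener.listen(1)
host, port = listener.getsockname()
print(tcp_probe(host, port))      # listener present: connect succeeds
listener.close()
print(tcp_probe("127.0.0.1", 1))  # nothing listening: probe fails
```

A successful TCP connect still does not prove UDP 8301 is reachable; UDP drops are silent, which is exactly why memberlist falls back to the TCP ping seen failing in the logs above.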