hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

"Failed fallback TCP ping" and "Suspect consul-consul-server-1 has failed" in simple OpenShift 4.12 installation with #17829

Open bo0ts opened 1 year ago

bo0ts commented 1 year ago

We have a fairly minimal Consul installation in an OpenShift 4.12 cluster with OpenShiftSDN, deployed with Helm chart version 1.1.2 and the following values.yaml:

global:
  enabled: false
  logLevel: "warn"
  datacenter: central2

  gossipEncryption:
    autoGenerate: true

  openshift:
    enabled: true
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: consul-bootstrap-token
      secretKey: token

server:
  enabled: true
  replicas: 3
  storage: 5Gi
  disruptionBudget:
    enabled: false
  storageClass: default

  connect: false

  serviceAccount:
    create: false
    name: default

  exposeService:
    enabled: false

client:
  enabled: false

connectInject:
  enabled: false

dns:
  enabled: false

ui:
  enabled: true
  service:
    enabled: true
    type: ClusterIP

The goal was to use only the Consul KV store to support a Grafana installation.

The Consul UI shows the nodes as healthy, but all pods log warnings and errors about TCP timeouts. All pods run on different worker nodes. CPU and memory usage of all Consul pods is fine and nowhere near the limits. Communication seems fine when running consul rtt. All pods are in the same namespace and there are no NetworkPolicies.

Logs from the cluster:

consul-consul-server-0

2023-06-21T13:29:05.260Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60374->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:05.260Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:06.261Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:06.657Z [WARN] agent.server.raft: failed to contact: server-id=f5cac7ba-9d7e-4f76-623c-a8b7df5a1698 time=2.500768279s
2023-06-21T13:29:07.259Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60388->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:07.259Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received

consul-consul-server-1

2023-06-21T13:29:07.385Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: consul-consul-server-0)
2023-06-21T13:29:29.585Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-2 has failed, no acks received

consul-consul-server-2

2023-06-21T13:29:06.154Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.140.2.16:60382->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:06.154Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:07.154Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:29.155Z [WARN] agent: error getting server health from server: server=consul-consul-server-1 error="context deadline exceeded"
2023-06-21T13:29:29.785Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: consul-consul-server-1)
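
To see at a glance which peer is being suspected and which connection times out, the log excerpts above can be tallied with a short script. This is a minimal sketch, not part of Consul: the names `summarize`, `TCP_PING_RE`, and `SUSPECT_RE` are hypothetical, and the patterns only cover the two memberlist messages quoted in this issue.

```python
import re
from collections import Counter

# Hypothetical patterns, written against the log lines quoted in this issue.
TCP_PING_RE = re.compile(
    r"Failed fallback TCP ping: .*?read tcp "
    r"(?P<src>[\d.]+):\d+->(?P<dst>[\d.]+):(?P<port>\d+)"
)
SUSPECT_RE = re.compile(r"Suspect (?P<node>\S+) has failed")


def summarize(log_text):
    """Count failed TCP pings per (src, dst) pair and suspicions per node."""
    ping_failures = Counter()
    suspects = Counter()
    for line in log_text.splitlines():
        ping = TCP_PING_RE.search(line)
        if ping:
            ping_failures[(ping.group("src"), ping.group("dst"))] += 1
        suspect = SUSPECT_RE.search(line)
        if suspect:
            suspects[suspect.group("node")] += 1
    return ping_failures, suspects


# Lines copied from the consul-consul-server-0 and consul-consul-server-1 logs.
SAMPLE = """\
2023-06-21T13:29:05.260Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60374->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:05.260Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:07.259Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.141.2.164:60388->10.141.4.19:8301: i/o timeout
2023-06-21T13:29:07.259Z [INFO] agent.server.memberlist.lan: memberlist: Suspect consul-consul-server-1 has failed, no acks received
2023-06-21T13:29:07.385Z [WARN] agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: consul-consul-server-0)
"""

if __name__ == "__main__":
    ping_failures, suspects = summarize(SAMPLE)
    for (src, dst), count in ping_failures.items():
        print(f"{src} -> {dst}: {count} failed TCP ping(s)")
    for node, count in suspects.items():
        print(f"{node}: suspected {count} time(s)")
```

For the sample lines, both counters report two events: two failed pings from 10.141.2.164 to 10.141.4.19 and two suspicions of consul-consul-server-1. The "Refuting" line is deliberately not counted as a suspicion.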

Running rtt on consul-consul-server-0:

~ $ consul rtt consul-consul-server-0
Estimated consul-consul-server-0 <-> consul-consul-server-0 rtt: 0.020 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-1
Estimated consul-consul-server-1 <-> consul-consul-server-0 rtt: 0.257 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-2
Estimated consul-consul-server-2 <-> consul-consul-server-0 rtt: 0.618 ms (using LAN coordinates)

Running rtt on consul-consul-server-1:

~ $ consul rtt consul-consul-server-1
Estimated consul-consul-server-1 <-> consul-consul-server-1 rtt: 1.078 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-0
Estimated consul-consul-server-0 <-> consul-consul-server-1 rtt: 1.884 ms (using LAN coordinates)
~ $ consul rtt consul-consul-server-2
Estimated consul-consul-server-2 <-> consul-consul-server-1 rtt: 1.444 ms (using LAN coordinates)

bo0ts commented 1 year ago

I've also confirmed that communication on port 8301 between the pods is possible using nc.
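
A single nc check only shows that the port was reachable at one moment, while the logged failures are intermittent 1s timeouts. A repeated probe may be more telling. The sketch below is an illustration under assumptions: `probe_tcp` is a hypothetical helper, the 1 second timeout mirrors the "timeout 1s" in the log lines, and 10.141.4.19:8301 is the address taken from those lines. It only measures TCP connection setup, so it will not reproduce a timeout that happens while reading on an already established connection (the logged error is a `read tcp ... i/o timeout`), and it does not exercise UDP on the same port.

```python
import socket
import time


def probe_tcp(host, port, attempts=10, interval=1.0, timeout=1.0):
    """Open and close a TCP connection `attempts` times.

    Returns a list of (ok, seconds) tuples, one per attempt.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # Connect, then close immediately when the `with` block exits.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            ok = False
        results.append((ok, time.monotonic() - start))
        if i < attempts - 1:
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # Address and port copied from the log lines in this issue.
    for ok, elapsed in probe_tcp("10.141.4.19", 8301):
        status = "ok" if ok else "FAILED"
        print(f"{status} {elapsed * 1000:.1f} ms")
```

Running this from inside each server pod against the other two pod IPs, over a longer period, would show whether connection setup itself ever approaches the 1 second limit.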