After the three nodes are restarted, consul cannot provide services, and each consul is caught in an endless election cycle

jackin853 commented 2 years ago

After the three nodes are restarted, consul cannot provide services, and each consul is caught in an endless election cycle In the kubernetes environment, using statefulset to deploy three instances of consul, everything runs normally, until after the three nodes are restarted, consul cannot provide services, and each consul is caught in an endless election cycle.Below is my statefulset configuration:

apiVersion: apps/v1 kind: StatefulSet metadata: name: test-consul-statefulset namespace: test labels: app: test-consul-statefulset component: test-consul-server spec: serviceName: test-consul-headless replicas: 3 selector: matchLabels: app: test-consul-statefulset component: test-consul-server template: metadata: labels: app: test-consul-statefulset component: test-consul-server spec: serviceAccountName: test-consul-service-account nodeSelector: test-label: test-label affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution:

labelSelector: matchExpressions:
- key: app operator: In values:
  - test-consul-statefulset topologyKey: kubernetes.io/hostname terminationGracePeriodSeconds: 600 volumes:
    - name: data hostPath: path: /test_data/consul_data
    - name: host-time hostPath: path: /etc/localtime containers:
    - name: test-consul image: consul:1.9.16 imagePullPolicy: IfNotPresent resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 300Mi args:
    - "agent"
    - "-server"
    - "-ui"
    - "-bootstrap-expect=3"
    - "-data-dir=/consul/data"
    - "-datacenter=testdc"
    - "-bind=0.0.0.0"
    - "-client=0.0.0.0"
    - "-disable-host-node-id=true"
    - "-domain=cluster.consul"
    - "-advertise=$(PODIP)"
    - "-retry-join=test-consul-statefulset-0.test-consul-headless.$(NAMESPACE).svc.cluster.local"
    - "-retry-join=test-consul-statefulset-1.test-consul-headless.$(NAMESPACE).svc.cluster.local"
    - "-retry-join=test-consul-statefulset-2.test-consul-headless.$(NAMESPACE).svc.cluster.local" volumeMounts:
    - name: data mountPath: /consul/data
    - name: host-time mountPath: /etc/localtime env:
    - name: PODIP valueFrom: fieldRef: fieldPath: status.podIP
    - name: NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace livenessProbe: httpGet: path: / port: 8500 initialDelaySeconds: 180 periodSeconds: 30 readinessProbe: httpGet: path: / port: 8500 initialDelaySeconds: 5 timeoutSeconds: 10 ports:
    - containerPort: 8500 name: http
    - containerPort: 8300 name: server
    - containerPort: 8301 name: serflan
    - containerPort: 8302 name: serfwan
    - containerPort: 8400 name: cli-port
    - containerPort: 8600 name: consuldns

By entering each container and using "consul members", I can only see itself, and the log of each pod is the following output (sorry the original log is no longer there, the server has been emptied):

2022-03-25T22:53:31.731Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-03-25T22:53:31.731Z [INFO]  agent.server.raft: entering candidate state: node="Node at 177.177.136.166:8300 [Candidate]" term=177667
2022-03-25T22:53:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=575bbaa9-7a22-886e-c431-bb6d762ebafc fallback=177.177.238.20:8300 error="Could not find address for server id 575bbaa9-7a22-886e-c431-bb6d762ebafc"
2022-03-25T22:53:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=19067efc-dfc8-c8b9-dd27-6fe28fb96097 fallback=177.177.162.4:8300 error="Could not find address for server id 19067efc-dfc8-c8b9-dd27-6fe28fb96097"
2022-03-25T22:53:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 575bbaa9-7a22-886e-c431-bb6d762ebafc 177.177.238.20:8300}" error="dial tcp <nil>->177.177.238.20:8300: connection: connection refused"
2022-03-25T22:53:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 19067efc-dfc8-c8b9-dd27-6fe28fb96097 177.177.162.4:8300}" error="dial tcp <nil>->177.177.162.4:8300: connection: connection refused"
2022-03-25T22:53:37.830Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/.................. error="No cluster leader"
2022-03-25T22:59:31.731Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-03-25T22:59:31.731Z [INFO]  agent.server.raft: entering candidate state: node="Node at 177.177.136.166:8300 [Candidate]" term=177668
2022-03-25T22:59:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=575bbaa9-7a22-886e-c431-bb6d762ebafc fallback=177.177.238.20:8300 error="Could not find address for server id 575bbaa9-7a22-886e-c431-bb6d762ebafc"
2022-03-25T22:59:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=19067efc-dfc8-c8b9-dd27-6fe28fb96097 fallback=177.177.162.4:8300 error="Could not find address for server id 19067efc-dfc8-c8b9-dd27-6fe28fb96097"
2022-03-25T22:59:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 575bbaa9-7a22-886e-c431-bb6d762ebafc 177.177.238.20:8300}" error="dial tcp <nil>->177.177.238.20:8300: connection: connection refused"
2022-03-25T22:59:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 19067efc-dfc8-c8b9-dd27-6fe28fb96097 177.177.162.4:8300}" error="dial tcp <nil>->177.177.162.4:8300: connection: connection refused"
2022-03-25T23:02:31.830Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/.................. error="No cluster leader"
2022-03-25T23:02:34.830Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/.................. error="No cluster leader"
2022-03-25T23:02:36.830Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/.................. error="No cluster leader"
2022-03-25T23:04:31.731Z [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-03-25T23:04:31.731Z [INFO]  agent.server.raft: entering candidate state: node="Node at 177.177.136.166:8300 [Candidate]" term=177669
2022-03-25T23:04:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=575bbaa9-7a22-886e-c431-bb6d762ebafc fallback=177.177.238.20:8300 error="Could not find address for server id 575bbaa9-7a22-886e-c431-bb6d762ebafc"
2022-03-25T23:04:31.829Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=19067efc-dfc8-c8b9-dd27-6fe28fb96097 fallback=177.177.162.4:8300 error="Could not find address for server id 19067efc-dfc8-c8b9-dd27-6fe28fb96097"
2022-03-25T23:04:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 575bbaa9-7a22-886e-c431-bb6d762ebafc 177.177.238.20:8300}" error="dial tcp <nil>->177.177.238.20:8300: connection: connection refused"
2022-03-25T23:04:31.830Z [ERROR] agent.server.raft: Failed to make RequestVote RPC: target="{Voter 19067efc-dfc8-c8b9-dd27-6fe28fb96097 177.177.162.4:8300}" error="dial tcp <nil>->177.177.162.4:8300: connection: connection refused"

I'm wondering why each consul pod is caught in an endless loop of elections, and retry-join has been set up, why can't they see each other? I delete the corresponding statefulset through “kubectl delete sts”, the data is still retained, and then execute “kubectl create -f statefulset.yaml”, consul runs normally again, and the leader is successfully elected. I am now wondering if there is something wrong with my configuration, or something else? Hope to get help.

jackin853 commented 2 years ago

This seems to be an accidental phenomenon. I have tried the same operation several times, but I cannot reproduce the problem.

Amier3 commented 2 years ago

Hey @jackin853

Sounds like an odd issue. How often are you experiencing this? Looking at some of the past issue we've come across, it seems like this is a lot similar to the behavior seen in #7750. The fix for that issue is outlined in #2868 and has some workarounds in the comments that may work for you 🤞

Let me know if that sounds like what you're experiencing and if the workaround ( using bootstrap_expect=3 ) works for you

jackin853 commented 2 years ago

@Amier3 Thanks，I will look forward to #2868 Here, I have another question, I want to know how to configure the log output of consul, through the above statefuset deployment method, I can't find the relevant log output directory, I want to redirect it to a local file, because Currently we have not implemented a dynamic pod log phone based on EFK

hashicorp / consul

After the three nodes are restarted, consul cannot provide services, and each consul is caught in an endless election cycle #12654