hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Handle agent IP address changes on restart (no Raft quorum) in data recovery scenario #10678

Open Montbuet opened 3 years ago

Montbuet commented 3 years ago

Hello,

I have a Consul cluster deployed in a Kubernetes cluster. When I stop the Consul cluster and start it again, every pod gets a new IP address and the Raft election fails:

[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter b91......... 172.XX.XX.XX:8300}" error="dial tcp <nil>->172.XX.XX.XX:8300: i/o timeout" (172.XX.XX.XX is not assigned to any pod)

The only thing I can do is delete every volume and restore a snapshot, after which everything works again. But that is obviously time-consuming.
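For reference, the manual recovery is roughly the following sketch (the snapshot file name is a placeholder, and the restore assumes the freshly redeployed cluster has already elected a leader):

# taken while the cluster was still healthy
consul snapshot save backup.snap
# after wiping the data volumes and redeploying the chart
consul snapshot restore backup.snap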

How can I tell Consul that the IP address of each pod changes at every restart? Is there anything I can put in my Helm values file to do that?

Ty and have a nice day

david-yu commented 3 years ago

Hi @Montbuet, I'm going to transfer this over to our consul-k8s repo. Could you provide steps on how to reproduce the problem you are seeing? If you could provide a Helm config YAML file and some reproduction steps, that would be very helpful.

Montbuet commented 3 years ago

Hi @david-yu ,

Helm config file:

consul:
  global:
    enabled: true
    name: "consul"
    datacenter: "K8S"
    tls:
      enabled: false
      httpsOnly: false

  client:
    enabled: true
    securityContext:
      runAsNonRoot: false
      runAsGroup: 0
      runAsUser: 0
      fsGroup: 0

  server:
    replicas: 3
    bootstrap_expect: 3
    enabled: true
    securityContext:
      runAsNonRoot: false
      runAsGroup: 0
      runAsUser: 0
      fsGroup: 0
  ui:
    enabled: true

Reproduce steps:

- Deploy the chart above, then stop the whole Consul cluster (delete or scale down every server pod) and start it again: every pod comes back with a new IP address and the Raft election fails as in the log above (see the sketch below).

What I want to do:

- Have the cluster recover on its own after such a full restart, without deleting every volume and restoring a snapshot by hand.

If you have any clue on how to do that, I would be very grateful.
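A minimal reproduction sketch (the StatefulSet name, client pod labels and namespace are assumptions based on the default chart naming and the outputs further down, not taken from any docs):

# stop the whole Consul cluster, servers and clients
kubectl -n vault scale statefulset consul-server --replicas=0
kubectl -n vault delete pod -l app=consul,component=client
# bring the servers back; every pod gets a new IP address
kubectl -n vault scale statefulset consul-server --replicas=3
# Raft keeps dialing the old addresses, as in the requestVote timeout above
kubectl -n vault logs consul-server-0 | tail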

ishustava commented 3 years ago

Hey @Montbuet

What is the use case for that? Are you thinking of a disaster scenario where all pods go down? Right now, Consul (or rather Raft) cannot handle IP address changes when the cluster does not have quorum.

It looks like there has already been some discussion about changing Raft to support DNS addresses, or more generally to accommodate the case where the entire cluster goes down, but I don't think it has been addressed yet.

@david-yu If the request is to make Consul support IP address changes for the entire cluster, then I think this issue is more appropriate in hashicorp/consul, as consul-helm behaves as expected.

Montbuet commented 3 years ago

Hey, and thank you for your answer. Yes, this is precisely for a disaster recovery use case. Doing it by hand is time-consuming, and in case something bad happens, it would be great if everything started up again without human intervention.

ishustava commented 3 years ago

Makes sense. I'll transfer it back to Consul so that it can be better tracked there. Sorry for all the back and forth!

For now, I think your best bet is to work around this manually or with scripting as this is likely a more involved change.
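One scripted option in the meantime is the peers.json recovery file from Consul's outage recovery documentation: with all servers stopped, write the server IDs and their new addresses into <data-dir>/raft/peers.json on each server, then start them again (on Kubernetes this typically means doing it from an init container or directly on the volume). A rough sketch, assuming the chart's usual data directory of /consul/data and Raft protocol 3 (the IDs and IP addresses are placeholders):

# run on each stopped server before restarting it
cat > /consul/data/raft/peers.json <<'EOF'
[
  { "id": "<node-id of consul-server-0>", "address": "<new IP of consul-server-0>:8300", "non_voter": false },
  { "id": "<node-id of consul-server-1>", "address": "<new IP of consul-server-1>:8300", "non_voter": false },
  { "id": "<node-id of consul-server-2>", "address": "<new IP of consul-server-2>:8300", "non_voter": false }
]
EOF

Consul ingests and deletes this file on startup and rebuilds its Raft configuration from it, which sidesteps the stale addresses.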

jkirschner-hashicorp commented 3 years ago

@Montbuet : I've attempted to reword the title to reflect the enhancement that you're requesting. Please correct if needed!

Montbuet commented 3 years ago

Hello, I spent some time investigating the issue and noticed something weird: the consul members command returns the correct consul-server IPs, but the logs still use the old IPs.

user@user:~$ kubectl -n vault get pods -o wide -w
NAME                                                  READY   STATUS    RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
consul-5tg6h                                          0/1     Running   0          64m     172.19.43.81    k8s-worker-003   <none>           <none>
consul-br62r                                          0/1     Running   0          64m     172.19.52.222   k8s-worker-006   <none>           <none>
consul-h9flv                                          0/1     Running   0          64m     172.19.58.141   k8s-worker-001   <none>           <none>
consul-jpf7h                                          0/1     Running   0          64m     172.19.32.114   k8s-worker-010   <none>           <none>
consul-nhtzd                                          0/1     Running   0          64m     172.19.43.227   k8s-worker-009   <none>           <none>
consul-qwsqn                                          0/1     Running   0          64m     172.19.49.27    k8s-worker-005   <none>           <none>
consul-server-0                                       0/1     Running   0          4m7s    172.19.44.132   k8s-worker-003   <none>           <none>
consul-server-1                                       0/1     Running   0          4m33s   172.19.46.140   k8s-worker-002   <none>           <none>
consul-server-2                                       0/1     Running   0          3m57s   172.19.59.33    k8s-worker-001   <none>           <none>
consul-sv2nf                                          0/1     Running   0          64m     172.19.42.200   k8s-worker-008   <none>           <none>
consul-wdf5p                                          0/1     Running   0          64m     172.19.43.153   k8s-worker-004   <none>           <none>
consul-wwdfr                                          0/1     Running   0          64m     172.19.46.135   k8s-worker-002   <none>           <none>
consul-xv2gk                                          0/1     Running   0          64m     172.19.44.3     k8s-worker-007   <none>           <none>

Here are the correct server IPs: consul-server-0 172.19.44.132, consul-server-1 172.19.46.140, consul-server-2 172.19.59.33.

Now let's see what the consul members command returns:

user@user:~$ kubectl -n vault exec -it consul-server-0 -- /bin/sh
/ # consul members
Node             Address             Status  Type    Build   Protocol  DC      Segment
consul-server-0  172.19.44.132:8301  alive   server  1.10.0  2         k8s  <all>
consul-server-1  172.19.46.140:8301  alive   server  1.10.0  2         k8s  <all>
consul-server-2  172.19.59.33:8301   alive   server  1.10.0  2         k8s  <all>
k8s-worker-001   172.19.58.141:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-002   172.19.46.135:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-003   172.19.43.81:8301   alive   client  1.10.0  2         k8s  <default>
k8s-worker-004   172.19.43.153:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-005   172.19.49.27:8301   alive   client  1.10.0  2         k8s  <default>
k8s-worker-006   172.19.52.222:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-007   172.19.44.3:8301    alive   client  1.10.0  2         k8s  <default>
k8s-worker-008   172.19.42.200:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-009   172.19.43.227:8301  alive   client  1.10.0  2         k8s  <default>
k8s-worker-010   172.19.32.114:8301  alive   client  1.10.0  2         k8s  <default>

The IPs of the consul-server-0, consul-server-1 and consul-server-2 pods match the correct ones, but the logs show something weird:

user@user:~$ kubectl -n vault logs consul-server-0 | tail
2021-07-27T09:09:46.064Z [WARN]  agent: Syncing node info failed.: error="No cluster leader"
2021-07-27T09:09:46.065Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2021-07-27T09:09:50.469Z [INFO]  agent.server.raft: duplicate requestVote for same term: term=581
2021-07-27T09:09:51.349Z [WARN]  agent.server.raft: rejecting vote request since our last term is greater: candidate=172.19.59.33:8300 last-term=382 last-candidate-term=9
2021-07-27T09:09:51.349Z [INFO]  agent.server.raft: entering follower state: follower="Node at 172.19.44.132:8300 [Follower]" leader=
2021-07-27T09:09:54.321Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 6deb73b7-e732-ee1b-2e98-9535ea4bd8f1 172.19.44.160:8300}" error="dial tcp <nil>->172.19.44.160:8300: i/o timeout"
2021-07-27T09:09:59.472Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
2021-07-27T09:09:59.472Z [INFO]  agent.server.raft: entering candidate state: node="Node at 172.19.44.132:8300 [Candidate]" term=583
2021-07-27T09:09:59.475Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=6deb73b7-e732-ee1b-2e98-9535ea4bd8f1 fallback=172.19.44.160:8300 error="Could not find address for server id 6deb73b7-e732-ee1b-2e98-9535ea4bd8f1"
2021-07-27T09:10:00.457Z [INFO]  agent.server.raft: duplicate requestVote for same term: term=583

Full Log: https://pastebin.com/8ZP3zfSZ
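(Note: consul members reflects Serf gossip membership, while the errors above come from the Raft configuration persisted on disk, which still holds the old addresses. The Raft view can be inspected directly; -stale lets a server answer even while there is no leader:)

/ # consul operator raft list-peers -stale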

EDIT: I think I found where the issue comes from. My Consul cluster is backed by 3 PVCs: one is mapped to an NFS volume (consul-0), and the other two are local volumes (consul-1/ and consul-2/) on the node the pod runs on.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: consul
provisioner: kubernetes.io/no-provisioner
mountOptions:
  - uid=1000
  - gid=1000

---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: consul-pv-volume-0
  labels:
    type: local
spec:
  storageClassName: consul
  capacity:
    storage: 15Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "{{ $.Values.global.nfs_storage }}consul-0"

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-vault-consul-server-0
  labels:
    app: consul-storage-claim
spec:
  storageClassName: consul
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---

kind: PersistentVolume
apiVersion: v1
metadata:
  name: consul-pv-volume-1
  labels:
    type: local
spec:
  storageClassName: consul
  capacity:
    storage: 15Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/opt/storage/consul-1"

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-vault-consul-server-1
  labels:
    app: vault-storage-claim
spec:
  storageClassName: consul
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---

kind: PersistentVolume
apiVersion: v1
metadata:
  name: consul-pv-volume-2
  labels:
    type: local
spec:
  storageClassName: consul
  capacity:
    storage: 15Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/opt/storage/consul-2"

---          

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-vault-consul-server-2
  labels:
    app: consul-storage-claim
spec:
  storageClassName: consul
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

When the consul-server-1 or consul-server-2 pod is deleted and recreated on a different node, /opt/storage/consul-1/ or /opt/storage/consul-2/ is created from scratch on that node, and a new node ID is generated in this directory. So what I understand is: there is a mismatch between the old and new node IDs, and the Consul cluster does not want to start without the old ones rejoining.
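(The file in question, assuming the chart mounts the server data volume at /consul/data:)

/ # cat /consul/data/node-id
<uuid generated the first time an agent started on this volume>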

Workaround: each Consul server pod has to use a volume that contains its existing node-id file. What I did is write the consul-server-1 and consul-server-2 node IDs (no need for consul-0, because it runs on NFS) to the consul-1/ and consul-2/ local volumes on every node in my k8s cluster. Right now I am working on a better fix.
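A sketch of that workaround (the worker hostnames and node-id values are placeholders; the paths match the PVs above):

# pre-seed the local data directories on every worker node so a rescheduled
# server keeps the node ID it had before
for host in k8s-worker-001 k8s-worker-002 k8s-worker-003; do   # ...and the other workers
  ssh "$host" 'mkdir -p /opt/storage/consul-1 /opt/storage/consul-2'
  ssh "$host" 'echo "<node-id of consul-server-1>" > /opt/storage/consul-1/node-id'
  ssh "$host" 'echo "<node-id of consul-server-2>" > /opt/storage/consul-2/node-id'
done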