hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Raft rejoin issue #8489

Closed ngarafol closed 7 months ago

ngarafol commented 4 years ago

Describe the bug: Using the Vault Helm chart, with a Raft + HA setup. After unsealing and joining peers to Raft, deleting one of the pods leaves it unable to rejoin the Raft cluster, and the other nodes keep trying to communicate with the old pod.

To Reproduce Steps to reproduce the behavior:

  1. helm install vault -f values-raft.yaml .
  2. kubectl exec vault-0 -- /bin/sh vault operator init -recovery-shares=1 -recovery-threshold=1 > /vault/data/recovery-key.txt
  3. kubectl exec -ti vault-1 -- vault operator raft join http://vault-0.vault-headless:8200
     kubectl exec -ti vault-2 -- vault operator raft join http://vault-0.vault-headless:8200
  4. kubectl delete pods vault-1
  5. kubectl logs vault-0

Expected behavior: The new node should be able to rejoin the raft cluster, and the other nodes should stop using the old raft node.

Environment:

Vault server configuration file(s):

server:

# extraEnvironmentVars is a list of extra environment variables to set with the stateful set. These could be
# used to include variables required for auto-unseal.
  image:
    repository: "vault"
    tag: "1.4.0-beta1"
    # Overrides the default Image Pull Policy
    pullPolicy: IfNotPresent

  extraEnvironmentVars:
    VAULT_TOKEN: <used for transit unseal>

  ha:
    enabled: true

    raft:
      enabled: true

    service:
      enabled: true
      headless:
        enabled: true

    config: |
      ui = true
      cluster_addr = "http://POD_IP:8201"
      api_addr = "http://vault-0.vault-headless:8200"

      listener "tcp" {
        tls_disable = 1
        address = "[::]:8200"
        cluster_address = "[::]:8201"
      }

      log_level = "Debug"

      storage "raft" {
        path = "/vault/data"
        node_id = "POD_IP:8201"
      }

      seal "transit" {
        address = "https://api.example.org"
        disable_renewal = "false"
        key_name = "examplekey"
        mount_path = "transit/"
        tls_skip_verify = "true"
      }

injector:
  # True if you want to enable vault agent injection.
  enabled: false

Using instructions from here: https://github.com/hashicorp/vault-helm/issues/40

The logs and k8s info below show the vault-2 pod being deleted while vault-0 keeps using the old node_id. The behaviour is the same with 1.4.0-beta1.

I managed to run raft remove-peer to remove the old peer, but the new pod still can't rejoin and I don't know how to proceed, so I need some guidance.
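For reference, the removal was roughly this (the peer id being the stale node_id/address that keeps showing up in the logs below):

$ kubectl exec -ti vault-0 -- vault operator raft remove-peer 10.10.109.83:8201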

$ kubectl get all -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP              
pod/vault-0   1/1     Running   0          7m50s   10.10.117.75   
pod/vault-1   1/1     Running   0          7m52s   10.10.112.90   
pod/vault-2   1/1     Running   0          7m51s   10.10.109.83   

NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/vault            ClusterIP   10.10.54.144   <none>        8200/TCP,8201/TCP   11m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server
service/vault-headless   ClusterIP   None            <none>        8200/TCP,8201/TCP   11m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server

NAME                     READY   AGE   CONTAINERS   IMAGES
statefulset.apps/vault   3/3     11m   vault        vault:1.3.2

$ kubectl describe service vault-headless
Name:              vault-headless
Namespace:         default
Labels:            app.kubernetes.io/instance=vault
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=vault
                   helm.sh/chart=vault-0.4.0
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
Selector:          app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server
Type:              ClusterIP
IP:                None
Port:              http  8200/TCP
TargetPort:        8200/TCP
Endpoints:         10.10.109.83:8200,10.10.112.90:8200,10.10.117.75:8200
Port:              internal  8201/TCP
TargetPort:        8201/TCP
Endpoints:         10.10.109.83:8201,10.10.112.90:8201,10.10.117.75:8201
Session Affinity:  None
Events:            <none>

$ kubectl describe service vault
Name:              vault
Namespace:         default
Labels:            app.kubernetes.io/instance=vault
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=vault
                   helm.sh/chart=vault-0.4.0
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
Selector:          app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server
Type:              ClusterIP
IP:                10.10.54.144
Port:              http  8200/TCP
TargetPort:        8200/TCP
Endpoints:         10.10.109.83:8200,10.10.112.90:8200,10.10.117.75:8200
Port:              internal  8201/TCP
TargetPort:        8201/TCP
Endpoints:         10.10.109.83:8201,10.10.112.90:8201,10.10.117.75:8201
Session Affinity:  None
Events:            <none>

$ kubectl exec vault-2 vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.3.2
Cluster Name             vault-cluster-175d901f
Cluster ID               f034ecf4-0a4b-f103-fa8c-12b8c8ce2e3b
HA Enabled               true
HA Cluster               https://10.10.117.75:8201
HA Mode                  standby
Active Node Address      http://10.10.117.75:8200

$ kubectl delete pods vault-2
pod "vault-2" deleted
$ kubectl get all -o wide
NAME          READY   STATUS              RESTARTS   AGE   IP              
pod/vault-0   1/1     Running             0          14m   10.10.117.75
pod/vault-1   1/1     Running             0          14m   10.10.112.90
pod/vault-2   0/1     ContainerCreating   0          4s    <none>        

NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/kubernetes       ClusterIP   10.10.0.1      <none>        443/TCP             36d   <none>
service/vault            ClusterIP   10.10.54.144   <none>        8200/TCP,8201/TCP   17m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server
service/vault-headless   ClusterIP   None            <none>        8200/TCP,8201/TCP   17m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server

NAME                     READY   AGE   CONTAINERS   IMAGES
statefulset.apps/vault   2/3     17m   vault        vault:1.3.2

$ kubectl get all -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP              
pod/vault-0   1/1     Running   0          14m   10.10.117.75  
pod/vault-1   1/1     Running   0          14m   10.10.112.90   
pod/vault-2   1/1     Running   0          35s   10.10.109.84  

NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/kubernetes       ClusterIP   10.10.0.1      <none>        443/TCP             36d   <none>
service/vault            ClusterIP   10.10.54.144   <none>        8200/TCP,8201/TCP   17m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server
service/vault-headless   ClusterIP   None            <none>        8200/TCP,8201/TCP   17m   app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault,component=server

NAME                     READY   AGE   CONTAINERS   IMAGES
statefulset.apps/vault   3/3     17m   vault        vault:1.3.2

$ kubectl exec vault-2 vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.3.2
Cluster Name             vault-cluster-175d901f
Cluster ID               f034ecf4-0a4b-f103-fa8c-12b8c8ce2e3b
HA Enabled               true
HA Cluster               https://10.10.117.75:8201
HA Mode                  standby
Active Node Address      http://10.10.117.75:8200

2020-03-06T10:51:44.873Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m30.196881968s
2020-03-06T10:51:44.970Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.83:8201 error="dial tcp 10.10.109.83:8201: i/o timeout"
2020-03-06T10:51:46.800Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.83:8201 10.10.109.83:8201}" error="dial tcp 10.10.109.83:8201: i/o timeout"
2020-03-06T10:51:47.348Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m32.671358765s
2020-03-06T10:51:49.835Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m35.158308071s
2020-03-06T10:51:52.312Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m37.635346099s
2020-03-06T10:51:54.763Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m40.086193861s
2020-03-06T10:51:56.096Z [DEBUG] storage.raft.stream: creating rpc dialer: host=raft-65724cee-d53e-da4c-b45b-2347fc59e9ee
2020-03-06T10:51:57.090Z [DEBUG] storage.raft.stream: creating rpc dialer: host=raft-65724cee-d53e-da4c-b45b-2347fc59e9ee
2020-03-06T10:51:57.196Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m42.519957448s
2020-03-06T10:51:59.641Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m44.964353521s
2020-03-06T10:52:02.075Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m47.398880975s
2020-03-06T10:52:04.532Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m49.855738741s
2020-03-06T10:52:06.096Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.83:8201 error="dial tcp 10.10.109.83:8201: i/o timeout"
2020-03-06T10:52:06.974Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m52.297888762s
2020-03-06T10:52:07.091Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.83:8201 10.10.109.83:8201}" error="dial tcp 10.10.109.83:8201: i/o timeout"
2020-03-06T10:52:09.444Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m54.767375921s
2020-03-06T10:52:11.899Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.83:8201 time=2m57.222612534s

2020-03-06T12:12:54.494Z [DEBUG] core.cluster-listener: performing client cert lookup
2020-03-06T12:12:56.593Z [DEBUG] core.cluster-listener: performing server cert lookup
2020-03-06T12:12:56.679Z [DEBUG] core.request-forward: got request forwarding connection

2020-03-06T12:14:31.567Z [INFO]  storage.raft: aborting pipeline replication: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}"
2020-03-06T12:14:31.620Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error=EOF
2020-03-06T12:14:31.686Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:31.687Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: connect: connection refused"
2020-03-06T12:14:31.704Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:31.705Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.93:8201 error="dial tcp 10.10.109.93:8201: connect: connection refused"
2020-03-06T12:14:31.763Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:31.764Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: connect: connection refused"
2020-03-06T12:14:31.879Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:31.880Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: connect: connection refused"
2020-03-06T12:14:31.977Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:32.466Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:34.067Z [WARN]  storage.raft: failed to contact: server-id=10.10.109.93:8201 time=2.500135648s
2020-03-06T12:14:36.560Z [WARN]  storage.raft: failed to contact: server-id=10.10.109.93:8201 time=4.993262183s
2020-03-06T12:14:39.051Z [WARN]  storage.raft: failed to contact: server-id=10.10.109.93:8201 time=7.48483423s
2020-03-06T12:14:41.506Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=9.9395677s
2020-03-06T12:14:41.978Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:14:42.134Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:42.466Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.93:8201 error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:14:43.095Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:43.930Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=12.363453907s
2020-03-06T12:14:46.410Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=14.843281983s
2020-03-06T12:14:48.764Z [DEBUG] core.cluster-listener: performing server cert lookup
2020-03-06T12:14:48.883Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=17.316212078s
2020-03-06T12:14:48.897Z [DEBUG] core.request-forward: got request forwarding connection
2020-03-06T12:14:51.326Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=19.759459187s
2020-03-06T12:14:52.134Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:14:52.295Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:53.095Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.93:8201 error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:14:53.757Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=22.190566992s
2020-03-06T12:14:53.999Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:14:56.247Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=24.680257427s
2020-03-06T12:14:58.713Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=27.146831076s
2020-03-06T12:15:01.174Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=29.607497912s
2020-03-06T12:15:02.295Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 10.10.109.93:8201 10.10.109.93:8201}" error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:15:02.699Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:15:03.633Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=32.066521407s
2020-03-06T12:15:04.000Z [ERROR] storage.raft: failed to heartbeat to: peer=10.10.109.93:8201 error="dial tcp 10.10.109.93:8201: i/o timeout"
2020-03-06T12:15:04.804Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-3413b47b-735b-f375-1bc6-018a7d0a77c9
2020-03-06T12:15:06.053Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=34.48597863s
2020-03-06T12:15:08.540Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=36.97298619s
2020-03-06T12:15:11.016Z [DEBUG] storage.raft: failed to contact: server-id=10.10.109.93:8201 time=39.448944506s
catsby commented 4 years ago

Hello -

This may be an issue with how the pod is destroyed according to the helm chart, but I'm not 100% versed in helm and kubernetes so maybe I'm wrong šŸ˜„

https://github.com/hashicorp/vault-helm/blob/9d92922c9dc1500642278b172a7150c32534de0b/templates/server-statefulset.yaml#L124-L136

It seems we simply kill the process, which is normally fine for Vault, but with Raft I believe we need another step:

$ vault operator raft remove-peer <peer id>
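In Kubernetes terms I imagine that would translate to something like a preStop hook on the server container — an untested sketch only (the command would still need a token permitted to call remove-peer, and it assumes the POD_IP env var the chart uses for config templating is available, since the node id in the config above is POD_IP:8201):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        # best-effort removal of this node from the raft peer set before shutdown
        - vault operator raft remove-peer "${POD_IP}:8201"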

cc @jasonodonnell

This issue may be more appropriate on hashicorp/vault-helm, but we can leave it here for now until there's a bit more investigation.

Thanks!

jasonodonnell commented 4 years ago

Hi @ngarafol,

The following environment variable needs to be added to the Vault StatefulSet for this to work:

- name: VAULT_CLUSTER_ADDR
  value: "https://$(HOSTNAME):8201"

This will make Vault use DNS instead of IP addresses when tracking nodes in the cluster.

Hope that helps!

ngarafol commented 4 years ago

I had to make more modifications. If I do as @jasonodonnell proposed, the hostname is not substituted, so vault status reads:

/ $ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.4.0-beta1
Cluster Name             vault-cluster-769d437c
Cluster ID               f0c86e29-32aa-9626-745d-11c8fc5c9083
HA Enabled               true
HA Cluster               https://$(HOSTNAME):8201
HA Mode                  active

So I had to add:

- name: HOST_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name

and

- name: VAULT_CLUSTER_ADDR
  value: "https://$(HOST_NAME).vault-headless:8201"

inside the env section of the server-statefulset yaml template file. Also, note that I used $(HOST_NAME).vault-headless because that is the only record that resolves inside the pods; using only the hostname won't resolve, for some weird reason.
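Put together, the relevant part of the env section in the statefulset template looks roughly like this for me:

env:
  - name: HOST_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: VAULT_CLUSTER_ADDR
    value: "https://$(HOST_NAME).vault-headless:8201"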

And now after running kubectl delete pod vault-2 I ran into this (log from vault-0):

2020-03-09T11:19:10.225Z [DEBUG] storage.raft: failed to contact: server-id=$(10.10.117.96) time=32.177825809s
2020-03-09T11:19:11.128Z [DEBUG] core.cluster-listener: creating rpc dialer: alpn=raft_storage_v1 host=raft-64efb27c-0356-3e72-890b-d1d68148edc6
2020-03-09T11:19:11.154Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-headless:8201 error="dial tcp: lookup vault-2.vault-headless on 10.10.0.3:53: no such host"
2020-03-09T11:19:12.707Z [DEBUG] storage.raft: failed to contact: server-id=$(10.10.117.96) time=34.659817361s

Somehow it won't resolve vault-2.vault-headless from the vault-0 pod, but nslookup works inside the pod:

$ kubectl exec -it vault-0 nslookup vault-2.vault-headless 10.10.0.3
Server:    10.10.0.3
Address 1: 10.10.0.3 coredns.kube-system.svc.in....

Name:      vault-2.vault-headless
Address 1: 10.10.117.97 10-10-117-97.vault.default.svc.in....

The raft configuration shows three nodes, but one with the wrong IP (the one from before the pod deletion).

Other than deleting a pod, what would be an appropriate way to simulate a pod going missing?

EDIT: Seems I got it working, but I'll have to use the hostname instead of pod_ip as node_id in the future to avoid confusion :nerd_face:
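That would mean a raft stanza roughly like this (a sketch only — it assumes HOSTNAME gets substituted into the rendered config the same way POD_IP does; otherwise it needs the same substitution treatment):

storage "raft" {
  path    = "/vault/data"
  node_id = "HOSTNAME"
}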

webmutation commented 4 years ago

Hi @ngarafol, can you update the issue with a clearer set of instructions? Also, is the raft backend working well for you? I would really like to move towards raft and away from consul.

ngarafol commented 4 years ago

Hi @webmutation. This issue is based on instructions from @jasonodonnell listed here https://github.com/hashicorp/vault-helm/issues/40

Basically, you need vault-helm master, and to pull the files from https://github.com/hashicorp/vault-helm/pull/58 and merge them locally. I am using transit unseal, but it doesn't matter how you unseal Vault.

Since it's safer to use hostnames than IP addresses, you can edit the settings as I did here: https://github.com/hashicorp/vault/issues/8489#issuecomment-596484299

Regarding raft itself, I have only been using it for a few days, so I can't comment at the moment. We also have a Consul-backed setup, but I am testing the raft one.

If all this is still too brief, @jasonodonnell or I can try to write a more detailed guide when time permits.

webmutation commented 4 years ago

Thanks. That should be enough to get me going.

The part that was less clear to me was the settings file; I am unsure what changed regarding the hostname, whether it was only what you commented or something more. Indeed it would seem that DNS instead of IP is the only way to resolve... but my main worry is what happens if the pod gets rescheduled and is not terminated with vault operator raft remove-peer <peer id>. Did you observe any split-brain situations so far?

Yeah, we use Consul as well; I know it is the best-supported backend, but it seems like huge overkill. Embedded Raft would be more lightweight and easier to manage.

ngarafol commented 4 years ago

@catsby

Not 100% sure, but thinking aloud: removing the raft peer is not necessary here. I wanted the peer with the same id (and a new IP) to return to the cluster. If you remove it, you have to manually connect the peer to the leader, and I don't want to do that. I wanted to test HA resilience by simulating a (probably bad) example: deleting a pod. EDIT: Or could I be wrong? Since PVCs are used, will the new node know who the leader is and try to connect to it again? If that is true, what happens if the leader changes to a different node before the new node boots up?

hixichen commented 4 years ago

Hi @ngarafol,

The following environment variable needs to be added to the Vault StatefulSet for this to work:

- name: VAULT_CLUSTER_ADDR
  value: "https://$(HOSTNAME):8201"

This will make Vault use DNS instead of IP addresses when tracking nodes in the cluster.

Hope that helps!

I have the same question as this post.

I do believe that using the hostname will help. However, what I really want is a way to auto-rejoin when a new pod is deployed and the old pod has been deleted.

Right now, the only way is to use a rejoin config with fixed entries.
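That is, retry_join stanzas with hard-coded leader addresses, something like this (a sketch only, reusing the headless-service names from earlier in this thread):

storage "raft" {
  path = "/vault/data"

  retry_join {
    leader_api_addr = "http://vault-0.vault-headless:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-headless:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-headless:8200"
  }
}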

Update in 2022: auto-rejoin can be achieved via auto-join with k8s as the provider; refer to https://github.com/hixichen/deploy-open-source-vault-on-gke/blob/main/helm/values-dev.yaml#L116
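For reference, the auto-join setup from that values file boils down to something like this in the raft stanza (a sketch; the namespace and label_selector are examples to adapt, and it requires a Vault version with cloud auto-join support):

storage "raft" {
  path = "/vault/data"

  retry_join {
    # discover peers via the Kubernetes API instead of fixed addresses
    auto_join = "provider=k8s namespace=vault label_selector=\"app.kubernetes.io/name=vault,component=server\""
    auto_join_scheme = "https"
  }
}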

ngarafol commented 4 years ago

Auto-rejoin works for me, as I said. I deleted a node (pod) and the new node (pod) automatically rejoined, since by raft id it's the same node...

gw0 commented 4 years ago

No, it seems that it is not possible to recover a Raft cluster if IP addresses are used and they change.

I have deployed the Helm chart hashicorp/vault-helm in HA mode with Raft and 3 nodes. By default it injects POD_IP addresses everywhere and the Raft setup looks like:

$ vault operator raft list-peers
Node                                    Address               State       Voter
----                                    -------               -----       -----
91ba5725-c624-9915-1fbb-3a8ec171e29f    100.96.12.86:8201     leader      true
d2b72ece-c095-4289-0ee1-a29d60b84324    100.96.14.119:8201    follower    true
f712c3ed-c2a2-9b7d-f83c-effaad8a99af    100.96.8.104:8201     follower    true

If I then take down all the Vault nodes by deleting the Helm chart with $ helm delete --purge vault (leaving the PVCs and PVs intact, so the storage is not removed) and deploy the same Helm chart again, my Kubernetes cluster assigns completely different IP addresses to all Vault nodes. I end up in the following situation, which is impossible to recover from (almost no command works):

$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.4.2
Cluster Name             vault-cluster-c8fdde71
Cluster ID               8fccaa29-df37-4211-9dfb-17f5d5393a8d
HA Enabled               true
HA Cluster               https://100.96.12.86:8201
HA Mode                  standby
Active Node Address      https://100.96.12.86:8200
Raft Committed Index     2652
Raft Applied Index       2652
$ vault token lookup
Error looking up token: context deadline exceeded
$ vault operator raft list-peers
Error reading the raft cluster configuration: context deadline exceeded
$ vault operator raft join https://vault-api-addr:8200
Error joining the node to the raft cluster: Error making API request.

URL: POST https://127.0.0.1:8200/v1/sys/storage/raft/join
Code: 500. Errors:

* raft storage is already initialized
{"@level":"info","@message":"entering candidate state","@module":"storage.raft","@timestamp":"2020-06-17T13:28:56.379002Z","node":{},"term":544}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2020-06-17T13:28:56.380535Z","alpn":"raft_storage_v1","host":"raft-a906b8db-1279-2d66-4075-be3f5f55b544"}
{"@level":"debug","@message":"votes","@module":"storage.raft","@timestamp":"2020-06-17T13:28:56.382887Z","needed":2}
{"@level":"debug","@message":"vote granted","@module":"storage.raft","@timestamp":"2020-06-17T13:28:56.382932Z","from":"d2b72ece-c095-4289-0ee1-a29d60b84324","tally":1,"term":544}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2020-06-17T13:28:56.382978Z","alpn":"raft_storage_v1","host":"raft-a906b8db-1279-2d66-4075-be3f5f55b544"}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2020-06-17T13:28:58.580081Z","error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 100.96.12.86:8201: i/o timeout\""}
{"@level":"error","@message":"failed to make requestVote RPC","@module":"storage.raft","@timestamp":"2020-06-17T13:29:00.270019Z","error":"dial tcp 100.96.12.86:8201: i/o timeout","target":{"Suffrage":0,"ID":"91ba5725-c624-9915-1fbb-3a8ec171e29f","Address":"100.96.12.86:8201"}}
{"@level":"error","@message":"failed to make requestVote RPC","@module":"storage.raft","@timestamp":"2020-06-17T13:29:00.273109Z","error":"dial tcp 100.96.8.104:8201: i/o timeout","target":{"Suffrage":0,"ID":"f712c3ed-c2a2-9b7d-f83c-effaad8a99af","Address":"100.96.8.104:8201"}}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2020-06-17T13:29:03.580091Z","error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 100.96.12.86:8201: i/o timeout\""}
{"@level":"warn","@message":"Election timeout reached, restarting election","@module":"storage.raft","@timestamp":"2020-06-17T13:29:03.957558Z"}

As you can see, it attempts to connect to the previous active node's IP address (100.96.12.86), but there is no Vault node on that IP anymore. And with vault operator raft join it is not possible to join a valid Vault cluster, because raft storage is already initialized. The only solution is to use DNS everywhere, as @jasonodonnell suggested, or you risk losing access to Vault after a disaster.

calvn commented 4 years ago

If the node was removed via remove-peer, you'd have to clear out its raft data first (i.e. the directory specified in the config's storage.path) in order to rejoin it to the cluster. It'd be good to take a backup of that dir, or move it elsewhere, before you do so, just in case.
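In the setup from this thread that would look roughly like the following (a sketch only — copy the data somewhere safe first; the join target is the same one used earlier in this issue):

# on the node that was already removed from the cluster, clear its raft data
$ kubectl exec -ti vault-2 -- sh -c 'rm -rf /vault/data/*'
# restart the pod so it comes up with empty storage, then join it back
$ kubectl delete pod vault-2
$ kubectl exec -ti vault-2 -- vault operator raft join http://vault-0.vault-headless:8200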

fewknow commented 2 years ago

@jasonodonnell cc : @hixichen

I am running integrated storage with Raft, version 1.9.3.

I have DNS set up to use the headless service.

- name: VAULT_CLUSTER_ADDR
  value: https://$(HOSTNAME).musw2-0-vault-internal:8201

This is my retry_join:

  path = "/vault/data"
  retry_join {
    leader_api_addr = "https://musw2-0-vault-0.musw2-0-vault-internal:8201"
    leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
    leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
    leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
  }
  retry_join {
    leader_api_addr = "https://musw2-0-vault-1.musw2-0-vault-internal:8201"
    leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
    leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
    leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
  }
  retry_join {
    leader_api_addr = "https://musw2-0-vault-2.musw2-0-vault-internal:8201"
    leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
    leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
    leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
  }
  retry_join {
    leader_api_addr = "https://musw2-0-vault-3.musw2-0-vault-internal:8201"
    leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
    leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
    leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
  }
  retry_join {
    leader_api_addr = "https://musw2-0-vault-4.musw2-0-vault-internal:8201"
    leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
    leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
    leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
  }
}

I can confirm the DNS entries are correct

(two screenshots of successful DNS lookups omitted)

The other nodes are getting connection refused, cannot join the cluster, and the heartbeat is failing.

Logs:

2022-06-24T19:21:45.815Z [ERROR] storage.raft: failed to heartbeat to: peer=musw2-0-vault-1.musw2-0-vault-internal:8201 error="dial tcp 172.20.9.24:8201: connect: connection refused"
2022-06-24T19:21:46.050Z [INFO]  http: TLS handshake error from 10.241.247.199:4602: EOF
2022-06-24T19:21:46.540Z [INFO]  http: TLS handshake error from 10.241.247.198:55618: EOF
2022-06-24T19:21:46.645Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter f8cf0d96-a735-2172-90da-111e83423303 musw2-0-vault-1.musw2-0-vault-internal:8201}" error="dial tcp 172.20.9.24:8201: connect: connection refused"
2022-06-24T19:21:47.089Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["h2", "http/1.1"]

I have 2 separate clusters running and have run vault operator init on the one above.

The other cluster has similar logs from all nodes and none are unsealed. I am using Azure Key Vault for auto-unseal.

This is critical for our implementation; we are Enterprise customers and I will be reaching out, but I wanted to post here as well.

Thanks.

ngarafol commented 2 years ago

@fewknow Thanks for sharing, but I think your issue is more related to the cluster being sealed. The original issue I had (OP) was that a raft rejoin would not work on an already unsealed cluster because an IP address was used instead of an FQDN.

fewknow commented 2 years ago

@ngarafol - yes, my issue was just the ports; changing 8201 to 8200 solved it. Sorry about the noise.
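To be explicit, that means pointing the retry_join stanzas at the API port (8200) rather than the cluster port, roughly:

retry_join {
  leader_api_addr = "https://musw2-0-vault-0.musw2-0-vault-internal:8200"
  leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
  leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
  leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
}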

aphorise commented 2 years ago

I suspect that the issue is related to the setup / configuration (in Azure?).

Hey @ngarafol, do you still require further input here, or is it okay to close? Sorry I'm late here and trying to understand what's next.

heatherezell commented 7 months ago

Has this issue been reproduced in a current version of Vault? Please let me know if this is still applicable. Thanks!

ngarafol commented 7 months ago

The original issue was due to an IP being used instead of an FQDN. I believe that as long as an FQDN is used, this issue does not exist at all. Will close; feel free to reopen.