lohmag opened this issue 3 years ago
I'm having the same issue. Is there a way to manually promote a node?
It may be better not to have generic [::]: type addressing and instead use its correlating FQDN / resolving address, and to separately comment out any other adapter bindings like 127.0.0.1 that may, for example, be needed only on an administrative basis (those can be enabled on demand via a ConfigMap, for example).
In any event - on a cluster of three (3) nodes, losing one (1) node and having only two (2) requires both to be present and in sync to be able to negotiate who will then be the leader - which will fail if both are not in the same responsive or agreed state. This is important because a loss of quorum will then lead to a need to recover if the (cluster) Raft state is no longer present, and typical operator commands like vault operator step-down, which can trigger a re-election, cannot be issued, nor can others like vault operator raft remove-peer ... used for scaling (see the command sketch below).
Anyway - thinking aloud - if you had five (5) instances and lost one, or even a couple, I'd imagine the proper scaling and managing of this would be much easier. Of course, this is all with the assumption that all pods and the underlying CNI, network, and all other infra-level dependencies are always properly in place for the majority of instances 😄
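As a rough illustration of those commands, assuming quorum is still intact and you are authenticated against a live node (the peer ID below is a placeholder, not from this thread):

# list the peers the Raft cluster currently knows about
vault operator raft list-peers

# ask the active node to step down and trigger a leader re-election
vault operator step-down

# remove a dead or stale peer by its node ID
vault operator raft remove-peer <node-id>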
What address should I use? Is it address = "0.0.0.0:8200"?
I'm also facing the same issue with Vault 1.5.2 HA using Raft storage.
2021-01-01T18:29:39.806Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 943ce4ae-6fa0-5257-7a1b-f215388321ee vault-eks-2.vault-eks-internal:8201}" error="read tcp 10.143.137.136:43322->10.143.148.166:8201: i/o timeout"
2021-01-01T18:29:40.679Z [WARN] storage.raft: failed to contact: server-id=943ce4ae-6fa0-5257-7a1b-f215388321ee time=2.50381755s
Vault server configuration file:
disable_mlock = true
ui = true

listener "tcp" {
  tls_disable     = 1
  address         = "[::]:8200"
  cluster_address = "[::]:8201"
}

storage "raft" {
  path = "/vault/data"
}
@lohmag - I was trying to point out that it's better to be explicit with the addressing, by way of FQDN or otherwise IP, instead of a glob-level binding, which will keep rebinding across all underlying adapter / IP changes even when, for example, your TLS certificates would not qualify.
One approach can be to have a separate listener stanza on loopback for administration purposes with TLS disabled, then another for inter-cluster / node-to-node communication, and a third listener stanza for end users or the WAN; each may carry both address and cluster_address, or only address if exposing the API only.
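A minimal sketch of that layout, assuming a Kubernetes StatefulSet where each pod resolves as vault-0.vault-internal and so on (hostnames, ports, and TLS paths here are illustrative, not from this thread):

# administrative listener on loopback only, TLS disabled on purpose
listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = 1
}

# inter-cluster / node-to-node listener bound to the pod's FQDN
listener "tcp" {
  address         = "vault-0.vault-internal:8200"
  cluster_address = "vault-0.vault-internal:8201"
  tls_cert_file   = "/vault/tls/tls.crt"
  tls_key_file    = "/vault/tls/tls.key"
}

# advertise explicit addresses instead of whatever [::] resolves to
api_addr     = "https://vault-0.vault-internal:8200"
cluster_addr = "https://vault-0.vault-internal:8201"

A third, WAN-facing listener would follow the same pattern on the externally routed address; the point is that nothing is left to a wildcard bind.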
I ran into this issue when migrating from Consul to Integrated Storage. Adding cluster_addr = "http://vault-0.vault-internal:8201" to the migration configuration (migrate.hcl) solved the problem.
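For context, the configuration for vault operator migrate follows a storage_source / storage_destination layout; a minimal sketch along the lines of that fix might look like this (the Consul address and paths are assumptions):

storage_source "consul" {
  address = "consul-server:8500"
  path    = "vault/"
}

storage_destination "raft" {
  path = "/vault/data"
}

# pin the cluster address so the migrated node does not advertise a stale IP
cluster_addr = "http://vault-0.vault-internal:8201"

It is then run with vault operator migrate -config=migrate.hcl while the Vault servers are stopped.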
It seems this problem is still not solved; we are facing the same issue. After a crash recovery, the active node address was stuck at the old value, so Vault could not be started again.
After the nodes' IP addresses changed, all of the servers entered the follower state. Creating a peers.json as described in https://developer.hashicorp.com/vault/tutorials/raft/raft-lost-quorum fixed it for me.
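For reference, peers.json is a JSON array of the surviving voters dropped into the Raft data directory (e.g. /vault/data/raft/peers.json); the IDs and addresses below are placeholders modeled on the log output earlier in this thread:

[
  {
    "id": "943ce4ae-6fa0-5257-7a1b-f215388321ee",
    "address": "vault-eks-2.vault-eks-internal:8201",
    "non_voter": false
  },
  {
    "id": "<second-surviving-node-id>",
    "address": "vault-eks-0.vault-eks-internal:8201",
    "non_voter": false
  }
]

Each node rebuilds its Raft configuration from this file on the next start, and the file is deleted automatically after it is read.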
I have Vault installed from the Helm chart, running with integrated Raft storage. After a while the cluster lost its leader and couldn't re-elect one. vault status shows Active Node Address with a non-existent IP. It looks like the IP got stuck from an old pod and for some reason can't be updated. The cluster was completely unusable; I had to restore it from backup.
Vault server configuration file(s):