lohmag opened this issue 3 years ago
I'm having the same issue. Is there a way to manually promote a node?
It may be better not to have generic [::]: type addressing and instead use its correlating FQDN / resolving address, and to separately comment out any other adapter bindings like 127.0.0.1 that may, for example, be needed only on an administrative basis (those can be enabled on demand via a ConfigMap, for example).
In any event - on a cluster of three (3) nodes, losing one (1) node and having only two (2) requires both to be present and in sync to be able to negotiate who will then be the leader - which will fail if both are not in the same responsive or agreed state. This is important because a loss of quorum will then lead to a need to recover if the (cluster) Raft state is no longer present, and typical operator commands like vault operator step-down, which can trigger a re-election, cannot be issued, nor can others like vault operator raft remove-peer ... used for scaling (see the command sketch below).
Anyway - thinking aloud - if you had five (5) instances and lost one, or even a couple, I'd imagine the proper scaling and managing of this would be much easier. Of course, this is all with the assumption that all pods and the underlying CNI, network, and all other infra-level dependencies are always properly in place for the majority of instances 😄
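As a rough illustration of those commands, assuming quorum is still intact and you are authenticated against a live node (the peer ID below is a placeholder, not from this thread):

# list the peers the Raft cluster currently knows about
vault operator raft list-peers

# ask the active node to step down and trigger a leader re-election
vault operator step-down

# remove a dead or stale peer by its node ID
vault operator raft remove-peer <node-id>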
What address should I use? Is it address = "0.0.0.0:8200"?
I'm also facing the same issue with Vault 1.5.2 HA using Raft storage.
2021-01-01T18:29:39.806Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 943ce4ae-6fa0-5257-7a1b-f215388321ee vault-eks-2.vault-eks-internal:8201}" error="read tcp 10.143.137.136:43322->10.143.148.166:8201: i/o timeout"
2021-01-01T18:29:40.679Z [WARN] storage.raft: failed to contact: server-id=943ce4ae-6fa0-5257-7a1b-f215388321ee time=2.50381755s
Vault server configuration file:
disable_mlock = true
ui = true

listener "tcp" {
  tls_disable     = 1
  address         = "[::]:8200"
  cluster_address = "[::]:8201"
}

storage "raft" {
  path = "/vault/data"
}
@lohmag - I was trying to point out that it's better to be explicit with the addressing, by way of FQDN or otherwise IP, instead of a glob-level binding, which will keep rebinding across all underlying adapter / IP changes even when, for example, your TLS certificates would not qualify.
One approach can be to have a separate listener stanza on loopback for administration purposes with TLS disabled, then another for inter-cluster / node-to-node communication, and a third listener stanza for end users or the WAN; each may carry both address and cluster_address, or only address if exposing the API only.
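A minimal sketch of that layout, assuming a Kubernetes StatefulSet where each pod resolves as vault-0.vault-internal and so on (hostnames, ports, and TLS paths here are illustrative, not from this thread):

# administrative listener on loopback only, TLS disabled on purpose
listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = 1
}

# inter-cluster / node-to-node listener bound to the pod's FQDN
listener "tcp" {
  address         = "vault-0.vault-internal:8200"
  cluster_address = "vault-0.vault-internal:8201"
  tls_cert_file   = "/vault/tls/tls.crt"
  tls_key_file    = "/vault/tls/tls.key"
}

# advertise explicit addresses instead of whatever [::] resolves to
api_addr     = "https://vault-0.vault-internal:8200"
cluster_addr = "https://vault-0.vault-internal:8201"

A third, WAN-facing listener would follow the same pattern on the externally routed address; the point is that nothing is left to a wildcard bind.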
I ran into this issue when migrating from Consul to Integrated Storage. Adding cluster_addr = "http://vault-0.vault-internal:8201" to the migration configuration (migrate.hcl) solved the problem.
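For context, the configuration for vault operator migrate follows a storage_source / storage_destination layout; a minimal sketch along the lines of that fix might look like this (the Consul address and paths are assumptions):

storage_source "consul" {
  address = "consul-server:8500"
  path    = "vault/"
}

storage_destination "raft" {
  path = "/vault/data"
}

# pin the cluster address so the migrated node does not advertise a stale IP
cluster_addr = "http://vault-0.vault-internal:8201"

It is then run with vault operator migrate -config=migrate.hcl while the Vault servers are stopped.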
It seems this problem is still not solved; we are facing the same issue. After a crash recovery, the active node address was stuck at the old value, so Vault could not be started again.
After the nodes' IP addresses changed, all of the servers entered the follower state. Creating a peers.json as described in https://developer.hashicorp.com/vault/tutorials/raft/raft-lost-quorum fixed it for me.
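For reference, peers.json is a JSON array of the surviving voters dropped into the Raft data directory (e.g. /vault/data/raft/peers.json); the IDs and addresses below are placeholders modeled on the log output earlier in this thread:

[
  {
    "id": "943ce4ae-6fa0-5257-7a1b-f215388321ee",
    "address": "vault-eks-2.vault-eks-internal:8201",
    "non_voter": false
  },
  {
    "id": "<second-surviving-node-id>",
    "address": "vault-eks-0.vault-eks-internal:8201",
    "non_voter": false
  }
]

Each node rebuilds its Raft configuration from this file on the next start, and the file is deleted automatically after it is read.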
I have Vault installed from the Helm chart, running with integrated Raft storage. After a while the cluster lost its leader and couldn't re-elect one. vault status shows Active Node Address with a non-existent IP. It looks like the IP got stuck from an old pod and for some reason can't be updated. The cluster was completely unusable; I had to restore it from backup.
Vault server configuration file(s):