hashicorp/vault

New node cannot join current running cluster (failed to unseal core: error="stored unseal keys are supported, but none were found") #23918

Open · trevop opened 10 months ago

trevop commented 10 months ago

Describe the bug

Hi. I have a cluster of 3 nodes (initialized, unsealed, working, and holding a lot of secrets) with the Raft storage backend and auto-unseal. The auto-unseal type is GCP KMS.

I am trying to add 2 new nodes to the cluster (one by one). The config is delivered using Ansible and is the same across the cluster (except for addresses and other node-to-node params).

After the node's service is started, I receive the following error:

2023-10-31T15:31:55.144Z [INFO] core: stored unseal keys supported, attempting fetch
2023-10-31T15:31:55.144Z [WARN] failed to unseal core: error="stored unseal keys are supported, but none were found"

To Reproduce

Steps to reproduce the behavior:

  1. Have a working cluster with 3 nodes
  2. Install Vault 1.15.1 on the new node
  3. Start the vault service OR run vault server -config vault.hcl
  4. See error in logs
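
For reference, this is how I check the state of the new node and of the cluster while it is stuck (hostnames are from my setup; the new node here is hashicorp-vault-05):

# on the new node: shows the Initialized/Sealed state
VAULT_ADDR=http://hashicorp-vault-05.domain.dev:8200 vault status

# on an existing, unsealed node (with a valid token): shows raft membership
vault operator raft list-peers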

Expected behavior

A new node successfully joins the cluster and is unsealed with auto-unseal.

Environment:

Vault server configuration file(s):

# ---------------------------------------------------------------------------
# config general
# ---------------------------------------------------------------------------

ui                 = true
disable_mlock      = true
plugin_directory   = "/usr/local/lib/vault/plugins"

# ---------------------------------------------------------------------------
# config cluster
# ---------------------------------------------------------------------------

# Configure clustering.

api_addr = "http://hashicorp-vault-02.domain.dev:8200"
# The URL where cluster members can find the leader.
cluster_addr = "http://hashicorp-vault-02.domain.dev:8201"
# ---------------------------------------------------------------------------
# config gcpckms seal
# ---------------------------------------------------------------------------

seal "gcpckms" {
  credentials = "<path_to_creds.json>"
  project     = "vault-stage"
  region      = "eur6"
  key_ring    = "vault-stage-autounseal"
  crypto_key  = "gcp-auto-unseal-test"
}

# ---------------------------------------------------------------------------
# config listeners
# ---------------------------------------------------------------------------

listener "tcp" {
  address = "0.0.0.0:8200"
    cluster_address = "hashicorp-vault-02.domain.dev:8201"
    max_request_duration = "180s"
    proxy_protocol_behavior = "use_always"
    http_idle_timeout = "10m"
    tls_disable = "true"  
}

# ---------------------------------------------------------------------------
# config storage
# ---------------------------------------------------------------------------

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "hashicorp-vault-02"
  retry_join {
    leader_api_addr = "http://hashicorp-vault-03.domain.dev:8200"
  }
  retry_join {
    leader_api_addr = "http://hashicorp-vault-04.domain.dev:8200"
  }
  retry_join {
    leader_api_addr = "http://hashicorp-vault-02.domain.dev:8200"
  }    
}

# ---------------------------------------------------------------------------
# config Prometheus metrics
# ---------------------------------------------------------------------------

telemetry {
  disable_hostname = true
  prometheus_retention_time = "12h"
}
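
As an aside: with retry_join configured the join happens automatically, but the same join can also be issued manually from the new node, which is handy when testing. A minimal sketch, using hostnames from the config above:

# on the new node, point the CLI at the local listener
export VAULT_ADDR=http://127.0.0.1:8200

# ask the node to join via one of the existing members
vault operator raft join http://hashicorp-vault-03.domain.dev:8200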

Vault service systemd file:

[Unit]
Description=HashiCorp Vault
Requires=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/vault server -config "/etc/vault.d/vault.hcl" -log-level=trace
ExecReload=/bin/kill --signal HUP $MAINPID
KillSignal=SIGINT
User=vault
Group=vault
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
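
The unit is then installed and watched the usual way:

sudo systemctl daemon-reload
sudo systemctl enable --now vault
journalctl -u vault -f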

trevop commented 10 months ago

Logs from the service:

==> Vault server configuration:

Administrative Namespace: 
             Api Address: http://hashicorp-vault-05.domain.dev:8200
                     Cgo: disabled
         Cluster Address: https://hashicorp-vault-05.domain.dev:8201
   Environment Variables: BASH_FUNC_which%%, DEBUGINFOD_URLS, GODEBUG, HISTSIZE, HOME, HOSTNAME, LANG, LC_ALL, LC_CTYPE, LESSOPEN, LOGNAME, LS_COLORS, MAIL, OLDPWD, PATH, PWD, SHELL, SHLVL, SUDO_COMMAND, SUDO_GID, SUDO_UID, SUDO_USER, S_COLORS, TERM, USER, VAULT_ADDR, VAULT_CLIENT_TIMEOUT, _, which_declare
              Go Version: go1.21.3
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "hashicorp-vault-05.domain.dev:8201", max_request_duration: "3m0s", max_request_size: "33554432", tls: "disabled")
               Log Level: 
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: raft (HA available)
                 Version: Vault v1.15.1, built 2023-10-20T19:16:11Z
             Version Sha: b94e275f25ccd9011146d14c00ea9e49fd5032dc

==> Vault server started! Log data will stream in below:

2023-10-31T15:24:09.683Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2023-10-31T15:24:10.073Z [INFO]  incrementing seal generation: generation=1
2023-10-31T15:24:10.074Z [INFO]  core: Initializing version history cache for core
2023-10-31T15:24:10.074Z [INFO]  events: Starting event system
2023-10-31T15:24:10.074Z [INFO]  core: raft retry join initiated
2023-10-31T15:24:10.074Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-10-31T15:24:10.074Z [INFO]  core: security barrier not initialized
2023-10-31T15:24:10.074Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2023-10-31T15:24:10.075Z [INFO]  core: security barrier not initialized
2023-10-31T15:24:10.075Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://hashicorp-vault-03.domain.dev:8200
2023-10-31T15:24:10.075Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://hashicorp-vault-02.domain.dev:8200
2023-10-31T15:24:10.075Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://hashicorp-vault-04.domain.dev:8200
2023-10-31T15:24:10.146Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=<node_ip>:8201
2023-10-31T15:24:10.146Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=<node_ip>:8201
2023-10-31T15:24:10.147Z [INFO]  storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"hashicorp-vault-05\", NotifyCh:(chan<- bool)(0xc0034e3ea0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc002eba9f0), NoSnapshotRestoreOnStart:true, skipStartup:false}"
2023-10-31T15:24:10.148Z [INFO]  storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:hashicorp-vault-03 Address:hashicorp-vault-03.domain.dev:8201} {Suffrage:Voter ID:hashicorp-vault-04 Address:hashicorp-vault-04.domain.dev:8201} {Suffrage:Voter ID:hashicorp-vault-02 Address:hashicorp-vault-02.domain.dev:8201} {Suffrage:Nonvoter ID:hashicorp-vault-05 Address:follower-sec-hashicorp-vault-hotel-05.domain.dev:8201}]"
2023-10-31T15:24:10.148Z [INFO]  core: successfully joined the raft cluster: leader_addr=http://hashicorp-vault-04.domain.dev:8200
2023-10-31T15:24:10.148Z [INFO]  storage.raft: entering follower state: follower="Node at hashicorp-vault-05.domain.dev:8201 [Follower]" leader-address= leader-id=
2023-10-31T15:24:15.075Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-10-31T15:24:15.075Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2023-10-31T15:24:20.075Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-10-31T15:24:20.075Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2023-10-31T15:24:25.076Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-10-31T15:24:25.076Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2023-10-31T15:24:26.628Z [WARN]  storage.raft: heartbeat timeout reached, not part of a stable configuration or a non-voter, not triggering a leader election
2023-10-31T15:24:30.077Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-10-31T15:24:30.077Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"

and so on, repeating the failed to unseal core: error="stored unseal keys are supported, but none were found" error.

Fore-4 commented 9 months ago

Is there any news on how to solve this?

trevop commented 9 months ago

@ccapurso Hi! Is there any news on this bug? We are still unable to add nodes to the cluster.

thnee commented 8 months ago

If I understand correctly, failed to unseal core: error="stored unseal keys are supported, but none were found" means that it has joined the cluster successfully, but no data has been replicated from any active cluster node to the newly joined node, so it doesn't have what it needs to unseal itself, which leaves the node in a kind of inconsistent state.

I had this error, and it was simply because port 8201 was not opened in the security group. This is perhaps not the same root cause as OP, but I felt it does at least warrant a comment here.

I was scratching my head over this for a bit before I understood what the problem was, because the log messages are kind of confusing in this case, especially on the joiner node.

In this setup I have three server nodes which are in a cluster, and one master node which is not in a cluster. The master node has the transit engine enabled and is configured for auto-unseal. The server nodes are configured to auto-unseal using the transit engine set up on the master node.
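
For context, the transit side of this follows the usual pattern, roughly (the key name here is a placeholder, not my exact config):

# on the master node: enable the transit engine and create an unseal key
vault secrets enable transit
vault write -f transit/keys/autounseal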

vault-server-3 claims to have successfully joined the cluster and shows no error at all. It would be a lot better if it produced some kind of error during the cluster join process.

Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.839Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.60.2.1:8200
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.856Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=127.0.0.1:8201
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.856Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=10.60.2.3:8201
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.856Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=127.0.0.1:8201
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.856Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=10.60.2.3:8201
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.859Z [INFO]  storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemov>
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.860Z [INFO]  storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vault-server-1 Address:10.60.2.1:8201} {Suffrage:Nonvoter ID:vault-server-3 Address:10.60.2.3:8201}]"
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.860Z [INFO]  core: successfully joined the raft cluster: leader_addr=http://10.60.2.1:8200
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.864Z [INFO]  storage.raft: entering follower state: follower="Node at 10.60.2.3:8201 [Follower]" leader-address= leader-id=
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.925Z [INFO]  core: stored unseal keys supported, attempting fetch
Dec 15 13:20:33 vault-server-3 vault[1441]: 2023-12-15T13:20:33.925Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
Dec 15 13:20:38 vault-server-3 vault[1441]: 2023-12-15T13:20:38.925Z [INFO]  core: stored unseal keys supported, attempting fetch
Dec 15 13:20:38 vault-server-3 vault[1441]: 2023-12-15T13:20:38.927Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"

vault-server-1 also seems to be happy with the join. But at least it logs an i/o timeout when trying to replicate data, which is what helped me understand the problem.

Dec 15 13:20:33 vault-server-1 vault[843]: 2023-12-15T13:20:33.850Z [INFO]  storage.raft: updating configuration: command=AddNonvoter server-id=vault-server-3 server-addr=10.60.2.3:8201 servers="[{Suffrage:Voter ID:vault-server-1 Address:10.60.2.1:8201} {Suffrage:Nonvoter ID:vault-se>
Dec 15 13:20:33 vault-server-1 vault[843]: 2023-12-15T13:20:33.853Z [INFO]  storage.raft: added peer, starting replication: peer=vault-server-3
Dec 15 13:20:33 vault-server-1 vault[843]: 2023-12-15T13:20:33.853Z [INFO]  system: follower node answered the raft bootstrap challenge: follower_server_id=vault-server-3
Dec 15 13:20:43 vault-server-1 vault[843]: 2023-12-15T13:20:43.854Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter vault-server-3 10.60.2.3:8201}" error="dial tcp 10.60.2.3:8201: i/o timeout"
Dec 15 13:20:44 vault-server-1 vault[843]: 2023-12-15T13:20:44.572Z [ERROR] storage.raft: failed to heartbeat to: peer=10.60.2.3:8201 backoff time=10ms error="dial tcp 10.60.2.3:8201: i/o timeout"
Dec 15 13:20:53 vault-server-1 vault[843]: 2023-12-15T13:20:53.866Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter vault-server-3 10.60.2.3:8201}" error="dial tcp 10.60.2.3:8201: i/o timeout"
Dec 15 13:20:55 vault-server-1 vault[843]: 2023-12-15T13:20:55.380Z [ERROR] storage.raft: failed to heartbeat to: peer=10.60.2.3:8201 backoff time=10ms error="dial tcp 10.60.2.3:8201: i/o timeout"

Using Vault v1.15.4.
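
If anyone else hits this, a quick two-way connectivity check on the cluster port would have surfaced the problem immediately (IPs here are from my setup):

# from the joiner: can it reach the leader's cluster port?
nc -vz 10.60.2.1 8201

# from the leader: can it reach the joiner's cluster port?
# (raft replication is dialed from the leader to the follower)
nc -vz 10.60.2.3 8201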

trevop commented 8 months ago

Hi.

@thnee Thank you for your comment. However, it is not my case. I have tested with an any-any security group rule on all of the nodes and am still getting the same error :(

trevop commented 5 months ago

@ccapurso Hi! Is there any news?

upenderadepu-moe commented 3 months ago

Hi @ccapurso, is there any update on this?

SanduDS commented 1 month ago

Hi @ccapurso, any update? :( Still getting the same error with auto-unseal via Azure KMS.

e100 commented 2 weeks ago

I run Vault in k8s (v1.28.9) with internal Raft storage and use transit unseal. After some major network disruptions, multiple Vault clusters had only 3 of 5 or 4 of 5 pods in ready status.

I had the same error:

failed to unseal core: error="stored unseal keys are supported, but none were found"

On the leader I noticed it was logging "connection refused" when trying to connect to port 8201 on the node with the above-noted error, or it was reporting a DNS lookup failure. Sorry, I did not document the exact error.

I exec'd into that node and checked with netstat; it was not listening on port 8201.
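
Something like this reproduces that check (pod, namespace, and service names assume the standard Helm chart layout; adjust to taste):

# inside the broken pod: is the cluster port listening?
kubectl -n vault exec vault-2 -- netstat -tln | grep 8201

# from a healthy pod: can it resolve and reach the broken pod's cluster port?
kubectl -n vault exec vault-0 -- nc -vz vault-2.vault-internal 8201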

At the time I was running version 1.17.2. I ran across #24604, so I tried adding publishNotReadyAddresses: true to the vault service. That resolved the DNS lookup failures, but still did not result in the broken nodes properly joining and unsealing.

The k8s documentation does have a note related to DNS that states:

Also, the Pod needs to be ready in order to have a record unless publishNotReadyAddresses=True is set on the Service.
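
For anyone wanting to try the same thing, the field can be patched onto the headless service directly (the service name vault-internal assumes the standard Helm chart):

kubectl -n vault patch svc vault-internal \
  -p '{"spec":{"publishNotReadyAddresses":true}}'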

Next I added an init container with its command set to sleep 30. I then deleted the bad pod, exec'd into the init container, and deleted all the data (vault.db and the raft folder). At the same time I was logged into the Vault leader and issued vault operator raft remove-peer node-name-here before the 30 seconds were up.
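
Concretely, the steps were along these lines (the data path assumes the Helm chart default of /vault/data, and vault-2 stands in for whatever node_id the broken node uses):

# on the leader: drop the broken node from the raft configuration
vault operator raft remove-peer vault-2

# inside the init container of the recreated pod, before vault starts
rm -rf /vault/data/vault.db /vault/data/raft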

This did not change anything; I was still seeing the same unseal keys error.

Next I deployed version 1.17.3, but not all of the nodes had restarted; the leader was one of the ones still running 1.17.2, and the rollout stopped because the broken pod does not become ready. The broken pod, however, was now running 1.17.3. I repeated the same delete-pod, delete-data, and remove-peer steps during the init. Still no change.

I manually deleted all of the 1.17.2 pods, in order, so they would restart as 1.17.3. It was vault-2 that was the bad pod and where the rollout stopped, so I deleted vault-1 and then vault-0; if you tried deleting vault-0 first, it would restart as 1.17.2, not 1.17.3.
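
In kubectl terms (namespace assumed), that was just:

# delete in the same reverse-ordinal order the rollout uses,
# so each pod comes back on the new image
kubectl -n vault delete pod vault-1
kubectl -n vault delete pod vault-0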

Now that the leader and all the other pods were running 1.17.3, I repeated the same delete-pod, delete-data, and remove-peer steps during the init. This time the pod rejoined correctly.

After fixing this first cluster I did the same on the remaining three clusters: added publishNotReadyAddresses: true to the vault service, updated to 1.17.3, deleted pods in order where needed to ensure all were running 1.17.3, then deleted the broken pod and performed the delete-data and remove-peer steps during the init.

I think #27344 and #18004 may be related to this problem.

Hope this helps someone; delete data at your own risk. If I had not had quorum among the other nodes, I would not have deleted data on the bad one.