hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

raft storage: nodes join, but can't unseal #25360

Closed szechp closed 8 months ago

szechp commented 8 months ago

Describe the bug Vault cannot be unsealed on a High Availability (HA) setup using Raft storage. One of the nodes (vault-02) joins the Raft cluster successfully but remains sealed. Retrying the unseal returns an error about creating the cipher with an invalid key size.

To Reproduce Steps to reproduce the behavior:

  1. Run vault operator unseal with 3 keys on node-01, which unseals fine.
  2. Run vault operator unseal with 3 keys on node-02, which seems to unseal fine and adds the node:
    
    Node        Address                State       Voter
    ----        -------                -----       -----
    vault-01    node01:8201            leader      true
    vault-02    node02:8201            follower    false
  3. vault-02 nevertheless remains sealed.
  4. Re-run vault operator unseal with 3 keys on node-02, which returns this error: * failed to create cipher: crypto/aes: invalid key size 0

Expected behavior I want vault-02 to unseal.

Environment:

Vault server configuration file(s):

ui = true

disable_mlock = true

storage "raft" {
  retry_join {
    leader_api_addr         = "https://node2:8200"
    leader_ca_cert_file     = "/opt/vault/tls/vault-ca.pem"
    leader_client_cert_file = "/opt/vault/tls/vault-cert.pem"
    leader_client_key_file  = "/opt/vault/tls/vault-key.pem"
  }
  retry_join {
    leader_api_addr         = "https://node1:8200"
    leader_ca_cert_file     = "/opt/vault/tls/vault-ca.pem"
    leader_client_cert_file = "/opt/vault/tls/vault-cert.pem"
    leader_client_key_file  = "/opt/vault/tls/vault-key.pem"
  }
  retry_join {
    leader_api_addr         = "https://node3:8200"
    leader_ca_cert_file     = "/opt/vault/tls/vault-ca.pem"
    leader_client_cert_file = "/opt/vault/tls/vault-cert.pem"
    leader_client_key_file  = "/opt/vault/tls/vault-key.pem"
  }
  path    = "/opt/vault/data"
  node_id = "vault-02"
}

listener "tcp" {
  address            = "0.0.0.0:8200"
  tls_cert_file      = "/opt/vault/tls/vault-cert.pem"
  tls_key_file       = "/opt/vault/tls/vault-key.pem"
  tls_client_ca_file = "/opt/vault/tls/vault-ca.pem"
}

cluster_addr  = "https://node2:8201" 
api_addr      = "https://node2:8200"

# (this is the hcl of vault-02)

Additional context I'm using self-signed certs that I distribute across all nodes.

Additional logs:

Feb 12 10:05:07 vault-02 vault[19516]: error=
Feb 12 10:05:07 vault-02 vault[19516]: | error during raft bootstrap init call: Error making API request.
Feb 12 10:05:07 vault-02 vault[19516]: |
Feb 12 10:05:07 vault-02 vault[19516]: | URL: PUT https://node1:8200/v1/sys/storage/raft/bootstrap/challenge
Feb 12 10:05:07 vault-02 vault[19516]: | Code: 503. Errors:
Feb 12 10:05:07 vault-02 vault[19516]: |
Feb 12 10:05:07 vault-02 vault[19516]: | * Vault is sealed
Feb 12 10:05:07 vault-02 vault[19516]:
Feb 12 10:05:07 vault-02 vault[19516]: 2024-02-12T10:05:07.483+0100 [ERROR] core: failed to retry join raft cluster: retry=2s
Feb 12 10:05:07 vault-02 vault[19516]: err=
Feb 12 10:05:07 vault-02 vault[19516]: | failed to send answer to raft leader node: Error making API request.
Feb 12 10:05:07 vault-02 vault[19516]: |
Feb 12 10:05:07 vault-02 vault[19516]: | URL: PUT https://node3:8200/v1/sys/storage/raft/bootstrap/answer
Feb 12 10:05:07 vault-02 vault[19516]: | Code: 500. Errors:
Feb 12 10:05:07 vault-02 vault[19516]: |
Feb 12 10:05:07 vault-02 vault[19516]: | * Preventing server addition that would require removal of too many servers and cause cluster instability

hghaf099 commented 8 months ago

I am curious to learn about the autopilot behaviour. Would you please check out the tutorial in the link and post the autopilot configuration and state here?

szechp commented 8 months ago

vault operator raft autopilot get-config
Key                                   Value
---                                   -----
Cleanup Dead Servers                  false
Last Contact Threshold                10s
Dead Server Last Contact Threshold    24h0m0s
Server Stabilization Time             10s
Min Quorum                            0
Max Trailing Logs                     1000
Disable Upgrade Migration             false

vault operator raft autopilot state
Healthy:                         false
Failure Tolerance:               0
Leader:                          vault-01
Voters:
   vault-01
Servers:
   vault-01
      Name:              vault-01
      Address:           node1:8201
      Status:            leader
      Node Status:       alive
      Healthy:           true
      Last Contact:      0s
      Last Term:         4
      Last Index:        55
      Version:           1.14.9
      Node Type:         voter
   vault-02
      Name:              vault-02
      Address:           node2:8201
      Status:            non-voter
      Node Status:       alive
      Healthy:           false
      Last Contact:      49h4m44.00153693s
      Last Term:         0
      Last Index:        0
      Version:           1.14.9
      Node Type:         voter
   vault-03
      Name:              vault-03
      Address:           node3:8201
      Status:            non-voter
      Node Status:       alive
      Healthy:           false
      Last Contact:      49h4m25.821732653s
      Last Term:         0
      Last Index:        0
      Version:           1.14.9
      Node Type:         voter

The time since last contact lines up exactly with the first unseal of the nodes, which passed with no errors. So an unseal does happen, but the node somehow reseals and won't let me unseal afterwards.
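
For anyone comparing their own autopilot output against the numbers above, here is a minimal Python sketch that flags servers whose Last Contact exceeds the Last Contact Threshold (10s in the config posted here). The function names are hypothetical and not part of Vault, and the parser only handles the h, m, s and ms units that appear in this output, not every unit a Go duration can carry.

```python
import re

# Seconds per unit; only the units seen in the autopilot output are covered.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}
# "ms" is listed first so that "5ms" is not read as "5m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a duration string such as '49h4m44.0s' to seconds."""
    parts = _PART.findall(text)
    # Reject input that the pattern only partially matched.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


def stale_servers(last_contact: dict, threshold: str = "10s") -> list:
    """Return, sorted, the servers whose last contact exceeds the threshold."""
    limit = parse_duration(threshold)
    return sorted(
        name for name, value in last_contact.items()
        if parse_duration(value) > limit
    )
```

Fed the three Last Contact values from the state output above, `stale_servers` returns vault-02 and vault-03, the same two servers autopilot reports as unhealthy.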

szechp commented 8 months ago

Okay, I figured out the problem: I forgot to open TCP port 8201 in our firewall.
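
A quick way to rule this out on a similar setup is to test, from each node, whether the cluster port on the other nodes accepts TCP connections. This is a minimal Python sketch; the function names are made up for illustration, and the host names and port 8201 are taken from the cluster_addr values in this thread, so adjust them to your environment. It only shows that the port is reachable, not that TLS or Raft itself is working.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_peers(peers, timeout: float = 3.0) -> dict:
    """Map 'host:port' to a reachability flag for each (host, port) pair."""
    return {
        f"{host}:{port}": port_reachable(host, port, timeout)
        for host, port in peers
    }
```

For example, running `check_peers([("node1", 8201), ("node3", 8201)])` on node2 should report both entries as True once the firewall allows the cluster port.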

faiz-credotian commented 5 months ago

Hi,

I am also facing the same issue. On my side the ports are open, but it still shows me the same error.

I am using Vault version 1.16.2.