hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Vault (1.6.2|1.9.1) not showing the correct number of Total Shares after force restore #13416

Open JulianNeuhaus27 opened 2 years ago

JulianNeuhaus27 commented 2 years ago

Describe the bug When initialising an empty three-node Vault cluster from scratch with two key shares and a key threshold of two, the number of key shares is displayed incorrectly after a force restore from a snapshot of a Vault that was initialised with five key shares. The leader node displays the correct number of shares, but the followers report that they are set up with only two shares. After restarting all Vault nodes and unsealing them again, the correct number is displayed everywhere. I also found that after stepping down the leader, the former follower that previously showed the wrong number of shares displays the correct number once it becomes leader, and the former leader keeps displaying the correct number as well.
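Roughly how I observed this (a sketch; the addresses are placeholders matching the configuration below, and step-down assumes a token with sufficient permissions):

# Reported share count on each node
VAULT_ADDR="https://vault-0-<example.com>:8200" vault status | grep -E 'Total Shares|Threshold'
VAULT_ADDR="https://vault-1-<example.com>:8200" vault status | grep -E 'Total Shares|Threshold'
VAULT_ADDR="https://vault-2-<example.com>:8200" vault status | grep -E 'Total Shares|Threshold'

# Step down the current leader; the follower that takes over then reports the correct value
VAULT_ADDR="https://vault-0-<example.com>:8200" vault operator step-down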

To Reproduce Steps to reproduce the behavior (a rough CLI sketch follows the list):

  1. Create an empty Vault with the raft backend, two key shares and a threshold of two
  2. Restore a snapshot of a Vault that was initialised with five key shares (use -force)
  3. Unseal the Vault nodes with the unseal keys from the restored backup
  4. See the leader display the correct number of key shares, while the two followers display only two
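A rough CLI sketch of these steps (the snapshot file name is a placeholder; the snapshot is assumed to come from a cluster initialised with five shares):

# 1. Initialise the empty cluster with two shares and a threshold of two
vault operator init -key-shares=2 -key-threshold=2

# 2. Force-restore the snapshot taken from the five-share cluster
vault operator raft snapshot restore -force backup.snap

# 3. Unseal with a key from the restored backup (repeat until the threshold is reached)
vault operator unseal

# 4. Compare the reported share count on the leader and the followers
vault status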

Expected behavior The output of vault status should show the correct number of key shares on all nodes.
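For illustration, the kind of excerpt I would expect from vault status on every node after the restore (other fields omitted; threshold as configured on the snapshotted cluster):

Key            Value
---            -----
Seal Type      shamir
Total Shares   5

instead of the Total Shares of 2 currently reported by the followers.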

Environment:

Vault server configuration file(s):

api_addr      = "https://vault-lb-<example.com>"
cluster_addr  = "https://<0.0.0.0>:8201"
disable_mlock = true
pid_file = "/var/run/vault/vault.pid"
ui = true

listener "tcp" {
  address                  = "0.0.0.0:8200"
  cluster_address          = "<0.0.0.0>:8201"
  tls_cert_file            = "/data/etc/vault/ssl.cert"
  tls_key_file             = "/data/etc/vault/ssl.key"
  tls_disable_client_certs = true

  proxy_protocol_behavior = "use_always"
  proxy_protocol_authorized_addrs = "0.0.0.0/0"

  telemetry {
    unauthenticated_metrics_access = true
  }
}

storage "raft" {
  path    = "/data/vault/"
  node_id = "vault-0"

  retry_join {
    leader_api_addr = "https://vault-0-<example.com>:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-1-<example.com>:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-2-<example.com>:8200"
  }
}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}

Additional context All keys work to unseal after the restore, so my guess is that this is only a display problem rather than anything deeper in the inner workings of Vault.
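One way to double-check that only the reported status differs (a sketch assuming curl and jq are available; the addresses are placeholders): the n (shares) and t (threshold) fields of sys/seal-status disagree between the leader and the followers until a restart or step-down.

# Compare the reported share count (n) and threshold (t) across the nodes
for node in vault-0 vault-1 vault-2; do
  curl -s "https://${node}-<example.com>:8200/v1/sys/seal-status" | jq '{n, t, sealed}'
done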

mderriey commented 8 months ago

We're experiencing the same issue with Vault 1.15.4.

More info

Since we need an initialised and unsealed node to force-restore a snapshot onto, we thought we'd make it easier and use a single Shamir key share, whereas the cluster from which the snapshot was taken uses 3.

After force-restoring the snapshot, we experience a similar issue to the OP's, where the seal-status endpoint shows the Shamir settings from before the snapshot was restored (a single key with a threshold of 1), whereas we'd expect it to report the settings from the snapshot (3 keys with a threshold of 2).

Where our experience deviates from the OP's is that trying to unseal the node with the keys from the snapshotted cluster fails with the following error:

failed to setup unseal key: crypto/aes: invalid key size 33

We checked and can confirm that the length of Shamir keys does change depending on how many shares are specified during the node initialisation.
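A quick way to check this is to decode a key and count its bytes (a sketch; it assumes the hex-encoded unseal keys printed by vault operator init and the xxd tool). A single-share key should decode to the raw 32-byte key, while the 33 in the error above suggests that shares from a multi-share split carry an extra byte.

# Decode a hex unseal key and count its bytes
echo -n "<unseal-key-hex>" | xxd -r -p | wc -c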

This "forces" us to set up the node onto which the snapshot is restored with the same Shamir settings as the cluster from which the snapshot was taken.

It's not the end of the world, but it seems strange.

Is this a bug? Maybe there's an explanation for this behaviour?