hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Backup with transit seal method and revoked token silently fails #13130

Open laugmanuel opened 3 years ago

laugmanuel commented 3 years ago

Describe the bug

We use Raft as our storage backend. We also use transit sealing against a secondary Vault instance to provide auto-unsealing for our primary Vault, which is installed in Kubernetes. The token we use for that is created by an init container and is only valid for a few minutes. Until recently, this setup worked fine for us: the pods were unsealed automatically, and the backups were present and valid (they could be restored successfully).

Probably due to https://github.com/hashicorp/vault/pull/12388, this behaviour changed! Creating a backup using vault operator raft snapshot save <snapshot file> results in an error regarding the SHA256SUMS.sealed file. Using the API endpoint, we can download the snapshot without any error. In both cases the snapshot file is created and appears to contain data:

However, the backup cannot be restored, and Vault reports a Load error in the UI. Restoring using the CLI also fails. If I try to unpack the backup using gzip, I get unexpected end of file, so the backup file appears to be corrupted.

If I extend the lifetime of the unseal token, the backup is created and can be restored successfully! The docs do not mention that the transit token used in the Vault config/env variables must still be valid for a backup to succeed!

To Reproduce

Steps to reproduce the behavior:

  1. Set up a Vault with transit sealing
  2. Issue a new token on the Vault providing the Transit engine
  3. Unseal the new Vault using that token
  4. Wait for the token to be revoked / revoke it manually
  5. Create a backup using API or UI
  6. Try to restore that exact backup

Expected behavior

Either a valid backup file (one that can be extracted using gzip+tar and restored) should be created despite the warning about the SHA256SUMS.sealed file, or the creation of the backup should fail hard without any file being created.

If someone uses the API to create the backup but does not regularly test the restore, they have no way to see that the backup file is corrupted.

Also, the docs on Raft snapshots should mention that the seal configuration (including the token) must be valid for the backup to fully work.
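One way to notice a truncated snapshot without doing a full restore is to read the archive end to end. The sketch below assumes, as described above, that the snapshot is a gzip-compressed tar file. `check_snapshot_archive` is a hypothetical helper, not part of Vault, and passing it does not prove that a restore will succeed. It only catches the unexpected end of file case.

```python
import tarfile
import zlib


def check_snapshot_archive(path):
    """Read every member of a gzip+tar file.

    Returns (True, [member names]) if the whole archive is readable,
    or (False, "ErrorType: message") if it is truncated or not gzip+tar.
    Hypothetical helper for illustration; not part of Vault.
    """
    names = []
    try:
        with tarfile.open(path, mode="r:gz") as archive:
            for member in archive:
                if member.isfile():
                    stream = archive.extractfile(member)
                    # Read each member to its end so that a truncated
                    # compressed stream raises instead of going unnoticed.
                    while stream.read(1024 * 1024):
                        pass
                names.append(member.name)
    except (tarfile.TarError, EOFError, OSError, zlib.error) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, names
```

A backup job could call this right after saving the snapshot and alert when the first element of the result is False. Regular restore tests are still needed, since a readable archive can still fail to restore for other reasons.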

Environment:

Vault server configuration file(s):

ui = true
disable_mlock = true
log_level = "Info"
log_format = "json"

api_addr = "http://localhost:8200"
cluster_addr = "http://localhost:8201"

listener "tcp" {
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  tls_disable = 1
}

seal "transit" {
  token = "<token>" # this token is the problem
  key_name = "vault-transit"
  mount_path = "transit/"
  address = "https://transit-providing-vault:8200"
}

storage "raft" {
  path = "/data"

  retry_join {
    leader_api_addr = "http://localhost:8200"
  }
}

Additional context

The docs need a notice about the token used for transit. The docs and the how-to guides only say to create a new token and put it in the config/env variable. This would also break after the default lifetime of 32d:

heatherezell commented 3 years ago

Hi @laugmanuel - were you testing your snapshot restores previously? In #12388, the changes were made to expose broken seals that are resulting in unusable snapshots. Prior to the changes, the snapshot creation would appear to be successful, but the snapshots could not be restored. If you could let us know, I'd appreciate it. :)

laugmanuel commented 3 years ago

Hi @hsimon-hashicorp, yes, we did test the restores previously and they were successful. However, I do not remember whether this was tested with snapshots created manually or automatically shortly after unsealing the Vault, or with snapshots from the scheduled backup. I can try to reproduce this with a Vault version prior to the mentioned change and report back.

Nevertheless, the other points regarding docs and serving a broken backup through API and UI are still valid 😉

heatherezell commented 3 years ago

When a snapshot is initiated via the API, a success is returned immediately upon the snapshot starting to stream. The snapshot is not buffered on the server, because the size of the snapshot is unknown. So, the snapshot API request returns a "success", starts to stream, and then if at some point the seal isn't available, the snapshot will be broken. This is why testing restores is a critical part of any backup process. Additionally, https://github.com/hashicorp/vault/pull/13078 may help with this, to make detecting seal issues easier and faster. Let me know if this answers your questions about the API. I'll ask @taoism4504 for assistance re: docs.
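Because the API reports success before the stream is complete, a client can protect itself by treating the download as provisional until it has been checked. A minimal sketch, assuming the caller supplies the response body as an iterable of byte chunks and a `verify` callable of its own choosing (`save_verified`, `chunks`, and `verify` are hypothetical names, not Vault APIs):

```python
import os
import tempfile


def save_verified(chunks, dest_path, verify):
    """Write chunks to a temporary file and move it to dest_path only if
    verify(temp_path) returns True. Otherwise nothing is left on disk.

    Hypothetical helper: `chunks` is any iterable of bytes objects and
    `verify` is any callable that takes a file path and returns a bool.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        if not verify(tmp_path):
            raise ValueError("snapshot failed verification; nothing was saved")
        # Atomic rename within the same directory.
        os.replace(tmp_path, dest_path)
    finally:
        # Clean up after a failed stream or failed verification.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

The chunks could come from a streamed request to the Raft snapshot endpoint, and `verify` could be an attempt to read the archive end to end. This gives the client the "hard fail without any file being created" behaviour even though the server has already answered with a success status.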

laugmanuel commented 3 years ago

I've tested with Vault 1.8.5 and Vault 1.7.4 (which, according to the Changelog, does not contain the above fix). In both cases, the snapshot was valid and restorable while the token was valid, and was broken once the token had expired. So I guess the backups were broken with earlier versions after all.

For us, I fixed it temporarily by issuing a token with a relatively long lifetime (based on an approle which overrides the default ttl of 32d). I will experiment with periodic tokens for transit because the transit seal provider seems to have a refresh feature (disable_renewal = "false") for the token?! https://www.vaultproject.io/docs/configuration/seal/transit#disable_renewal
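To catch a seal token that is about to expire before it breaks a backup, the token's lookup data can be inspected on the Vault that provides the transit engine. The sketch below is only an illustration: it assumes the `data` object of a token lookup response carries `expire_time`, `ttl` (seconds remaining), `renewable`, and, for periodic tokens, `period`. `seal_token_status` and the one-hour threshold are hypothetical.

```python
def seal_token_status(lookup_data, min_ttl_seconds=3600):
    """Classify a token from the `data` object of a token lookup response.

    Assumed fields (verify against your Vault version): expire_time, ttl,
    renewable, period. Returns one of: "no-expiry", "periodic",
    "expiring-soon", "ok".
    """
    # An explicit null expire_time is taken to mean the token never expires.
    if "expire_time" in lookup_data and lookup_data["expire_time"] is None:
        return "no-expiry"
    # A renewable periodic token lives as long as something keeps renewing it,
    # for example the transit seal with renewal left enabled.
    if lookup_data.get("period") and lookup_data.get("renewable"):
        return "periodic"
    if int(lookup_data.get("ttl", 0)) < min_ttl_seconds:
        return "expiring-soon"
    return "ok"
```

A token like the short-lived one from the init container would be reported as "expiring-soon", which is the case that led to the broken snapshots in this issue.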

heatherezell commented 2 years ago

Hi @taoism4504 - we were discussing this today - this might be good to clarify and expand in the snapshot and restore documentation with regards to token longevity and not breaking snapshots. :)

laugmanuel commented 2 years ago

Hi @hsimon-hashicorp , what's the status on this? Using periodic tokens together with disable_renewal = "false" works fine for me; so does using a token with very long TTL. Just wondering if docs will be modified - otherwise we can close this.

bendem commented 1 year ago

We had this problem today: the token in the config for the auto-unseal had expired. We renewed the token, updated the config, and reloaded Vault (using kill -HUP), but the snapshot still failed with the same error until we actually restarted all our nodes. Is the transit token not reloaded on SIGHUP?

heatherezell commented 7 months ago

Pinging @schavis for docs update. Thanks @laugmanuel!

laugmanuel commented 3 months ago

What's the status here?