hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Unable to restore backup after seal migration #16294

Open mschultz-aofl opened 1 year ago

mschultz-aofl commented 1 year ago

Describe the bug

Migrating from an awskms seal to a Shamir seal works. However, a backup of the migrated Vault cluster cannot be restored, because Vault still looks for the KMS key.

To Reproduce

  1. Start with a running vault, using AWSKMS auto-unseal
  2. Follow the migration steps identified here: https://support.hashicorp.com/hc/en-us/articles/360002040848-Seal-Migration
  3. vault status - identify that Vault is unsealed, and it identifies as a shamir seal
  4. Restart vault
  5. vault status - identify that vault is sealed, with a shamir key
  6. Unseal vault using recovery keys - identify that vault is now unsealed, in the same state as 3
  7. Generate backup using: vault operator raft snapshot save /vault/file/shamir.snap
  8. Restore shamir.snap to a new Vault, following the SOP: https://learn.hashicorp.com/tutorials/vault/sop-restore
  9. When attempting to start Vault with the backup restored, notice this error in the logs, and vault immediately stops:

Error initializing core: cannot seal migrate from "awskms" to Shamir, no disabled seal in configuration
2022-07-13T22:13:38.539Z [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
2022-07-13T22:13:38.540Z [INFO] storage.raft.snapshot: reaping snapshot: path=/vault/file/raft/snapshots/3-105277-1657750403964

Expected behavior

The shamir.snap backup should restore as if the cluster had never been under AWSKMS.

Environment:

Vault server configuration file(s):

        disable_mlock = true
        ui = true
        listener "tcp" {
            tls_disable = 1
            address = "localhost:8200"
            cluster_address = "localhost:8201"
        }

        storage "raft" {
            path = "/vault/file"
        }
        cluster_addr = "http://127.0.0.1:8201"


mschultz-aofl commented 1 year ago

It's been almost two weeks. Has anyone been able to replicate this or identify it as a bug, or is there just something I'm missing?

stevendpclark commented 1 year ago

Hi @mschultz-aofl,

I've tried to reproduce your issue without success, using the OSS edition of Vault 1.10.0. I'm always able to restore the raft snapshot to a new cluster using just the Shamir keys to unlock it. I have even gone to the extent of disabling the AWS KMS key after the migration, without any issues.

That said, I was able to get the same error message, but at an earlier phase. If I stop Vault quickly after unsealing with the recovery keys in migration mode (step 2), remove the awskms seal configuration, and start Vault again, I do get the error that you mention: "Error initializing core: cannot seal migrate from "awskms" to Shamir, no disabled seal in configuration"

That is to be expected: the seal migration has not completed at the moment the recovery keys are provided to Vault with the -migrate flag. Vault is unlocked at that point, but the seal migration is only complete once you see the following log messages in the server log:

2022-07-27T13:05:47.553-0400 [INFO]  core: migrating from one auto-unseal to shamir: from=awskms
2022-07-27T13:05:47.985-0400 [INFO]  core: seal migration complete

After those messages are in the logs any raft backup shouldn't contain the auto-seal information anymore.
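
For an automated pipeline, one way to act on this is to poll the server log for the completion message instead of sleeping for a fixed time. The following is a minimal sketch, assuming the server's output is redirected to a file; the function name `wait_for_log_line` and the default timeout are illustrative and not part of Vault.

```shell
# Poll a log file until a fixed string appears, or give up after a timeout.
# Usage: wait_for_log_line <logfile> <string> [timeout_seconds]
# (hypothetical helper, not a Vault command)
wait_for_log_line() {
  logfile=$1
  needle=$2
  timeout=${3:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -F: match a fixed string, -q: no output; a missing file counts as "not yet"
    if grep -Fq -- "$needle" "$logfile" 2>/dev/null; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

A pipeline could then run something like `wait_for_log_line vault.log "core: seal migration complete" 120 || exit 1` before taking the snapshot, where `vault.log` stands in for wherever the server log is written.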

Could it be that the raft snapshot is somehow taken before the seal migration has completed? I'd be curious to know whether you encounter the same issue if you follow all the same steps, with the tweak of waiting for the above messages to appear in step 2 and removing the seal configuration between steps 3 and 4.

If you do, would it be possible to get more details about the environment you are running the migration in? For example, the number of nodes that are part of the raft cluster, which nodes you are running the commands on, etc.

mschultz-aofl commented 1 year ago

Hi @stevendpclark interesting thought. I had assumed it was an atomic command - that is, when the API returned control, the migration was completed. As this is happening automatically through our backup process, you might be on to something. I do not see the [INFO] core: seal migration complete in our logs, which implies that you're correct. I'll modify my tests to add a delay.

A bit more info - we're looking to automate the verification of our Vaults - we have a few, and if the backups become corrupted due to e.g., node failures/etc, we have a need to know for compliance/operational reasons.

Additionally, we have a DR (rather than compliance) need to convert these from KMS to Shamir, in case we lose access to the KMS keys or someone deletes them. So the general flow is:

  1. Restore backup
  2. Verify backup is <48 hours old
  3. Test integrity with KMS
  4. Convert to Shamir
  5. Take backup of (4)
  6. Restore backup to new server
  7. Test integrity with Shamir

Here is the full gitlab-CI yaml we're using. Note the lack of sleep/delay after unsealing but before taking the backup:

  image: vault:1.10.0
  script:
    - apk add aws-cli jq coreutils
    - export OBJECT="$(aws s3 ls $BUCKET | grep vault_k8s_$VAULT | sort | tail -n 1 | awk '{print $4}')"
    - export DT="$(aws s3 ls $BUCKET | grep vault_k8s_$VAULT | sort | tail -n 1 | awk '{print $1 $2}')" | sed 's/ /T/'
    - echo $DT
    - dtSec=$(date --date "$DT" +'%s')
    - taSec=$(date --date "48 hours ago" +'%s')
    - |
      echo "INFO: dtSec=$dtSec, taSec=$taSec" >&2
    - |
      [ $dtSec -lt $taSec ] && exit 255
    - aws --region us-west-2 s3 cp s3://$BUCKET/$OBJECT ./backup.snap
    - /usr/local/bin/docker-entrypoint.sh vault server -config=/vault/config/local.json > vault_log_0.txt 2>&1 &
    - sleep 3
    - INIT=$(vault operator init -key-shares=1 -key-threshold=1 -format=json)
    - vault operator unseal $(echo $INIT | jq -r .unseal_keys_b64[0])  
    - vault login $(echo $INIT | jq -r .root_token)  
    - vault operator raft snapshot restore -force backup.snap
    - sleep 10
    - killall vault
    - sleep 5
    - echo $VAULT_RECOVERY_CONFIG >> /vault/config/local.json
    - export AWS_ACCESS_KEY_ID=$KMS_AWS_ACCESS_KEY_ID
    - export AWS_SECRET_ACCESS_KEY=$KMS_AWS_SECRET_ACCESS_KEY
    - export VAULT_SEAL_TYPE=awskms
    - vault server -config=/vault/config/local.json > vault_log_1.txt 2>&1 &
    - sleep 3
    - vault login $VAULT_ROOT_TOKEN
    - vault kv list run/  #Verify integrity of restored backup
    - killall vault
    - sleep 5
    - echo seal \"awskms\" \{ >> /vault/config/local.json
    - echo disabled = \"true\" >> /vault/config/local.json
    - echo \} >> /vault/config/local.json
    - vault server -config=/vault/config/local.json  > vault_log_2.txt 2>&1 &
    - sleep 5
    - vault operator unseal -migrate $RECOVERY_KEY_0
    - vault operator unseal -migrate $RECOVERY_KEY_1
    - vault operator unseal -migrate $RECOVERY_KEY_2
    - vault kv list run/
    - vault status
    - killall vault
    - sleep 5
    - vault server -config=/vault/config/local.json  > vault_log_3.txt 2>&1 &
    - sleep 5
    - vault operator unseal $RECOVERY_KEY_0
    - vault operator unseal $RECOVERY_KEY_1
    - vault operator unseal $RECOVERY_KEY_2
    - vault kv list run/
    - vault status
    - vault operator raft snapshot save /vault/file/shamir.snap
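
Separately from the seal issue, the freshness check in this script looks fragile: the `export DT=... | sed` line runs `export` as the first stage of a pipeline, so in most shells the assignment happens in a subshell and `sed` receives no input. A minimal sketch of a more defensive version follows, assuming `aws s3 ls` prints the date and time as the first two fields of each line; the helper names `s3_line_to_timestamp` and `backup_is_fresh` are hypothetical.

```shell
# Convert one line of `aws s3 ls` output (date, time, size, key) into a
# "YYYY-MM-DDTHH:MM:SS" string by joining the first two fields with "T".
s3_line_to_timestamp() {
  awk '{print $1 "T" $2}'
}

# Succeed only if a backup taken at <backup_epoch> is at most <max_age>
# seconds older than <now_epoch>. An empty timestamp counts as a failure,
# so a parsing error cannot silently pass the check.
# Usage: backup_is_fresh <backup_epoch> <now_epoch> [max_age_seconds]
backup_is_fresh() {
  backup_epoch=$1
  now_epoch=$2
  max_age=${3:-172800}   # default: 48 hours
  if [ -z "$backup_epoch" ] || [ -z "$now_epoch" ]; then
    return 1
  fi
  [ $((now_epoch - backup_epoch)) -le "$max_age" ]
}
```

In the job this might be used as `DT=$(aws s3 ls "$BUCKET" | grep "vault_k8s_$VAULT" | sort | tail -n 1 | s3_line_to_timestamp)` followed by `backup_is_fresh "$(date --date "$DT" +%s)" "$(date +%s)" || exit 255`, assuming GNU date from the coreutils package the script already installs.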

To answer your questions, there is one node in the raft cluster, and we're performing the commands on it directly.

mschultz-aofl commented 1 year ago

@stevendpclark I figured it out. The seal was attempting to migrate from AWSKMS to AWSKMS, not to the Shamir seal. As you can see in my commands, VAULT_SEAL_TYPE=awskms was exported and never unset. I determined this from the following log line:

2022-07-28T18:18:09.139Z [WARN] core: entering seal migration mode; Vault will not automatically unseal even if using an autoseal: from_barrier_type=awskms to_barrier_type=awskms
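
A check along these lines could catch the same mistake in an automated pipeline. This is a sketch only: it assumes the server log contains an "entering seal migration mode" line with `from_barrier_type=` and `to_barrier_type=` fields in the format shown above, and the function name `check_migration_target` is hypothetical.

```shell
# Inspect a Vault server log for the most recent "entering seal migration
# mode" line and fail if the source and target barrier types are identical.
# Usage: check_migration_target <logfile>
# (hypothetical helper; relies on the log format quoted in this thread)
check_migration_target() {
  logfile=$1
  line=$(grep -F -- "entering seal migration mode" "$logfile" | tail -n 1)
  if [ -z "$line" ]; then
    echo "no seal migration line found in $logfile" >&2
    return 2
  fi
  # Pull the value after each key, up to the next space or end of line.
  from=$(printf '%s\n' "$line" | sed -n 's/.*from_barrier_type=\([^ ]*\).*/\1/p')
  to=$(printf '%s\n' "$line" | sed -n 's/.*to_barrier_type=\([^ ]*\).*/\1/p')
  if [ -z "$from" ] || [ -z "$to" ] || [ "$from" = "$to" ]; then
    echo "suspicious seal migration: from=$from to=$to" >&2
    return 1
  fi
  echo "seal migration: from=$from to=$to"
}
```

In the job above, that would presumably mean running `unset VAULT_SEAL_TYPE` before starting the server in migration mode, then `check_migration_target vault_log_2.txt || exit 1` before supplying the recovery keys.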

I think I skipped that line previously, due to the confusing output of the vault unseal command. Here are the logs for the migration process:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   true
Total Recovery Shares    5
Threshold                3
Unseal Progress          1/3
Unseal Nonce             9042f80d-4c8a-451d-5dce-914ddbed7287
Version                  1.10.5
Storage Type             raft
HA Enabled               true
$ vault operator unseal -migrate $RECOVERY_KEY_1
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   true
Total Recovery Shares    5
Threshold                3
Unseal Progress          2/3
Unseal Nonce             9042f80d-4c8a-451d-5dce-914ddbed7287
Version                  1.10.5
Storage Type             raft
HA Enabled               true
$ vault operator unseal -migrate $RECOVERY_KEY_2
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.10.5
Storage Type             raft
Cluster Name             vault-cluster-65ea0ed6
Cluster ID               8725fa79-cfdd-a142-2fbe-eb514a32607c
HA Enabled               true
HA Cluster               https://127.0.0.1:8201/
HA Mode                  active
Active Since             2022-07-28T18:18:40.294711242Z
Raft Committed Index     188116
Raft Applied Index       188116
$ vault kv list run/
Keys
----
backups_placeholder
$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.10.5
Storage Type             raft
Cluster Name             vault-cluster-65ea0ed6
Cluster ID               8725fa79-cfdd-a142-2fbe-eb514a32607c
HA Enabled               true
HA Cluster               https://127.0.0.1:8201/
HA Mode                  active
Active Since             2022-07-28T18:18:40.294711242Z
Raft Committed Index     188116
Raft Applied Index       188116
$ sleep 60
$ killall vault
$ sleep 10
$ vault server -config=/vault/config/local.json  > vault_log_3.txt 2>&1 &
$ sleep 30
$ vault operator unseal $RECOVERY_KEY_0
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   true
Total Recovery Shares    5
Threshold                3
Unseal Progress          1/3
Unseal Nonce             c064c643-b57d-91c3-c345-198d5811b75a
Version                  1.10.5
Storage Type             raft
HA Enabled               true
$ vault operator unseal $RECOVERY_KEY_1
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   true
Total Recovery Shares    5
Threshold                3
Unseal Progress          2/3
Unseal Nonce             c064c643-b57d-91c3-c345-198d5811b75a
Version                  1.10.5
Storage Type             raft
HA Enabled               true
$ vault operator unseal $RECOVERY_KEY_2
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.10.5
Storage Type             raft
Cluster Name             vault-cluster-65ea0ed6
Cluster ID               8725fa79-cfdd-a142-2fbe-eb514a32607c
HA Enabled               true
HA Cluster               https://127.0.0.1:8201/
HA Mode                  active
Active Since             2022-07-28T18:20:22.775415117Z
Raft Committed Index     188126
Raft Applied Index       188126

As shown above, even though it's using the KMS seal, it still unseals as if it's a Shamir seal. This seems like a bug: if Vault is using AWSKMS for the auto-unseal, a vault operator unseal command should return an error unless it's passed the -migrate flag, rather than continuing the unseal as if it were a different seal type.