hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
31.22k stars 4.22k forks source link

Error when unsealing restored raft snapshot: `failed to create cipher: crypto/aes: invalid key size 0` #28880

Open cottand opened 4 days ago

cottand commented 4 days ago

Describe the bug

We have a development cluster of Vault that is a few years old, with a 3-out-of-7 Shamir seal structure. After having migrated it to raft, I took a raft snapshot via the API, then proceeded to restore it in another 5-node cluster (also 3-out-of-7 Shamir seal structure). The API call to restore does succeed.

When trying to unseal the newly restored cluster, I get the following error after the last unseal key.

Error unsealing: Error making API request.

URL: PUT http://server1:8200/v1/sys/unseal
Code: 500. Errors:

* failed to create cipher: crypto/aes: invalid key size 0

I have managed to reproduce this for several snapshots of the old cluster, as well as when restoring the snapshots in a 3-node local cluster.

I have successfully restored snapshots of 'fresh' clusters with otherwise the exact same configuration, which leads me to think there is something wrong with the state of my original cluster.

To Reproduce I struggle to give instructions to reproduce this, because it only happens with snapshots of the original development cluster,

Steps to reproduce the behavior:

  1. Start a new vault cluster (3-out-of-7 Shamir seal)
  2. Initialise
  3. Unseal
  4. valut operator raft snapshot restore -force original-cluster.backup
  5. Try to unseal

Expected behavior Unsealing the restored cluster to just succeed

Environment:

Additional context Most interestingly, I can unseal the same snapshot on 1-node raft clusters. I also managed to restore the snapshot on a new 5-node cluster by:

This last bit above is what makes me think that this is a bug, as my snapshot can be restored on 1-node clusters, and rekeying it does allow me to restore it on 5-node clusters.

We would like to migrate to raft in prod too, but first we need to be confident that we will be able to restore production snapshots. Unfortunately that is much harder for us to test, as we would rather not restore the production snapshots if we can help it. Because of this we are keen to understand why snapshots from the original cluster cannot be restored easily.

mpalmi commented 3 days ago

Thanks for the report, @cottand! The error in question appears to be coming from this call returning nil, which could be happening for a number of reasons.

Would you mind sharing your current server configuration (with any PII obscured, of course)? This will be helpful in reproducing the behavior you’re seeing.

Aside from that, there should be more detailed errors in Vault’s operational logs. Would you mind checking your operational logs for ERROR-level lines that seem relevant? Sorry for not having an immediate answer, but I’m confident we can get to the bottom of this!

cottand commented 3 days ago

Could you list some of the reasons why the call could be returning nil?

In the meantime, here is the Vault config of the cluster the original snapshot is from, which is itself almost identical to the config of the cluster I am restoring into:

storage "raft" {
    path = "/vault/raft-db"
    performance_multiplier = 1

    retry_join {
        auto_join = "provider=aws region=eu-west-1 tag_key=role tag_value=_______"
        auto_join_port = 8300
        leader_ca_cert_file = "/etc/ssl/certs/root-ca.pem"
        leader_client_cert_file = "/etc/certs/vault/cert.pem"
        leader_client_key_file = "/etc/certs/vault/key.pem"
    }
}

listener "tcp" {
    address     = "0.0.0.0:8200"
    tls_disable = 1
}

listener "tcp" {
    address     = "0.0.0.0:8300"
    tls_cert_file = "/etc/certs/vault/cert.pem"
    tls_key_file  = "/etc/certs/vault/key.pem"
}

api_addr = "https://_________________:8300"
cluster_addr = "https://_________________:8301"

plugin_directory = "/opt/vault/plugins"

disable_mlock = true
ui = true

telemetry {
  statsd_address = "127.0.0.1:9125"
}
cottand commented 3 days ago

Here are some logs from a local test cluster when performing the 3rd unseal key input. Note these are interleaved journalctl logs of several servers (server1, server2, server3, on the left column).

This is an output from a scripted environment, where unseal commands are run directly in each node (which is why you see some client output too)

server1 # [   37.989086] vault[832]: 2024-11-12T09:28:12.277Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   38.590723] vault[832]: 2024-11-12T09:28:12.878Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server2 # [   38.353526] vault[836]: 2024-11-12T09:28:12.637Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server2 # [   38.378866] vault[836]: 2024-11-12T09:28:12.663Z [INFO]  storage.raft: copied to local snapshot: bytes=52739096
server2 # [   38.380883] vault[836]: 2024-11-12T09:28:12.665Z [INFO]  storage.raft.fsm: installing snapshot to FSM
server2 # [   38.384020] vault[836]: 2024-11-12T09:28:12.667Z [INFO]  storage.raft.fsm: snapshot installed
server2 # [   38.386350] vault[836]: 2024-11-12T09:28:12.671Z [INFO]  storage.raft: snapshot restore progress: id=3-5923866-1731403689926 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server2 # [   38.389233] vault[836]: 2024-11-12T09:28:12.674Z [INFO]  storage.raft: Installed remote snapshot
server3 # [   38.065012] vault[819]: 2024-11-12T09:28:13.356Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server3 # [   38.086935] vault[819]: 2024-11-12T09:28:13.378Z [INFO]  storage.raft: copied to local snapshot: bytes=52739096
server3 # [   38.087722] vault[819]: 2024-11-12T09:28:13.379Z [INFO]  storage.raft.fsm: installing snapshot to FSM
server3 # [   38.088474] vault[819]: 2024-11-12T09:28:13.380Z [INFO]  storage.raft.fsm: snapshot installed
server3 # [   38.091062] vault[819]: 2024-11-12T09:28:13.383Z [INFO]  storage.raft: snapshot restore progress: id=3-5923866-1731403690574 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server3 # [   38.091761] vault[819]: 2024-11-12T09:28:13.383Z [INFO]  storage.raft: Installed remote snapshot
server2 # [   38.483306] vault[836]: 2024-11-12T09:28:12.768Z [WARN]  storage.raft: skipping application of old log: index=5923866
server1 # [   38.873730] vault[832]: 2024-11-12T09:28:13.162Z [INFO]  storage.raft: pipelining replication: peer="{Nonvoter server2 server2:8201}"
server3 # [   38.188332] vault[819]: 2024-11-12T09:28:13.480Z [WARN]  storage.raft: skipping application of old log: index=5923866
server1 # [   38.970451] vault[832]: 2024-11-12T09:28:13.258Z [INFO]  storage.raft: pipelining replication: peer="{Nonvoter server3 server3:8201}"
server1 # [   38.992499] vault[832]: 2024-11-12T09:28:13.280Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   39.598119] vault[832]: 2024-11-12T09:28:13.886Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   40.333612] vault[832]: 2024-11-12T09:28:14.621Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   40.880143] vault[832]: 2024-11-12T09:28:15.167Z [INFO]  core: running post snapshot restore invalidations
server1 # [   40.880789] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: failed to perform key upgrades, sealing: error="error reloading root key: error reloading root key: failed to read root key path: no decryption key available for term 2"
server1 # [   40.882057] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: marked as sealed
server1 # [   40.882659] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: pre-seal teardown starting
server1 # [   40.883349] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: stopping raft active node
server1 # [   40.884039] vault[832]: 2024-11-12T09:28:15.168Z [INFO]  core: pre-seal teardown complete
server1 # [   40.885256] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] core: unlocking HA lock failed: error="cannot find peer"
server1 # [   40.886054] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] storage.raft.raft-net: failed to accept connection: error="Raft RPC layer closed"
server1 # [   40.886713] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  storage.raft: aborting pipeline replication: peer="{Nonvoter server3 server3:8201}"
server1 # [   40.887256] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  storage.raft: aborting pipeline replication: peer="{Nonvoter server2 server2:8201}"
server1 # [   40.887788] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  core: stopping cluster listeners
server1 # [   40.888524] vault[832]: 2024-11-12T09:28:15.173Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
server3 # [   40.271642] vault[819]: 2024-11-12T09:28:15.562Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server1 # [   41.296277] vault[832]: 2024-11-12T09:28:15.583Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [   41.297096] vault[832]: 2024-11-12T09:28:15.584Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
server1 # [   41.297609] vault[832]: 2024-11-12T09:28:15.584Z [INFO]  core: cluster listeners successfully shut down
server1 # [   41.300927] vault[832]: 2024-11-12T09:28:15.586Z [INFO]  core: vault is sealed
(finished: must succeed: vault operator unseal _________________________________________, in 3.65 seconds)
server1: must succeed: vault operator unseal _________________________________________
(finished: must succeed: vault operator unseal _________________________________________, in 0.11 seconds)
server1: must succeed: vault operator unseal _________________________________________
server1 # Error unsealing: Error making API request.
server1 #
server1 # URL: PUT http://server1:8200/v1/sys/unseal
server1 # Code: 500. Errors:
server1 #
server1 # * failed to create cipher: crypto/aes: invalid key size 0                  ## <<<<<<<<<<<< the API-level error
server1: output:
command `vault operator unseal ____________________________________` failed (exit code 2)
>>> server2 # [   46.613746] vault[836]: 2024-11-12T09:28:20.897Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   47.069448] vault[819]: 2024-11-12T09:28:22.360Z [WARN]  storage.raft: not part of stable configuration, aborting election
server3 # [   50.272312] vault[819]: 2024-11-12T09:28:25.563Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   52.214088] vault[836]: 2024-11-12T09:28:26.498Z [WARN]  storage.raft: not part of stable configuration, aborting election
server2 # [   56.615193] vault[836]: 2024-11-12T09:28:30.898Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   60.273123] vault[819]: 2024-11-12T09:28:35.564Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   66.615611] vault[836]: 2024-11-12T09:28:40.899Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   70.273883] vault[819]: 2024-11-12T09:28:45.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   76.615875] vault[836]: 2024-11-12T09:28:50.900Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   80.274173] vault[819]: 2024-11-12T09:28:55.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   86.617118] vault[836]: 2024-11-12T09:29:00.901Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [   90.275506] vault[819]: 2024-11-12T09:29:05.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [   96.617767] vault[836]: 2024-11-12T09:29:10.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  100.275955] vault[819]: 2024-11-12T09:29:15.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  106.618096] vault[836]: 2024-11-12T09:29:20.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  110.275971] vault[819]: 2024-11-12T09:29:25.567Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  116.618969] vault[836]: 2024-11-12T09:29:30.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  120.277206] vault[819]: 2024-11-12T09:29:35.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  126.619124] vault[836]: 2024-11-12T09:29:40.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  130.277792] vault[819]: 2024-11-12T09:29:45.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
>>> server2 # [  136.619812] vault[836]: 2024-11-12T09:29:50.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  140.278102] vault[819]: 2024-11-12T09:29:55.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [  146.619942] vault[836]: 2024-11-12T09:30:00.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [  150.278923] vault[819]: 2024-11-12T09:30:05.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"