cottand opened this issue 4 days ago
Thanks for the report, @cottand! The error in question appears to be coming from this call returning nil, which could be happening for a number of reasons.
Would you mind sharing your current server configuration (with any PII obscured, of course)? This will be helpful in reproducing the behavior you’re seeing.
Aside from that, there should be more detailed errors in Vault’s operational logs. Would you mind checking your operational logs for ERROR-level lines that seem relevant? Sorry for not having an immediate answer, but I’m confident we can get to the bottom of this!
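For example, if Vault runs under systemd, something like the following should surface the relevant lines (the unit name and time window are assumptions about your setup, so adjust as needed):
# Assumes a systemd unit named "vault"; adjust the unit name and --since window.
journalctl -u vault --since "1 hour ago" --no-pager | grep '\[ERROR\]'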
Could you list some of the reasons why the call could be returning nil?
In the meantime, here is the Vault config of the cluster the original snapshot is from, which is itself almost identical to the config of the cluster I am restoring into:
storage "raft" {
path = "/vault/raft-db"
performance_multiplier = 1
retry_join {
auto_join = "provider=aws region=eu-west-1 tag_key=role tag_value=_______"
auto_join_port = 8300
leader_ca_cert_file = "/etc/ssl/certs/root-ca.pem"
leader_client_cert_file = "/etc/certs/vault/cert.pem"
leader_client_key_file = "/etc/certs/vault/key.pem"
}
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}
listener "tcp" {
address = "0.0.0.0:8300"
tls_cert_file = "/etc/certs/vault/cert.pem"
tls_key_file = "/etc/certs/vault/key.pem"
}
api_addr = "https://_________________:8300"
cluster_addr = "https://_________________:8301"
plugin_directory = "/opt/vault/plugins"
disable_mlock = true
ui = true
telemetry {
statsd_address = "127.0.0.1:9125"
}
Here are some logs from a local test cluster while entering the 3rd unseal key. Note these are interleaved journalctl logs from several servers (server1, server2, server3, in the left column).
This is output from a scripted environment where the unseal commands are run directly on each node (which is why you also see some client output).
server1 # [ 37.989086] vault[832]: 2024-11-12T09:28:12.277Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [ 38.590723] vault[832]: 2024-11-12T09:28:12.878Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server2 # [ 38.353526] vault[836]: 2024-11-12T09:28:12.637Z [INFO] storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server2 # [ 38.378866] vault[836]: 2024-11-12T09:28:12.663Z [INFO] storage.raft: copied to local snapshot: bytes=52739096
server2 # [ 38.380883] vault[836]: 2024-11-12T09:28:12.665Z [INFO] storage.raft.fsm: installing snapshot to FSM
server2 # [ 38.384020] vault[836]: 2024-11-12T09:28:12.667Z [INFO] storage.raft.fsm: snapshot installed
server2 # [ 38.386350] vault[836]: 2024-11-12T09:28:12.671Z [INFO] storage.raft: snapshot restore progress: id=3-5923866-1731403689926 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server2 # [ 38.389233] vault[836]: 2024-11-12T09:28:12.674Z [INFO] storage.raft: Installed remote snapshot
server3 # [ 38.065012] vault[819]: 2024-11-12T09:28:13.356Z [INFO] storage.raft: snapshot network transfer progress: read-bytes=52739096 percent-complete="100.00%"
server3 # [ 38.086935] vault[819]: 2024-11-12T09:28:13.378Z [INFO] storage.raft: copied to local snapshot: bytes=52739096
server3 # [ 38.087722] vault[819]: 2024-11-12T09:28:13.379Z [INFO] storage.raft.fsm: installing snapshot to FSM
server3 # [ 38.088474] vault[819]: 2024-11-12T09:28:13.380Z [INFO] storage.raft.fsm: snapshot installed
server3 # [ 38.091062] vault[819]: 2024-11-12T09:28:13.383Z [INFO] storage.raft: snapshot restore progress: id=3-5923866-1731403690574 last-index=5923866 last-term=3 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
server3 # [ 38.091761] vault[819]: 2024-11-12T09:28:13.383Z [INFO] storage.raft: Installed remote snapshot
server2 # [ 38.483306] vault[836]: 2024-11-12T09:28:12.768Z [WARN] storage.raft: skipping application of old log: index=5923866
server1 # [ 38.873730] vault[832]: 2024-11-12T09:28:13.162Z [INFO] storage.raft: pipelining replication: peer="{Nonvoter server2 server2:8201}"
server3 # [ 38.188332] vault[819]: 2024-11-12T09:28:13.480Z [WARN] storage.raft: skipping application of old log: index=5923866
server1 # [ 38.970451] vault[832]: 2024-11-12T09:28:13.258Z [INFO] storage.raft: pipelining replication: peer="{Nonvoter server3 server3:8201}"
server1 # [ 38.992499] vault[832]: 2024-11-12T09:28:13.280Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [ 39.598119] vault[832]: 2024-11-12T09:28:13.886Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [ 40.333612] vault[832]: 2024-11-12T09:28:14.621Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [ 40.880143] vault[832]: 2024-11-12T09:28:15.167Z [INFO] core: running post snapshot restore invalidations
server1 # [ 40.880789] vault[832]: 2024-11-12T09:28:15.168Z [INFO] core: failed to perform key upgrades, sealing: error="error reloading root key: error reloading root key: failed to read root key path: no decryption key available for term 2"
server1 # [ 40.882057] vault[832]: 2024-11-12T09:28:15.168Z [INFO] core: marked as sealed
server1 # [ 40.882659] vault[832]: 2024-11-12T09:28:15.168Z [INFO] core: pre-seal teardown starting
server1 # [ 40.883349] vault[832]: 2024-11-12T09:28:15.168Z [INFO] core: stopping raft active node
server1 # [ 40.884039] vault[832]: 2024-11-12T09:28:15.168Z [INFO] core: pre-seal teardown complete
server1 # [ 40.885256] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] core: unlocking HA lock failed: error="cannot find peer"
server1 # [ 40.886054] vault[832]: 2024-11-12T09:28:15.173Z [ERROR] storage.raft.raft-net: failed to accept connection: error="Raft RPC layer closed"
server1 # [ 40.886713] vault[832]: 2024-11-12T09:28:15.173Z [INFO] storage.raft: aborting pipeline replication: peer="{Nonvoter server3 server3:8201}"
server1 # [ 40.887256] vault[832]: 2024-11-12T09:28:15.173Z [INFO] storage.raft: aborting pipeline replication: peer="{Nonvoter server2 server2:8201}"
server1 # [ 40.887788] vault[832]: 2024-11-12T09:28:15.173Z [INFO] core: stopping cluster listeners
server1 # [ 40.888524] vault[832]: 2024-11-12T09:28:15.173Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
server3 # [ 40.271642] vault[819]: 2024-11-12T09:28:15.562Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server1 # [ 41.296277] vault[832]: 2024-11-12T09:28:15.583Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
server1 # [ 41.297096] vault[832]: 2024-11-12T09:28:15.584Z [INFO] core.cluster-listener: rpc listeners successfully shut down
server1 # [ 41.297609] vault[832]: 2024-11-12T09:28:15.584Z [INFO] core: cluster listeners successfully shut down
server1 # [ 41.300927] vault[832]: 2024-11-12T09:28:15.586Z [INFO] core: vault is sealed
(finished: must succeed: vault operator unseal _________________________________________, in 3.65 seconds)
server1: must succeed: vault operator unseal _________________________________________
(finished: must succeed: vault operator unseal _________________________________________, in 0.11 seconds)
server1: must succeed: vault operator unseal _________________________________________
server1 # Error unsealing: Error making API request.
server1 #
server1 # URL: PUT http://server1:8200/v1/sys/unseal
server1 # Code: 500. Errors:
server1 #
server1 # * failed to create cipher: crypto/aes: invalid key size 0 ## <<<<<<<<<<<< the API-level error
server1: output:
command `vault operator unseal ____________________________________` failed (exit code 2)
server2 # [ 46.613746] vault[836]: 2024-11-12T09:28:20.897Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 47.069448] vault[819]: 2024-11-12T09:28:22.360Z [WARN] storage.raft: not part of stable configuration, aborting election
server3 # [ 50.272312] vault[819]: 2024-11-12T09:28:25.563Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 52.214088] vault[836]: 2024-11-12T09:28:26.498Z [WARN] storage.raft: not part of stable configuration, aborting election
server2 # [ 56.615193] vault[836]: 2024-11-12T09:28:30.898Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 60.273123] vault[819]: 2024-11-12T09:28:35.564Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 66.615611] vault[836]: 2024-11-12T09:28:40.899Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 70.273883] vault[819]: 2024-11-12T09:28:45.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 76.615875] vault[836]: 2024-11-12T09:28:50.900Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 80.274173] vault[819]: 2024-11-12T09:28:55.565Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 86.617118] vault[836]: 2024-11-12T09:29:00.901Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 90.275506] vault[819]: 2024-11-12T09:29:05.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 96.617767] vault[836]: 2024-11-12T09:29:10.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 100.275955] vault[819]: 2024-11-12T09:29:15.566Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 106.618096] vault[836]: 2024-11-12T09:29:20.902Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 110.275971] vault[819]: 2024-11-12T09:29:25.567Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 116.618969] vault[836]: 2024-11-12T09:29:30.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 120.277206] vault[819]: 2024-11-12T09:29:35.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 126.619124] vault[836]: 2024-11-12T09:29:40.903Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 130.277792] vault[819]: 2024-11-12T09:29:45.568Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 136.619812] vault[836]: 2024-11-12T09:29:50.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 140.278102] vault[819]: 2024-11-12T09:29:55.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server2 # [ 146.619942] vault[836]: 2024-11-12T09:30:00.904Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
server3 # [ 150.278923] vault[819]: 2024-11-12T09:30:05.569Z [ERROR] core: raft tls periodic upgrade check failed: error="no decryption key available for term 2"
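For reference, the scripted unseal step in the transcript above boils down to running roughly the following on each node (a simplified sketch; the variable names are illustrative and the real keys are injected by our test harness):
# Simplified sketch of the per-node unseal step run by the test harness.
# UNSEAL_KEY_* are placeholders for the redacted keys above.
for key in "$UNSEAL_KEY_1" "$UNSEAL_KEY_2" "$UNSEAL_KEY_3"; do
  vault operator unseal "$key"    # the harness requires each invocation to succeed
done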
Describe the bug
We have a development Vault cluster that is a few years old, with a 3-out-of-7 Shamir seal. After migrating it to raft storage, I took a raft snapshot via the API and then restored it into another 5-node cluster (also with a 3-out-of-7 Shamir seal). The restore API call does succeed.
When trying to unseal the newly restored cluster, I get the error shown in the logs above after entering the last unseal key: failed to create cipher: crypto/aes: invalid key size 0.
I have managed to reproduce this for several snapshots of the old cluster, as well as when restoring the snapshots in a 3-node local cluster.
I have successfully restored snapshots of 'fresh' clusters with otherwise the exact same configuration, which leads me to think there is something wrong with the state of my original cluster.
To Reproduce
I struggle to give reproduction instructions, because this only happens with snapshots of the original development cluster.
Steps to reproduce the behavior:
vault operator raft snapshot restore -force original-cluster.backup
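For context, original-cluster.backup was produced from the source cluster roughly as follows (the token and address are placeholders; the CLI subcommand and the sys/storage/raft/snapshot endpoint should be interchangeable here):
# Taking the snapshot from the source cluster (placeholder token/address):
vault operator raft snapshot save original-cluster.backup
# or, via the HTTP API:
curl --header "X-Vault-Token: $VAULT_TOKEN" \
     "$VAULT_ADDR/v1/sys/storage/raft/snapshot" -o original-cluster.backup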
Expected behavior
Unsealing the restored cluster should just succeed.
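In other words, after entering the threshold number of keys the node should report itself unsealed, roughly:
# Expected flow on each node of the restored cluster (keys redacted):
vault operator unseal <unseal-key-1>
vault operator unseal <unseal-key-2>
vault operator unseal <unseal-key-3>
vault status    # should now report "Sealed    false"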
Environment:
Vault Server Version (retrieve with vault status): 1.17.6
Vault CLI Version (retrieve with vault version): 1.17.6
Additional context
Most interestingly, I can unseal the same snapshot on 1-node raft clusters. I also managed to restore the snapshot on a new 5-node cluster by running:
vault operator rekey
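(The rekey flow itself is roughly the following; the share and threshold counts are illustrative rather than our exact values.)
# Start a rekey on the unsealed source cluster; note the nonce it prints.
vault operator rekey -init -key-shares=7 -key-threshold=3
# Supply existing unseal keys (repeat until the current threshold is met):
vault operator rekey -nonce=<nonce-from-init> <existing-unseal-key>
# Once enough keys are supplied, the new unseal keys are printed.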
This last bit above is what makes me think that this is a bug, as my snapshot can be restored on 1-node clusters, and rekeying it does allow me to restore it on 5-node clusters.
We would like to migrate to raft in prod too, but first we need to be confident that we will be able to restore production snapshots. Unfortunately that is much harder for us to test, as we would rather not restore the production snapshots if we can help it. Because of this we are keen to understand why snapshots from the original cluster cannot be restored easily.