gnugnug opened 3 years ago
I am having the same issue :-( Versions: Vault v1.8.2 and v1.9.0
I have found a workaround; it is not ideal, but it works.
After doing the migration on the 1st node, I restart it twice. I can see `writing raft TLS keyring to storage` only on the first restart, maybe that is related.
Then I bring up the other nodes one by one. The first time, each of them fails with the same errors about TLS, so I delete everything in the raft data dir, remove the node from the cluster with `remove-peer`, and restart the pod. Once that is done, it joins the cluster without problems.
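The per-node workaround above can be sketched as a small helper; the StatefulSet pod names and the raft data path are assumptions, adjust them to your deployment:

```shell
# Sketch of the workaround, assuming a Kubernetes StatefulSet and raft data
# under /vault/data/raft — both are deployment-specific assumptions.
reset_failed_follower() {
  local node="$1"
  kubectl exec "$node" -- rm -rf /vault/data/raft/  # wipe the stale raft state
  vault operator raft remove-peer "$node"           # run against the leader
  kubectl delete pod "$node"                        # restart; it re-joins cleanly
}
# Usage: reset_failed_follower vault-1
```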
I had this issue and I found a fix that I believe is safe. After migrating the data from Consul, the Vault node address is not correct: I was able to add nodes to the cluster, but as soon as one node was restarted the cluster was lost and couldn't recover. This is probably caused by the previous configuration using a Consul agent sidecar on 127.0.0.1 to reach the backend.
```
vault operator raft list-peers
Node     Address         State   Voter
----     -------         -----   -----
vault-0  127.0.0.1:8201  leader  true
```
To fix this I scaled the StatefulSet to 1 node, and then used the instructions from the article "How to recover from permanently lost quorum while using Raft integrated storage with Vault" with the following file:
```json
[
  {
    "id": "vault-0",
    "address": "vault-0.vault-internal:8201",
    "non_voter": false
  }
]
```
Restarting the pod updates the internal address to the correct one (confirmation is in the pod logs). After that I was able to add more nodes to the cluster and restart any of them (including vault-0), and the raft cluster always recovered.
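For reference, placing that recovery file could look like this; `VAULT_RAFT_DIR` is an assumption (it defaults to the current directory here purely for illustration — point it at your actual raft data directory):

```shell
# VAULT_RAFT_DIR is an assumption; set it to your raft data directory.
# It defaults to the current directory here just for illustration.
VAULT_RAFT_DIR="${VAULT_RAFT_DIR:-.}"
cat > "$VAULT_RAFT_DIR/peers.json" <<'EOF'
[
  {
    "id": "vault-0",
    "address": "vault-0.vault-internal:8201",
    "non_voter": false
  }
]
EOF
```

On the next startup Vault reads `peers.json`, rewrites its raft configuration to exactly these peers, and then deletes the file.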
Hope it helps; I'm not sure whether this behaviour is by design due to the restore, or something is not working as it should.
Migrating on a single node, with at least one restart before any join is attempted, makes the most sense; otherwise there are too many changes being attempted at any given time.
@gnugnug - I'm curious if you've retested this flow in the most recent versions and if it's still applicable for you?
Environment:
Background: We have a Vault instance using the file storage backend. We want to migrate it to raft integrated storage. Therefore we performed the following steps:
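A file-to-raft migration of this kind is typically driven by `vault operator migrate` with a small config; a sketch, where all paths and addresses are assumptions for illustration:

```shell
# All paths and addresses below are assumptions for illustration.
cat > migrate.hcl <<'EOF'
storage_source "file" {
  path = "/vault/file"
}

storage_destination "raft" {
  path    = "/vault/raft"
  node_id = "node01"
}

cluster_addr = "https://node01:8201"
EOF
```

With Vault stopped on node01, `vault operator migrate -config=migrate.hcl` copies the data into the raft directory; `cluster_addr` is required when the destination is raft.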
Now we switch to node02, start Vault there and join it to the cluster:
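The join step would look roughly like this; the leader's API address is an assumption:

```shell
# Sketch of the join from node02; the leader's API address is an assumption.
join_node02() {
  vault operator raft join https://node01:8200  # point at the current leader
  vault operator unseal                         # unseal so replication can start
}
# Usage (on node02): join_node02
```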
Problem: As soon as we unseal node02, it starts communicating with node01 but runs into the following error:
On node01 we see the following error message:
What's interesting is that node02 is logging the line "core: writing raft TLS keyring to storage". The raft cluster already has a keyring created by node01; node02 shouldn't create a keyring as well, should it?!
The behaviour is reproducible, the error messages stay exactly the same on every join. Even if we restart both nodes the cluster join never succeeds.
Workaround: However, if we delete about 20,000 secrets before joining node02 to the cluster, then the join works without problems. So it cannot be a permission or network issue; it looks more like a timing issue. Can you have a look into this?