Hey, thanks! We're working on bringing some of these features to Vault.
Vault 1.7 includes basic Autopilot support, which covers removal of stale nodes.
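For anyone finding this later: the dead-server cleanup is disabled by default and has to be turned on explicitly. A minimal sketch, assuming an already-authenticated CLI session (the flag values are illustrative, not recommendations):

```shell
# Enable Autopilot's dead-server cleanup (off by default).
# -min-quorum guards against cleanup shrinking the cluster below quorum.
vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=10m \
    -min-quorum=3

# Verify the resulting configuration.
vault operator raft autopilot get-config
```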
FYI: this is not an Autopilot error. We hit this exact issue upgrading via a regional MIG (with gcpkms auto-unseal + raft + TLS auto-join) on versions as recent as Vault 1.12.0. The root cause seems to be related to the total number of nodes in your cluster.
What doesn't work:
current 3 nodes -> switch MIG to the new version -> add 3 new-version nodes -> 6 total nodes (3 old, 3 new) -> delete the 3 old nodes -> issue as described here.
What works:
current 5 nodes -> switch MIG to the new version -> add 3 new-version nodes -> 8 total nodes (5 old, 3 new) -> delete 3 old nodes -> repeat until all 5 nodes run the new version (rough gcloud sketch below).
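In gcloud terms, the working flow looks roughly like this; a sketch, assuming a regional MIG named `vault-mig` in `us-central1` with old instances named `vault-old-1..3` (all names hypothetical):

```shell
# Grow from 5 old-version nodes to 8 by adding 3 new-version nodes
# (the MIG is assumed to already reference the new instance template).
gcloud compute instance-groups managed resize vault-mig \
    --region=us-central1 --size=8

# Once the new nodes have joined and the cluster is healthy,
# delete 3 old-version instances explicitly.
gcloud compute instance-groups managed delete-instances vault-mig \
    --region=us-central1 \
    --instances=vault-old-1,vault-old-2,vault-old-3

# Repeat until every remaining node runs the new version.
```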
We suspect it comes down to raft quorum. In the failing scenario you temporarily run 6 voters, so quorum is 4; deleting the 3 old nodes at once leaves only 3 live voters, below quorum, so the cluster can no longer commit the membership change that would shrink it back to the original node count. In the working scenario, 8 voters give a quorum of 5, and the 5 surviving nodes can still commit the removal of the 3 deleted peers.
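One way to stay out of that trap is to shrink the raft membership before deleting the instances, so quorum is recalculated while all remaining voters are healthy. A minimal sketch, assuming an authenticated CLI session and a peer ID `vault-old-1` taken from the peer list (the ID is hypothetical):

```shell
# Inspect the current raft membership and voter status.
vault operator raft list-peers

# Remove an old node from the raft configuration *before*
# deleting its instance, so the cluster shrinks cleanly.
vault operator raft remove-peer vault-old-1
```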
HashiCorp seems to want you to just buy Enterprise, which has automated-upgrade functionality built in, so documentation for automated production upgrades of the open-source version is lacking. Their current guidance is simply: do it manually (https://developer.hashicorp.com/vault/docs/upgrading#ha-installations).
Nodes in a raft cluster that have been shut down still linger in the raft database, clogging the Vault logs and ultimately leading to raft cluster failure.
Typical log entries:
I have an immutable deployment where I bring up 5 Vault nodes with auto-unseal, and upgrades are performed by changing the Managed Instance Group (MIG) in GCP. Removing one node at a time leaves the raft cluster in an inconsistent state. This typically happens after I have done multiple MIG upgrades and many already-deleted nodes are still lingering in the raft cluster.
My config:
**Describe the solution you'd like**
Have something like Consul, where a node that is unresponsive for a couple of checks is taken out of the cluster. I believe the relevant Consul flag is
leave_on_terminate = true
That way I don't have to upgrade one node at a time: I can bring up a new set of 5 nodes, kill the old 5, and I'm done :)

**Describe alternatives you've considered**
I'm thinking of writing a cron job that pings the nodes in the cluster and removes them if they are unresponsive. Cron (at least on Linux) fires at most once per minute, and a lot can happen in a raft cluster within a minute. I also have to think about how to pass the auth token to the Vault nodes securely so that they can run the
vault operator raft remove-peer
command. Not ideal!
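For concreteness, a minimal sketch of what that cron job could look like, assuming `VAULT_ADDR`/`VAULT_TOKEN` come from the environment, `jq` is installed, and each peer's API listens on port 8200 (all assumptions, and the token handling is exactly the part I'm unhappy about):

```shell
#!/usr/bin/env bash
# Remove raft peers whose Vault API no longer responds at all.
set -euo pipefail

# List peers as "node_id address" pairs.
vault operator raft list-peers -format=json \
  | jq -r '.data.config.servers[] | "\(.node_id) \(.address)"' \
  | while read -r node_id address; do
      host="${address%:*}"  # strip the cluster port from host:port
      # sys/health responds on standby/sealed nodes too, so any
      # successful response here means the node is alive; skip it.
      if ! curl -fsS -m 5 -o /dev/null \
           "https://${host}:8200/v1/sys/health?standbyok=true&sealedcode=200&uninitcode=200"; then
        echo "removing unresponsive peer ${node_id}"
        vault operator raft remove-peer "${node_id}"
      fi
    done
```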