ellisroll-b opened 3 days ago
In addition to the details already shared, the follower's logs have no messages indicating any communication failures to the leader node. Also, while trying to log in to Vault from the follower node, it returned a 500. Below is the error that was returned:
Password (will be hidden):
Error authenticating: Error making API request.
URL: PUT http://127.0.0.1:8200/v1/auth/userpass/login/admin
Code: 500. Errors:
* internal error
Hi, thanks for the detailed report! As recently discussed in #28846, Vault doesn't have a mechanism to react to a full disk; the Raft heartbeats continue to work in this situation, so a leader election isn't triggered. You're right that in this situation the step-down commands get forwarded to the unresponsive leader, and since it cannot write to the audit device it won't even attempt the operation.
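For reference, this is roughly what that looks like from a standby (a sketch only; the local address is taken from your report, and nothing here works around the wedged leader):

# Run on a healthy standby. Vault forwards the step-down request to the
# active node, so it hangs/fails while the leader cannot write its audit log.
export VAULT_ADDR=http://127.0.0.1:8200
vault operator step-down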
For situations like this, forcing a leader election by shutting down the node in some way is a good approach. Another option would have been to increase the volume size, which might require figuring out the SSH problem in order to grow the partition at the OS level; or, if you're using an AMI that supports partition resizing during startup, you can just restart the node. If none of that was possible, you could also use the peers.json approach to restart the healthy nodes and make them forget about the unresponsive node.
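A minimal sketch of the peers.json recovery, assuming a data path of /vault/data and illustrative node IDs/addresses (these must match your actual raft configuration):

# With Vault stopped on each healthy node, write a peers.json listing only
# the surviving peers into the raft directory under the storage path.
cat > /vault/data/raft/peers.json <<'EOF'
[
  { "id": "node2", "address": "10.0.2.10:8201", "non_voter": false },
  { "id": "node3", "address": "10.0.3.10:8201", "non_voter": false },
  { "id": "node4", "address": "10.0.4.10:8201", "non_voter": false }
]
EOF
# Restart Vault on those nodes; the file is read (and removed) on startup,
# and the cluster forgets the unresponsive peer.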
Generally, infrastructure monitoring should be used to alert about such OS-level status, giving you time to increase the disk size before the cluster goes offline.
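As a rough illustration of that kind of check (the mount point and threshold here are assumptions, not from your setup), even a simple cron'd script would flag the data volume before it fills:

# Example periodic check on each Vault node; adjust path and threshold.
USED=$(df --output=pcent /vault/data | tail -n1 | tr -dc '0-9')
if [ "$USED" -ge 80 ]; then
  echo "Vault data volume is ${USED}% full" | logger -t vault-disk-alert
fi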
Environment:
Vault Config File:
Startup Log Output:
Expected Behavior:
Ability to change the leader when the leader is unresponsive. It appears the operator step-down command is redirected from a healthy node to the leader node, so the command is unresponsive as well.
Actual Behavior:
1) Vault logins started failing from our other services (AppRole).
2) Even though we have log rotation/truncation, a developer looked at the exported logs (we export them to an S3 bucket) of one node of the 5-node cluster and found an error saying it was out of disk space. This node happened to be the leader.
3) We attempted to jump into the instance (a Docker container running on an AWS EC2 instance) and could not connect to the EC2 instance at all. AWS reported the instance was healthy and fine.
4) We jumped into one of the other instances (of 5) and attempted a local login. This was redirected to the unresponsive leader and timed out.
5) We needed to get back to health, so we killed the leader EC2 instance. Vault began working (an election happened), a new EC2 instance spun up, and the missing node joined the cluster.
6) However, this operational response lost the EC2 instance and the ability to debug what was happening in the node or Docker container. My own theory is that networking on the EC2 instance was messed up and it had not yet reached the point where AWS reported the instance as unhealthy.
7) Questions:
Steps to Reproduce:
Unfortunately I do not have the means to reproduce this, or logs from the killed node.
Important Factoids:
References: