hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Leader unresponsive, unavailable #29033

Open ellisroll-b opened 3 days ago

ellisroll-b commented 3 days ago

Environment:

Vault Config File:

# Paste your Vault config here.
# Be sure to scrub any sensitive values
ui                                  = false
disable_mlock                       = true                            #integrated storage - mlock should NOT be enabled
cluster_addr                        = "http://TEMPLATE_HOST_IP:8201"  #must include protocol, replaced by auto_join
api_addr                            = "http://TEMPLATE_HOST_IP:8200"  #must include protocol, replaced by auto_join
enable_response_header_hostname     = "true"
enable_response_header_raft_node_id = "true"

# Log at info level, to files in the listed directory named /var/log/vault/vault-<epoch>.log, where <epoch> is a timestamp.
# Max log file size of 5 MB, keeping at most 2 rotated files in addition to the current one.
log_level            = "info"
log_file             = "/var/log/vault/"
log_format           = "json"
log_rotate_bytes     = 5242880
log_rotate_max_files = 2

storage "raft" {
  path      = "/mnt/data/vault"
  node_id   = "TEMPLATE_EC2_INST_ID"
  retry_join {
    auto_join        = "provider=aws region=TEMPLATE_AWS_REGION tag_key=VAJC tag_value=TEMPLATE_ACCOUNT_ID-TEMPLATE_DC_NAME-vault-cluster addr_type=private_v4"
    auto_join_scheme = "http"
  }
}

# HTTP listener
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}

# HTTPS listener
#listener "tcp" {
#  address       = "0.0.0.0:8200"
#  tls_cert_file = "/opt/vault/tls/tls.crt"
#  tls_key_file  = "/opt/vault/tls/tls.key"
#  #tls_client_ca_file = "/opt/vault/tls/vault-ca.pem"
#}

# AWS KMS auto unseal
seal "awskms" {
  region      = "TEMPLATE_AWS_REGION"
  kms_key_id  = "TEMPLATE_KMS_KEY_ID"
}

Startup Log Output:

# Paste your log output here

Expected Behavior:

Ability to change the leader when the leader is unresponsive. It appears the operator step-down command is forwarded from a healthy node to the leader node, and is therefore unresponsive as well.
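
For reference, a minimal sketch of what we attempted (the address matches the tcp listener in the config above; the token is a placeholder):

# Run from a healthy standby node.
export VAULT_ADDR="http://127.0.0.1:8200"
export VAULT_TOKEN="<admin token>"

# step-down acts on the active node; a standby forwards the request to the
# leader, so with the leader hung this call also hangs / times out.
vault operator step-down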

Actual Behavior:

1) Vault logins from our other services (AppRole) started failing.
2) Even though we have log rotation/truncation, a developer looked at the exported logs (we export them to an S3 bucket) of one node of the 5-node cluster and found an error saying it was out of disk space. That node happened to be the leader.
3) We attempted to jump into the instance (a Docker container running on an AWS EC2 instance) and could not connect to the EC2 instance at all. AWS reported the instance as healthy and fine.
4) We jumped into one of the other instances (of 5) and attempted a local login. This was forwarded to the unresponsive leader and timed out (see the sketch of local checks after this list).
5) We needed to get back to health, so we killed the leader EC2 instance. Vault began working again (an election happened), a new EC2 instance spun up, and the replacement node joined the cluster.
6) However, this operational response lost the EC2 instance, and with it the ability to debug what was happening in the node or the Docker container. My own theory is that networking on the EC2 instance was broken and it had not yet reached the point where AWS reported the instance as unhealthy.
7) Questions:
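
For item 4, a minimal sketch of the kind of local checks we could still run from a standby (the address matches the tcp listener above; as far as I know these endpoints are answered by the local node rather than forwarded to the leader):

# On a standby node.
export VAULT_ADDR="http://127.0.0.1:8200"

# Seal/HA status of this node only (unauthenticated, not forwarded).
vault status

# Local health check; standbyok=true makes a healthy standby return 200.
curl -s "$VAULT_ADDR/v1/sys/health?standbyok=true"

# This node's view of the leader (leader_address / leader_cluster_address).
curl -s "$VAULT_ADDR/v1/sys/leader"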

Steps to Reproduce:

Unfortunately I do not have the means to reproduce this, or logs off the killed node.

Important Factoids:

References:

maheshpoojaryneu commented 3 days ago

In addition to the details already shared, the follower's logs have no messages indicating any communication failures to the leader node. Also, while trying to log in to Vault from the follower node, it returned a 500. Below is the error that was returned:

Password (will be hidden):
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/userpass/login/admin
Code: 500. Errors:

* internal error
bosouza commented 11 hours ago

Hi, thanks for the detailed report! As recently discussed in #28846, Vault doesn't have a mechanism to react to a full disk; the Raft heartbeats continue to work in this situation, so a leader election isn't triggered. You're right that the step-down command would get forwarded to the unresponsive leader, and since the leader cannot write to its audit device it won't even attempt the operation.
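
As an illustration (not something Vault does itself), a minimal sketch of a local check for this failure mode, using the storage and log paths from the posted config and an arbitrary 80% threshold:

#!/bin/sh
# Warn when the filesystems backing the raft data and log directories are
# nearly full. Paths come from the posted config; GNU df is assumed.
THRESHOLD=80
for path in /mnt/data/vault /var/log/vault; do
  # Prints e.g. " 83%", stripped down to the digits.
  usage=$(df --output=pcent "$path" | tail -n 1 | tr -dc '0-9')
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: $path is ${usage}% full" >&2
  fi
done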

For situations like this, forcing a leader election by shutting down the node in some way is a good approach. Another, slightly different option you had was to increase the volume size, which might require figuring out the SSH problem in order to grow the partition at the OS level; or, if you're using an AMI that supports partition resizing during startup, you could just restart the node. If none of that was possible, you could also use the peers.json approach to restart the healthy nodes and make them forget the unresponsive node (a sketch follows).
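
In case it helps, a minimal sketch of that recovery path under the posted config (the node IDs and IPs are placeholders; the IDs must match each remaining node's raft node_id, and the addresses are the cluster addresses on port 8201):

# With Vault stopped on every healthy node, place a peers.json listing only the
# nodes you want to keep into the raft directory under the configured storage
# path (/mnt/data/vault here), then restart Vault; the file is consumed at startup.
cat > /mnt/data/vault/raft/peers.json <<'EOF'
[
  { "id": "i-0aaaaaaaaaaaaaaa1", "address": "10.0.1.11:8201", "non_voter": false },
  { "id": "i-0bbbbbbbbbbbbbbb2", "address": "10.0.1.12:8201", "non_voter": false },
  { "id": "i-0ccccccccccccccc3", "address": "10.0.1.13:8201", "non_voter": false },
  { "id": "i-0ddddddddddddddd4", "address": "10.0.1.14:8201", "non_voter": false }
]
EOF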

Generally, infrastructure monitoring should be used to alert about such OS-level status, giving you time to increase the disk size before the cluster goes offline.
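
For example, a sketch only: this assumes the CloudWatch agent is publishing disk_used_percent under the CWAgent namespace, and the instance ID, path, dimension names, and SNS topic are placeholders that need to match your agent configuration.

# Alarm when a Vault node's data volume passes 80% usage.
aws cloudwatch put-metric-alarm \
  --alarm-name "vault-node-disk-used" \
  --namespace "CWAgent" \
  --metric-name "disk_used_percent" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 Name=path,Value=/mnt/data \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts"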