hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Remove stale raft nodes #10549

Closed awolde closed 3 years ago

awolde commented 3 years ago

Nodes in a raft cluster that have been shut down still linger in the raft configuration and clog Vault's logs, ultimately leading to raft cluster failure.

Typical log entries:

Dec 14 05:50:35 vault-pri-cdtv vault: 2020-12-14T05:50:35.242Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=10.0.0.214:8201 leader=10.0.0.208:8201
Dec 14 05:50:35 vault-pri-cdtv vault: 2020-12-14T05:50:35.959Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=10.0.0.212:8201 leader=10.0.0.208:8201
Dec 14 05:50:37 vault-pri-cdtv vault: 2020-12-14T05:50:37.791Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=10.0.0.210:8201 leader=10.0.0.208:8201
Dec 14 05:50:38 vault-pri-cdtv vault: 2020-12-14T05:50:38.905Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=10.0.0.207:8201 leader=10.0.0.208:8201
Dec 14 05:50:39 vault-pri-cdtv vault: 2020-12-14T05:50:39.125Z [WARN]  storage.raft: heartbeat timeout reached, starting election: last-leader=10.0.0.208:8201
Dec 14 05:50:39 vault-pri-cdtv vault: 2020-12-14T05:50:39.125Z [INFO]  storage.raft: entering candidate state: node="Node at 10.0.0.215:8201 [Candidate]" term=8
Dec 14 05:50:40 vault-pri-cdtv vault: 2020-12-14T05:50:40.552Z [INFO]  storage.raft: entering follower state: follower="Node at 10.0.0.215:8201 [Follower]" leader=
Dec 14 05:50:49 vault-pri-cdtv vault: 2020-12-14T05:50:49.128Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-pri-lnkp 10.0.0.211:8201}" error="dial tcp 10.0.0.211:8201: i/o timeout"
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.084Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.084Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.087Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.087Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.088Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.088Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.089Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
Dec 14 06:21:16 vault-pri-cdtv vault: 2020-12-14T06:21:16.089Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.240Z [WARN]  storage.raft: heartbeat timeout reached, starting election: last-leader=10.0.0.213:8201
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.240Z [INFO]  storage.raft: entering candidate state: node="Node at 10.0.0.215:8201 [Candidate]" term=10
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.242Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-pri-pr7t 10.0.0.210:8201}" error="write tcp 10.0.0.215:54478->10.0.0.210:8201: write: connection timed out"
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.243Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-pri-j87c 10.0.0.209:8201}" error="write tcp 10.0.0.215:38880->10.0.0.209:8201: write: connection timed out"
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.243Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-pri-fjbd 10.0.0.212:8201}" error="write tcp 10.0.0.215:50160->10.0.0.212:8201: write: connection timed out"
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.247Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-pri-ss55-version-0-0-11 10.0.0.218:8201}" error="dial tcp 10.0.0.218:8201: connect: connection refused"
Dec 14 06:21:21 vault-pri-cdtv vault: 2020-12-14T06:21:21.630Z [INFO]  storage.raft: duplicate requestVote for same term: term=10

I have an immutable deployment where I bring up 5 Vault nodes with auto-unseal, and upgrades are performed by updating the Managed Instance Group (MIG) in GCP. Removing one node at a time leaves the raft cluster in an inconsistent state. This typically happens when I run multiple upgrades of the MIG and a lot of nodes that have since been deleted are still lingering in the raft cluster.
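For reference, a minimal sketch of how those lingering peers can be listed, assuming the official github.com/hashicorp/vault/api Go client and VAULT_ADDR / VAULT_TOKEN set in the environment; it reads sys/storage/raft/configuration, the API behind vault operator raft list-peers, and the exact response shape is an assumption to verify against your Vault version:

package main

import (
    "fmt"
    "log"

    vault "github.com/hashicorp/vault/api"
)

func main() {
    // DefaultConfig honours VAULT_ADDR; NewClient also picks up VAULT_TOKEN.
    client, err := vault.NewClient(vault.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // GET sys/storage/raft/configuration, the API behind `vault operator raft list-peers`.
    resp, err := client.Logical().Read("sys/storage/raft/configuration")
    if err != nil || resp == nil {
        log.Fatalf("reading raft configuration: %v", err)
    }

    // The server list is nested under data.config.servers; each entry carries
    // node_id, address, voter and leader fields. Deleted MIG instances keep
    // showing up here until they are explicitly removed.
    config := resp.Data["config"].(map[string]interface{})
    for _, s := range config["servers"].([]interface{}) {
        srv := s.(map[string]interface{})
        fmt.Printf("node_id=%v address=%v voter=%v leader=%v\n",
            srv["node_id"], srv["address"], srv["voter"], srv["leader"])
    }
}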

My config:

cluster_addr = "http://10.0.0.215:8201"
api_addr = "http://10.0.0.215:8200"
ui = true
storage "raft" {
  path    = "/opt/vault/data/"
  node_id = "vault-pri-cdtv"

  retry_join {
    auto_join = "provider=gce project_name=project-id tag_value=vault-deplyment-v1.0.0"
    auto_join_scheme = "http"
  }
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable = true
}
seal "gcpckms" {
    project     = "project-id"
    region      = "us-central1"
    key_ring    = "vault-keyring"
    crypto_key  = "vault-key"
}

Describe the solution you'd like

Have something like Consul, where a node that is unresponsive for a couple of attempts is taken out of the cluster. I believe the flag in Consul is leave_on_terminate = true. That way I don't have to upgrade one node at a time: I can bring up a new set of 5 nodes, kill the old 5 nodes, and I'm done :)

Describe alternatives you've considered

I'm thinking of writing a cron job that pings the nodes in the cluster and removes them if they are not responsive (a sketch follows below). Cron jobs, at least on Linux, run at most once per minute, and a lot can happen in a raft cluster in a minute. I would also have to think about how to pass the auth token to the Vault nodes securely so that they can run the vault operator raft remove-peer command. Not ideal!
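A rough sketch of that cron-style cleanup, assuming the same Go client and environment variables as the listing above; the TCP-dial health check and the three-second timeout are arbitrary illustrative choices, and the removal call is the API equivalent of vault operator raft remove-peer:

package main

import (
    "log"
    "net"
    "time"

    vault "github.com/hashicorp/vault/api"
)

func main() {
    // Client picks up VAULT_ADDR and VAULT_TOKEN from the environment.
    client, err := vault.NewClient(vault.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Same read as `vault operator raft list-peers`.
    resp, err := client.Logical().Read("sys/storage/raft/configuration")
    if err != nil || resp == nil {
        log.Fatalf("reading raft configuration: %v", err)
    }
    config := resp.Data["config"].(map[string]interface{})
    servers := config["servers"].([]interface{})

    for _, s := range servers {
        srv := s.(map[string]interface{})
        nodeID := srv["node_id"].(string)
        addr := srv["address"].(string) // cluster address, e.g. 10.0.0.211:8201

        // Crude liveness probe: can we still open a TCP connection to the cluster port?
        if conn, dialErr := net.DialTimeout("tcp", addr, 3*time.Second); dialErr == nil {
            conn.Close()
            continue // peer is reachable, leave it alone
        }

        log.Printf("peer %s (%s) is unreachable, removing it", nodeID, addr)

        // API equivalent of `vault operator raft remove-peer <node_id>`.
        if _, werr := client.Logical().Write("sys/storage/raft/remove-peer", map[string]interface{}{
            "server_id": nodeID,
        }); werr != nil {
            log.Printf("failed to remove %s: %v", nodeID, werr)
        }
    }
}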

swayne275 commented 3 years ago

Hey, thanks! We're working on bringing some of these features to Vault.

ncabatoff commented 3 years ago

Vault 1.7 includes basic Autopilot support, which covers removal of stale nodes.
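For anyone landing here, a hedged sketch of enabling Autopilot's dead-server cleanup on Vault 1.7+ via the sys/storage/raft/autopilot/configuration endpoint (the CLI equivalent is vault operator raft autopilot set-config); the threshold values shown are illustrative, not recommendations:

package main

import (
    "log"

    vault "github.com/hashicorp/vault/api"
)

func main() {
    // Client picks up VAULT_ADDR and VAULT_TOKEN from the environment.
    client, err := vault.NewClient(vault.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // POST sys/storage/raft/autopilot/configuration: ask Autopilot to clean up
    // voters it has not heard from for longer than the threshold, but never
    // prune the cluster below min_quorum voters.
    _, err = client.Logical().Write("sys/storage/raft/autopilot/configuration", map[string]interface{}{
        "cleanup_dead_servers":               true,
        "dead_server_last_contact_threshold": "5m",  // illustrative value
        "min_quorum":                         3,     // never prune below this many voters
        "server_stabilization_time":          "10s", // illustrative value
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("autopilot dead-server cleanup enabled")
}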

fancybear-dev commented 1 year ago

FYI: this is not an Autopilot error. We hit this exact same issue upgrading via the rolling MIG (with gcpckms + raft + TLS auto-join) as recently as Vault 1.12.0. The root cause seems to be related to the total number of nodes in your cluster.

What doesn't work:

current 3 nodes -> new version with MIG -> add 3 new version nodes -> 6 total nodes (3 old version, 3 new) -> delete 3 old nodes -> issue as described here.

What works:

current 5 nodes -> new version with MIG -> add 3 new version nodes -> 8 total nodes (5 old version, 3 new) -> delete 3 old nodes -> repeat until all 5 nodes are new version

We suspect it has to do with this:

In the situation that does not work, we scale to a node count that does not allow us to scale back down to the original count. With 6 voters, raft quorum is 4, so deleting 3 old nodes at once leaves only 3 reachable voters and the cluster loses quorum before the stale peers can be removed; with 8 voters, quorum is 5, and the 5 surviving nodes can still commit the membership changes (a small illustration follows below).
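A small illustration of the quorum arithmetic behind that suspicion (raft needs floor(n/2) + 1 voters to commit anything, including peer removals):

package main

import "fmt"

// quorum returns the number of voters raft needs to commit: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
    // Failing flow: 3 old + 3 new = 6 voters, then the 3 old nodes are deleted at once.
    fmt.Printf("6 voters: quorum=%d, 3 left after deletion -> below quorum\n", quorum(6))

    // Working flow: 5 old + 3 new = 8 voters, then 3 old nodes are deleted at once.
    fmt.Printf("8 voters: quorum=%d, 5 left after deletion -> still at quorum\n", quorum(8))
}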

HashiCorp seems to want you to just buy Enterprise, which has automated upgrade functionality built in, so documentation on automating upgrades of the open-source version in production is lacking. Their current documentation simply says to do it manually (https://developer.hashicorp.com/vault/docs/upgrading#ha-installations).