hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Broken rotation of cert/key generated/used by Vault for Consul secrets engine backend #26670

Open cheeseburgermotivated opened 5 months ago

cheeseburgermotivated commented 5 months ago

**Describe the bug**
Rotating the client certificate/key used by a Consul secrets backend in Vault does not seem possible when Vault itself was used to bootstrap the Consul ACL system. The /{mount}/config/access endpoint, which is used to set the certificate/key, overwrites all fields. When we send an updated cert without the bootstrap token, Vault assumes that we want to bootstrap the ACL system of the associated Consul cluster, and that fails because Consul refuses to bootstrap a second time. We don't have the token because Vault swallows it when it issues the bootstrap command to Consul, and we shouldn't need to know it, since Vault is in charge of the process and should have it stored somewhere.

This failure looks like:

vault_pki_secret_backend_cert.consul-dev-client-cert: Destroying... [id=internalca/vault-consul-dev/consul-dev-client.vault.dev.supercompany.com]
vault_pki_secret_backend_cert.consul-dev-client-cert: Destruction complete after 0s
vault_pki_secret_backend_cert.consul-dev-client-cert: Creating...
vault_pki_secret_backend_cert.consul-dev-client-cert: Creation complete after 1s [id=internalca/vault-consul-dev/consul-dev-client.vault.dev.supercompany.com]
vault_consul_secret_backend.consul-dev: Modifying... [id=consul-dev]
╷
│ Error: error configuring Consul configuration for "consul-dev": Error making API request.
│
│ URL: PUT https://vault.dev.supercompany.com:8200/v1/consul-dev/config/access
│ Code: 400. Errors:
│
│ * Token not provided and failed to bootstrap ACLs: Unexpected response code: 403 (Permission denied: rpc error making call: ACL bootstrap no longer allowed (reset index: 12345))
│
│   with vault_consul_secret_backend.consul-dev,
│   on consul-dev.tf line 32, in resource "vault_consul_secret_backend" "consul-dev":
│   32: resource "vault_consul_secret_backend" "consul-dev" {
│
╵

The error above did not cause an outage immediately, however. The old certificate was deleted and the new one was generated, but the Consul secrets engine was never given the new certificate. Once the old certificate actually expired, communication between Vault and Consul stopped. That failure looked like this:

Vault error occurred: Put "https://consul.dev.supercompany.com:8501/v1/acl/token": remote error: tls: bad certificate, on get https://vault.dev.supercompany.com:8200/v1/consul-dev/creds/consul-server
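
Because the outage only appears once the old certificate expires, it can help to check the remaining validity of the client certificate ahead of time. The following is a minimal sketch, assuming the certificate Vault presents to Consul has been exported to a local PEM file (the file name in the usage line is hypothetical). It relies on `openssl x509 -checkend`, which exits 0 when the certificate is still valid after the given number of seconds.

```shell
#!/usr/bin/env bash
# cert_expires_within <cert.pem> <seconds>
# Exit status 0: the certificate expires within the window (time to rotate).
# Exit status 1: the certificate is still valid beyond the window.
# An unreadable or invalid file is also reported as "expiring", so that a
# missing certificate gets attention instead of passing silently.
cert_expires_within() {
  local cert_file="$1" window_seconds="$2"
  # openssl exits 0 if the cert is still valid in $window_seconds, non-zero otherwise.
  if openssl x509 -in "$cert_file" -noout -checkend "$window_seconds" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

For example, `cert_expires_within vault-consul-client.crt 2400 && echo "rotate now"` would flag a certificate with less than 40 minutes of validity left.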

**To Reproduce**
Steps to reproduce the behavior:

1. Create docker network
  a. `docker network create --driver bridge bstok`

2. Launch Vault container
  a. `docker run -dit --name vault.superveryreallydefinitelybogus.com --network bstok -p 8200:8200 --cap-add=IPC_LOCK -e 'VAULT_DEV_ROOT_TOKEN_ID=myroot' -e 'VAULT_DEV_LISTEN_ADDRESS=0.0.0.0:8200' hashicorp/vault`

3. Create sample PKI infrastructure using Terraform
  a. `export VAULT_ADDR=http://localhost:8200`
  b. `export VAULT_TOKEN=myroot`
  c. `mkdir -p bstok/terraform bstok/docker/consul/config/tls`
  d.

    
    cat << 'EOF' > bstok/terraform/main.tf
    terraform {
      required_version = ">= 1.5.3"
      required_providers {
        vault = "~> 3.18.0"
      }
    }

    provider "vault" {
      address = "http://localhost:8200"
    }

    variable "base_domain" {
      type        = string
      description = "The domain name the CA will issue certificates for"
      default     = "superveryreallydefinitelybogus.com"
    }

    # root CA
    resource "vault_mount" "pki_root" {
      path        = "pki_root"
      type        = "pki"
      description = "This is an example PKI root"

      max_lease_ttl_seconds = 315360000 #10y
    }

    resource "vault_pki_secret_backend_root_cert" "root" {
      backend     = vault_mount.pki_root.path
      type        = "internal"
      common_name = var.base_domain
      ttl         = "87600h" #10y
    }

    resource "vault_pki_secret_backend_config_urls" "config_urls" {
      backend                 = vault_mount.pki_root.path
      issuing_certificates    = ["http://localhost:8200/v1/pki/ca"]
      crl_distribution_points = ["http://localhost:8200/v1/pki/crl"]
    }

    # intermediate CA
    resource "vault_mount" "pki_intermediate" {
      path        = "pki_intermediate"
      type        = "pki"
      description = "This is an example PKI intermediate"

      max_lease_ttl_seconds = 15780000 #5y
    }

    resource "vault_pki_secret_backend_intermediate_cert_request" "intermediate_request" {
      backend     = vault_mount.pki_intermediate.path
      type        = "internal"
      common_name = "${var.base_domain} Intermediate Authority"
    }

    resource "vault_pki_secret_backend_root_sign_intermediate" "signed_intermediate" {
      backend     = vault_mount.pki_root.path
      csr         = vault_pki_secret_backend_intermediate_cert_request.intermediate_request.csr
      common_name = vault_pki_secret_backend_intermediate_cert_request.intermediate_request.common_name
    }

    resource "vault_pki_secret_backend_intermediate_set_signed" "set_signed" {
      backend     = vault_mount.pki_intermediate.path
      certificate = vault_pki_secret_backend_root_sign_intermediate.signed_intermediate.certificate
    }

    # roles
    resource "vault_pki_secret_backend_role" "server_role" {
      backend = vault_mount.pki_intermediate.path
      name    = "server_role"
      max_ttl = 259200 #72h

      allowed_domains    = [var.base_domain]
      allowed_uri_sans   = ["server.dc1.consul"]
      allow_any_name     = false
      allow_glob_domains = true
      allow_ip_sans      = true
      allow_subdomains   = true
      enforce_hostnames  = true

      client_flag = false
      server_flag = true
    }

    # client certs
    resource "vault_pki_secret_backend_cert" "vault-consul" {
      backend = vault_mount.pki_intermediate.path
      name    = vault_pki_secret_backend_role.server_role.name

      common_name           = "consul.${var.base_domain}"
      ttl                   = 3000
      min_seconds_remaining = 2400
      auto_renew            = true
    }

    resource "vault_pki_secret_backend_cert" "consul-server" {
      backend = vault_mount.pki_intermediate.path
      name    = vault_pki_secret_backend_role.server_role.name

      common_name           = "consul.${var.base_domain}"
      ttl                   = 28800
      min_seconds_remaining = 14400
      auto_renew            = true
    }

    # consul secrets engine
    ## resource "vault_consul_secret_backend" "consul" {
    ##   path      = "consul"
    ##   address   = "consul.superveryreallydefinitelybogus.com:8501"
    ##   scheme    = "https"
    ##   bootstrap = true
    ##   ca_cert   = vault_pki_secret_backend_cert.vault-consul.ca_chain
    ##   client_cert = join("\n", [
    ##     vault_pki_secret_backend_cert.vault-consul.certificate,
    ##     vault_pki_secret_backend_cert.vault-consul.ca_chain
    ##   ])
    ##   client_key = vault_pki_secret_backend_cert.vault-consul.private_key
    ## }
    EOF


  e. `cd bstok/terraform`
  f. `terraform init`
  g. `terraform plan -out=myplan`
  h. `terraform apply myplan`

5. Request certs for Consul
  a. `cd ..`
  b. `vault write -format=json pki_intermediate/issue/server_role common_name="consul.superveryreallydefinitelybogus.com" uri_sans="server.dc1.consul" > certout.json`
  c. `cat certout.json | jq -r '.data.private_key' > docker/consul/config/tls/cert.key`
  d. `cat certout.json | jq -r '.data.ca_chain[0]' > docker/consul/config/tls/ca.crt`
  e. `cat certout.json | jq -r '.data.certificate' > docker/consul/config/tls/cert.crt`
  f. `cat docker/consul/config/tls/cert.crt docker/consul/config/tls/ca.crt > docker/consul/config/tls/certcombined.crt`

6. Create small consul config file
  a.

    cat << EOF > docker/consul/config/config.json
    {
      "server": true,
      "log_level": "DEBUG",
      "node_name": "consul-docker",
      "tls": {
        "defaults": {
          "ca_file": "/consul/config/tls/ca.crt",
          "cert_file": "/consul/config/tls/certcombined.crt",
          "key_file": "/consul/config/tls/cert.key",
          "verify_incoming": true,
          "verify_outgoing": true
        }
      },
      "acl": {
        "enabled": true,
        "default_policy": "allow"
      },
      "ports": {
        "https": 8501,
        "grpc_tls": 8503
      },
      "encrypt": "pmsKacTdVOb4x8/Vtr9PWw=="
    }
    EOF


7. Launch consul container
 a. `docker run  -dit --name consul.superveryreallydefinitelybogus.com --network bstok -p 8300:8300 -p 8501:8501 -p 8600:8600/udp -v $(pwd)/docker/consul:/consul hashicorp/consul:1.15.10 agent -server -bootstrap -ui -client=0.0.0.0`

8. Validate that Consul is running
  a. `docker exec -it consul.superveryreallydefinitelybogus.com consul members`

9. Uncomment Vault<->Consul backend creation in TF for Vault
  a. `cd terraform`
  b. `sed -i.commented 's/^##//' main.tf`

10. Create Vault<->Consul backend
  a. `terraform plan -out=myplan`
  b. `terraform apply myplan`
    i. Should see from TF and Consul logs

    vault_consul_secret_backend.consul: Creating...
    vault_consul_secret_backend.consul: Creation complete after 0s [id=consul]

    2024-04-26 11:07:45 2024-04-26T15:07:45.084Z [INFO] agent.server.acl: ACL bootstrap completed
    2024-04-26 11:07:45 2024-04-26T15:07:45.087Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/acl/bootstrap from=172.20.0.2:55116 latency=6.781417ms
    2024-04-26 11:08:48 2024-04-26T15:08:48.791Z [DEBUG] agent: Skipping remote check since it is managed automatically: check=serfHealth


11. Rotate client cert used with Vault<->Consul
  a. `terraform plan -replace vault_pki_secret_backend_cert.vault-consul -out=myplan`
    i. Should see `# vault_consul_secret_backend.consul will be updated in-place` with a new cert and key
    ii. Should see `# vault_pki_secret_backend_cert.vault-consul must be replaced`
  b. `terraform apply myplan`
    i. boom :(
% terraform apply myplan
vault_pki_secret_backend_cert.vault-consul: Destroying... [id=pki_intermediate/server_role/consul.superveryreallydefinitelybogus.com]
vault_pki_secret_backend_cert.vault-consul: Destruction complete after 0s
vault_pki_secret_backend_role.server_role: Modifying... [id=pki_intermediate/roles/server_role]
vault_pki_secret_backend_role.server_role: Modifications complete after 0s [id=pki_intermediate/roles/server_role]
vault_pki_secret_backend_cert.vault-consul: Creating...
vault_pki_secret_backend_cert.vault-consul: Creation complete after 3s [id=pki_intermediate/server_role/consul.superveryreallydefinitelybogus.com]
vault_consul_secret_backend.consul: Modifying... [id=consul]
╷
│ Error: error configuring Consul configuration for "consul": Error making API request.
│
│ URL: PUT http://localhost:8200/v1/consul/config/access
│ Code: 400. Errors:
│
│ * Token not provided and failed to bootstrap ACLs: Unexpected response code: 403 (Permission denied: ACL bootstrap no longer allowed (reset index: 23))
│
│   with vault_consul_secret_backend.consul,
│   on main.tf line 125, in resource "vault_consul_secret_backend" "consul":
│  125: resource "vault_consul_secret_backend" "consul" {
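
Step 11 forces the rotation with `-replace`, but the repro config would also reach this state on its own. With `auto_renew = true`, the Terraform provider plans a new certificate once the remaining validity drops below `min_seconds_remaining`. The sketch below works through that arithmetic for the two certificates in `main.tf`; the helper name is made up for illustration.

```shell
#!/usr/bin/env bash
# seconds_until_renewal <ttl> <min_seconds_remaining>
# Prints how many seconds after issuance a certificate enters its renewal window.
seconds_until_renewal() {
  local ttl="$1" min_remaining="$2"
  if [ "$min_remaining" -ge "$ttl" ]; then
    # The window covers the whole lifetime: the cert is due for renewal immediately.
    echo 0
  else
    echo $(( ttl - min_remaining ))
  fi
}

seconds_until_renewal 3000 2400    # vault-consul client cert: prints 600
seconds_until_renewal 28800 14400  # consul-server cert: prints 14400
```

So any `terraform plan` run more than 10 minutes after the first apply should already propose replacing the `vault-consul` certificate, which then triggers the failing `config/access` update.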


**Expected behavior**
We expected just the new client certificate+key to be set for the secret backend.
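
The difference between the observed and the expected handling of a `config/access` write can be summarized as a small decision table. This is only a simplified model of the behavior described in this report, not Vault's source code; the function name and the outcome strings are invented for illustration.

```shell
#!/usr/bin/env bash
# config_access_outcome <token-in-request> <acl-bootstrapped: yes|no> <mode: observed|expected>
# Models what a write to /{mount}/config/access does with the token field.
config_access_outcome() {
  local token="$1" bootstrapped="$2" mode="$3"
  if [ -n "$token" ]; then
    # A token in the request is always used as-is.
    echo "store new cert/key, use supplied token"
  elif [ "$bootstrapped" = "no" ]; then
    # First-time setup: Vault bootstraps the ACL system and keeps the token.
    echo "bootstrap ACLs, store returned token"
  elif [ "$mode" = "expected" ]; then
    # What this report asks for: keep the token Vault already holds.
    echo "store new cert/key, keep stored token"
  else
    # What happens today: a second bootstrap attempt that Consul rejects.
    echo "error 400: ACL bootstrap no longer allowed"
  fi
}

config_access_outcome "" yes observed
config_access_outcome "" yes expected
```

The last two calls print the error outcome and the expected outcome for the same request, which is the whole gap between the current and the desired behavior.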

**Environment:**
* Vault Server Version: 1.14.1
* Vault CLI Version: 1.14.1
* Server Operating System/Architecture: Debian 11

Vault server configuration file(s): Can reproduce with bare dev container as shown above

**Additional context**
I have to believe that we are doing something incorrectly here; otherwise this seems like a big oversight.
Probably related report: https://github.com/hashicorp/vault/issues/9056

heatherezell commented 4 months ago

Hi there - I asked our resident Vault/Consul expert and she said that if you're in need of workarounds, this might help. In the meantime I'll bring this to our engineering teams as well. Thanks!

cheeseburgermotivated commented 4 months ago

Thanks for the update. We kinda-sorta did something similar to recover. We turned off verify_incoming for the Consul server nodes, requested a new global-management token from Vault, logged in to the UI with that token, dug around in the issued tokens until we happened to find the token that was initially issued for bootstrap, put that token at a path in Vault, and then used data.vault_generic_secret to pull that token from Vault for use within the vault_consul_secret_backend. Then we turned verify_incoming back on. However, I do like your idea better so I'll play with it in our lab.
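
For anyone doing the same recovery from the CLI instead of Terraform, the key point is that every field, including the token, has to be sent on each write to `config/access`. Below is a rough sketch under stated assumptions: the wrapper name, the `VAULT_CMD` dry-run switch, and the file names are made up, while the field names (`address`, `scheme`, `token`, `ca_cert`, `client_cert`, `client_key`) and the `key=@file` syntax of `vault write` are the standard ones.

```shell
#!/usr/bin/env bash
# rotate_consul_access <mount> <address> <token> <ca.pem> <client.pem> <client.key>
# Rewrites the whole Consul access config so that Vault never sees a request
# without a token and therefore never attempts a second ACL bootstrap.
# Set VAULT_CMD=echo for a dry run that prints the arguments instead of calling Vault.
rotate_consul_access() {
  local mount="$1" address="$2" token="$3" ca_file="$4" cert_file="$5" key_file="$6"
  if [ -z "$token" ]; then
    echo "refusing to write $mount/config/access without a token" >&2
    return 1
  fi
  # key=@file tells the vault CLI to read the value from that file.
  "${VAULT_CMD:-vault}" write "$mount/config/access" \
    address="$address" \
    scheme="https" \
    token="$token" \
    ca_cert=@"$ca_file" \
    client_cert=@"$cert_file" \
    client_key=@"$key_file"
}
```

A call could look like `rotate_consul_access consul consul.superveryreallydefinitelybogus.com:8501 "$token" ca.crt client.crt client.key`, with `$token` read from wherever the recovered bootstrap token was stored. Note that passing the token as an argument exposes it in the process list, and the dry run prints it, so this is better suited to a lab than to production.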

Thanks for bringing this up to the engineering team as well as the Vault/Consul expert. I'll remain subscribed to this issue for future updates.