hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

Leased secrets are not renewed on token change when using vault-agent #1433

Open danieleva opened 3 years ago

danieleva commented 3 years ago

Consul Template version

v0.25.1

Configuration

vault-agent

vault {
  address = "http://localhost:8200"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path                   = "./roleid"
      secret_id_file_path                 = "./secretid"
      remove_secret_id_file_after_reading = false
    }
  }

  sink "file" {
    config = {
      path = "./token"
    }
  }
}

cache {
  use_auto_auth_token = false
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}

template

{{ with secret "consul/creds/foo"}} {{ . }} {{end}}

Command

consul-template -vault-agent-token-file=./token -log-level=debug -template secret.tpl:secret.out

Expected behavior

consul-template reads the Vault token from -vault-agent-token-file and renders secrets according to the template. When the Vault token changes on disk, consul-template reloads it and refreshes the secrets in the template.

Actual behavior

The Vault token obtained via vault-agent has max_ttl=60m; secrets obtained from consul/creds/foo have ttl=50m. consul-template starts, reads the Vault token from the configured location, fetches the leased secret from Vault, and renders the template correctly. Roughly every 17 minutes or more, the dependency watcher triggers a run and the leased secret from consul/creds/foo is refreshed. At T+60m the Vault token reaches its max_ttl and can no longer be renewed. vault-agent re-authenticates and gets a new token, and Vault revokes all leased secrets generated with the old token. consul-template reloads the token from the file and updates its Vault client, but it does not refresh the templates. If the last lease renewal by consul-template happens close to the token's max_ttl, the templates keep the revoked secrets for an entire sleep loop.

The Fetch function for secrets (e.g. https://github.com/hashicorp/consul-template/blob/master/dependency/vault_read.go#L64-L68) offers no easy way that I can see to force a refresh without some refactoring/replumbing.

Steps to reproduce

It's easy to reproduce using the sample configuration above: set the token max_ttl to 3 minutes and the secret backend role TTL to 2 minutes.

eikenb commented 3 years ago

Hey @danieleva, thanks for reporting this.

I did some brief testing and verified that it picks up the token file change and reads in the new token. I tested this manually by starting CT with a blank template and the -vault-agent-token-file argument, then watched the logs as I touched or overwrote the contents of that file.

The token is then set on the Vault client using the standard call.

Basically using a simple, quick setup I tested what I thought were the 2 key systems and I don't see a problem with either. I'm currently doing just a quick sweep, so that's it for now.

danieleva commented 3 years ago

Hi, thanks for looking at this. The function that reloads the token from file is working as expected, and it updates the Vault client correctly. The problem is a missing hook/signal to trigger a forced refresh of all leased secrets. The loop responsible for that is not affected by the change in the Vault client, so there is a gap between the Vault client being updated with a new token and the Vault client being used to request secrets. During that time, all the secrets generated with the old token are at risk of being revoked by the Vault server.

eikenb commented 3 years ago

Thanks for the followup. I think I get it now. To restate, to be sure...

When the agent token is refreshed, it needs to trigger a general refresh of all Vault values instead of letting them wait for the usual timeout.

danieleva commented 3 years ago

That's spot on 👍 :)

drawks commented 3 years ago

Any updates on this?

eikenb commented 3 years ago

Hey @drawks,

This will probably be included in 0.27.1. I'm preparing 0.27.0 now. It was initially supposed to be a quick release to update the docker image, but it blew up a bit when I tried to squeeze in a shell parsing fix that turned out not to be fixable without rewriting a large part of that library, which led me to drop it. While I'd like to include this in 0.27.0, I think I've delayed that release too long already.

So I'll start working on it for the next-next release. Sorry I don't have better news.

hamishforbes commented 2 years ago

Hi, did this ever get fixed? I believe this is the root cause of the issues I'm seeing with 0.29.0 so I'm guessing not?

I am using the Vault Kubernetes integration to inject a Vault Agent into my pods, the agent then auths using the pod service account and writes the token to a shared volume. When the token expires and is renewed the agent sidecar writes a new token in. Consul-template runs in the main application container in exec mode so it can refresh credentials and restart the application process. This all works fine.

Except when my K8s token reaches max_ttl and a new token is written, consul template blows up trying to renew database credential leases that have been revoked because the parent token has expired. Or worse, consul-template fetches new database credentials, restarts the application and then seconds later the token expires and the (brand new) DB creds are deleted from underneath the app.

I can mitigate this somewhat by ensuring the Vault k8s auth method has a very high max_ttl, but it isn't a foolproof solution. If a pod somehow manages to stay running for long enough it'll still hit this issue. It's also obviously not ideal to have to set the auth token TTL super high; it would be preferable to actually let the auth token expire in a reasonable timeframe in case it gets leaked etc.

eikenb commented 2 years ago

I'm putting together a quick 0.29.1 to address an issue with a new feature, but I won't be working on this as part of that release. It touches some related functionality, and I don't want to mess too much with that and end up breaking the thing I'm putting the release out to fix.

It is on my radar for the next round of work. Just didn't want everyone to think I'd forgotten about this when it wasn't in this release. Sorry for the delay.

jsmilani commented 7 months ago

Any updates? We are experiencing a similar issue affecting the stability of a number of components we run with consul-template.

jsmilani commented 6 months ago

Checking in again. This is an urgent issue for stability when using token rotation. Can we get an updated timeline on the fix?

jsmilani commented 5 months ago

Ping. Please keep looking into resolving this.

jsmilani commented 4 months ago

Another ping to keep this alive.

nannapureddy commented 4 months ago

Any updates on this issue? It is impacting the stability of our environment, and we would really appreciate any help. Thank you.

jsmilani commented 1 month ago

Still hoping this gets addressed. Still causing instability issues. Please help.