hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

[Proxy] Unable to restart offline with persistent cache #28305

Open carlzogh opened 1 week ago

carlzogh commented 1 week ago

**Describe the bug**
Vault Proxy with persistent static secrets cache & auto-auth enabled is unable to start up offline without trying to connect to Vault Server to renew its token.

Unless the issue is misconfiguration on our side, this prevents us from relying on Vault Proxy for static stability / availability in cases when the Vault Proxy process is restarted during Vault Server unreachability.

**To Reproduce**

  1. Run Vault Proxy with the provided configuration, and request any secret to ensure the cache is created and the token + secret are persisted.
  2. Stop the Vault Proxy process, ensuring that the cache database is still present and will be used by the next run.
  3. Disconnect from the internet (e.g. turn Wi-Fi off) to emulate network unreachability.
  4. Run Vault Proxy again with the same configuration and observe that the process fails to start because it cannot connect to the Server. Example logs:
    
    Couldn't start vault with IPC_LOCK. Disabling IPC_LOCK, please use --cap-add IPC_LOCK
    ==> Vault Proxy started! Log data will stream in below:

    ==> Vault Proxy configuration:

           Api Address 1: http://0.0.0.0:8200
                     Cgo: disabled
               Log Level: trace
                 Version: Vault v1.17.5, built 2024-08-30T15:54:57Z
             Version Sha: 4d0c53e84094b8017d32b6e5b7f8142035c8837f

    2024-09-06T09:39:38.514Z [INFO] proxy.cache: cache configured: cache_static_secrets=true disable_caching_dynamic_secrets=true
    2024-09-06T09:39:38.516Z [TRACE] proxy.cache.cacheboltdb: closing bolt db: path=/var/run/cache/vault-agent-cache.db
    2024-09-06T09:39:38.519Z [TRACE] proxy.cache.leasecache: restored token: id=0EHrt
    2024-09-06T09:39:38.519Z [TRACE] proxy.cache.leasecache: restored token: id=0eE2O
    ... other tokens
    2024-09-06T09:39:38.521Z [TRACE] proxy.cache.leasecache: restored token: id=zBDQx
    2024-09-06T09:39:38.521Z [TRACE] proxy.cache.leasecache: restoring static secret index: id=b0b444679a0300f2cf23b576d35637fae0b93dfda757e1de9c2874eb4a50a7f7 path={namespace}/{kvv2_name}/data/{secret_path}
    2024-09-06T09:39:38.521Z [TRACE] proxy.cache.leasecache: restoring capability index: id=e477169eaf91f6a693570a52027f688cedf6ca986b6599ffabcec88bfe3281b0
    2024-09-06T09:39:38.521Z [INFO] proxy.cache: loaded memcache from persistent storage
    2024-09-06T09:39:38.521Z [DEBUG] proxy.apiproxy: configuring inmem auto-auth sink
    2024-09-06T09:39:38.521Z [DEBUG] proxy: would have sent systemd notification (systemd not present): notification=READY=1
    2024-09-06T09:39:38.521Z [INFO] proxy.cache.staticsecretcacheupdater: starting static secret cache updater subsystem
    2024-09-06T09:39:38.521Z [INFO] proxy.sink.server: starting sink server
    2024-09-06T09:39:38.521Z [INFO] proxy.auth.handler: starting auth handler
    2024-09-06T09:39:38.521Z [DEBUG] proxy.auth.handler: using preloaded token
    2024-09-06T09:39:38.521Z [DEBUG] proxy.auth.handler: lookup-self with preloaded token
    2024-09-06T09:39:38.523Z [ERROR] proxy.auth.handler: could not look up token: err="Get \"https://{vault_server}:8200/v1/auth/token/lookup-self\": dial tcp: lookup {vault_server} on 192.168.65.7:53: no such host" backoff=860ms
    2024-09-06T09:39:39.388Z [INFO] proxy.auth.handler: authenticating
    2024-09-06T09:39:39.392Z [ERROR] proxy.auth.handler: error authenticating: error="Put \"https://{vault_server}:8200/v1/auth/approle/login\": dial tcp: lookup {vault_server} on 192.168.65.7:53: no such host" backoff=860ms
    2024-09-06T09:39:40.382Z [INFO] proxy.auth.handler: authenticating
    ... similar logs

5. Any request to get a cached secret will fail as it waits on Vault Proxy to successfully validate its own token (will never succeed offline).
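
To check whether a captured log shows this failure mode, one option is to count the auth handler errors. The following is a minimal sketch, assuming the Proxy output was saved to a file. The file name `proxy.log` and the function name `count_auth_errors` are hypothetical; the two message strings are copied from the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count auth-handler errors in a captured Vault Proxy log.
# The two message strings below are taken from the log excerpt in this issue.
count_auth_errors() {
    log_file=$1
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so the exit status is ignored with "|| true".
    grep -c -E 'proxy\.auth\.handler: (could not look up token|error authenticating)' "$log_file" || true
}
```

For example, `count_auth_errors proxy.log` prints the number of failed lookup and login attempts. A count that keeps growing while the Proxy is offline matches the retry loop shown in the logs.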

**Expected behavior**
Vault Proxy is able to persist its authentication token and not need to perform a mandatory token lookup / refresh on startup if it is still valid.

**Environment**:

* Vault Server Version: `1.17.3` (Enterprise)
* Vault Proxy Version: `1.17.5`
* Vault CLI Version: `1.17.5`
* Server Operating System/Architecture: Linux / amd64
* Proxy Operating System/Architecture: macOS / arm64
* Client Operating System/Architecture: macOS / arm64

Vault Proxy configuration file (`vault-proxy-config.hcl`):

```hcl
# ref. https://developer.hashicorp.com/vault/docs/agent-and-proxy/autoauth
auto_auth {
    method "approle" {
        mount_path = "auth/approle"
        max_backoff = "10s"
        config = {
            role_id_file_path = "/etc/vault/role_id"  # secrets expected to be mounted as volumes
            secret_id_file_path = "/etc/vault/secret_id"  # secrets expected to be mounted as volumes
            remove_secret_id_file_after_reading = false
            exit_on_err = true
        }
    }
}

# ref. https://developer.hashicorp.com/vault/docs/agent-and-proxy/proxy/apiproxy
api_proxy {
    use_auto_auth_token = "force"
    enforce_consistency = "always"
    when_inconsistent = "retry"
}

# ref. https://developer.hashicorp.com/vault/docs/agent-and-proxy/proxy/caching
cache {
    disable_caching_dynamic_secrets = true
    # ref. https://developer.hashicorp.com/vault/docs/agent-and-proxy/proxy/caching/static-secret-caching
    cache_static_secrets = true
    static_secret_token_capability_refresh_interval = "1d"
    static_secret_token_capability_refresh_behavior = "optimistic"

    persist {
        type = "kubernetes"  # mocking k8s by providing a (secret) static service account JWT token as AAD
        path = "/var/run/cache"
        service_account_token_file = "/var/run/cache-persistence-token"
        exit_on_err = false
        keep_after_import = true
    }
}

# ref. https://developer.hashicorp.com/vault/docs/configuration/listener/tcp
listener "tcp" {
    address = "0.0.0.0:8200"
    tls_disable = true  # http local connections

    telemetry {
        unauthenticated_metrics_access = true
    }
}

# ref. https://developer.hashicorp.com/vault/docs/configuration/telemetry
telemetry {
    enable_hostname_label = true
}

# ref. https://developer.hashicorp.com/vault/docs/agent-and-proxy/proxy#vault-stanza
vault {
    # vault address is configured by means of the VAULT_ADDR environment variable
    # vault namespace is configured by means of the VAULT_NAMESPACE environment variable
    tls_skip_verify = true  # TODO: use actual certificate for instance through volume mounts
}
```

Running the Vault Proxy (version 1.17.5) with Docker:

    docker run --rm -it \
      -p 8200:8200 \
      -v "$(pwd)/vault-proxy/secrets/role_id:/etc/vault/role_id" \
      -v "$(pwd)/vault-proxy/secrets/secret_id:/etc/vault/secret_id" \
      -v "$(pwd)/vault-proxy/config/vault-proxy-config.hcl:/etc/vault/vault-proxy-config.hcl" \
      -v "$(pwd)/vault-proxy/config/cache-persistence-token:/var/run/cache-persistence-token" \
      -v "$(pwd)/vault-proxy/cache:/var/run/cache" \
      -e VAULT_ADDR=https://{vault_server}:8200 \
      -e VAULT_NAMESPACE={namespace} \
      --hostname $(hostname) \
      hashicorp/vault:1.17.5 \
      proxy -log-level trace -config /etc/vault/vault-proxy-config.hcl

carlzogh commented 3 days ago

Hey @heatherezell @VioletHynes - could you please confirm if this is the expected behavior or if this is something we've misconfigured in the Proxy?

VioletHynes commented 3 days ago

This seems like expected behaviour today. However, I agree it's unfortunate, and it'd be great if we could avoid this. To your point here:

> Vault Proxy is able to persist its authentication token and not need to perform a mandatory token lookup / refresh on startup if it is still valid.

This is the crux of the issue, and I'd call this the problem as opposed to the persistent cache itself. This is something that would provide great value with or without persistent caching.

Another big benefit here is that if my Auto Auth token has a 1 day TTL and I restart Agent/Proxy five times a day, we make one token instead of five.

This has been a known potential enhancement for a while, but I'm going to give this a think and see if there's any smart, easy way we can address this. The challenge is of course safely persisting a token, and ideally we'd want to do this in a way that provides value to users whether or not they use caching. I can't promise anything here timeline-wise (like I've said, this has been a known FR for a while), but I will promise I'll give this a good think at the very least.

Thanks for the issue and food for thought!

carlzogh commented 1 day ago

Thanks @VioletHynes, that makes sense.

In our use-case we wish to rely on Vault Proxy as a component in our system that provides statically stable access to cached (remote) Vault Server secrets, and unfortunately this is likely a deal-breaker for us. Would you happen to know if there is a workaround for us achieving this outcome today, until "proper" support for this is added to Vault?

Appreciate your help as always!

VioletHynes commented 1 day ago

Proxy isn't designed to be run in an airlocked environment without any access to Vault ever. While it is designed to be resilient to downtime, it's not designed to be run in an environment with no Vault access. This kind of feature to me would enable two things:

I don't see this so much as enabling a new use case (I don't know if we want to support Proxy in a completely airlocked environment) as I do bolstering the already strong story Proxy has for resilience to downtime.

The way we'd probably add this is by utilizing the token sink feature, which can already be configured to store a token. If I were implementing this I'd add some kind of option to token sinks that allows Agent/Proxy to check these for tokens before auto-authing. There are some complications that make it not a trivial change, and it's a pretty significant change to their operation.
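
The "check the sink before auto-authing" idea can be illustrated with a small sketch. This only shows the decision order described above; it is not current Agent/Proxy behaviour and not a workaround. The function names `token_from_sink` and `choose_auth_path` are hypothetical, and the sketch assumes an unwrapped, unencrypted file sink that contains only the token.

```shell
#!/bin/sh
# Hypothetical helper: print the token stored in a file sink, if there is one.
# Returns 0 and prints the token when the file exists and holds a token,
# returns 1 (printing nothing) otherwise.
token_from_sink() {
    sink_path=$1
    # -s: the file exists and has a size greater than zero.
    [ -s "$sink_path" ] || return 1
    # Strip whitespace such as a trailing newline around the token.
    token=$(tr -d '[:space:]' < "$sink_path")
    [ -n "$token" ] || return 1
    printf '%s\n' "$token"
}

# Hypothetical decision step: try the sink first, fall back to auto-auth.
choose_auth_path() {
    if token_from_sink "$1" > /dev/null; then
        echo "reuse-sink-token"
    else
        echo "auto-auth"
    fi
}
```

The sketch deliberately leaves out validation. As the logs in the report show, the current start-up path checks a preloaded token with `lookup-self`, which is the call that fails when Vault Server is unreachable.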

There's a chance I'll get some free time over the next few days/weeks to think about what the implementation of this might look like, or potentially to implement it, depending on how busy I am. If you do want to increase the priority of such a feature, you can do so via our existing support/account management channels.

Thanks, Violet

carlzogh commented 1 day ago

Thanks Violet - to clarify, we're not looking to run in an airlocked environment; however, we want to be resilient and allow "offline restarts" of Vault Proxy during temporary downtime / unavailability of Vault Server.