Support deploying central Vault Agent HTTP Caching Proxy

Freyert commented 2 years ago

Is your feature request related to a problem? Please describe.

Many dynamic secrets engines cannot support a high number of credential requests from replicated workloads. For example, if the Atlas Secrets Engine needed to provision 100 database credentials for 100 pods, this would likely lock out other vital automation in the Atlas environment, such as backups or scaling.

The solution to this issue is to run a Vault Agent as a Caching Proxy for credential requests. If all pods use a single k8s service account via the Vault Caching Proxy then the Vault Server only provisions a single instance of the dynamic credential for all 100 pods. The credentials are now "service account scoped" instead of "pod scoped".
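To make the arithmetic concrete, here is a minimal toy model of the idea, not Vault Agent's actual implementation. `ToyVault`, `ToyCachingProxy`, the token string, and the `database/creds/app` path are all hypothetical names chosen for illustration. The sketch assumes every pod presents the same token and ignores leases and TTLs.

```python
class ToyVault:
    """Hypothetical stand-in for a Vault server with a dynamic secrets engine."""

    def __init__(self):
        self.issued = 0  # number of credentials provisioned upstream

    def read_creds(self, token, path):
        # Every request that reaches the server provisions a new credential.
        self.issued += 1
        return {"username": f"v-app-{self.issued}"}


class ToyCachingProxy:
    """Hypothetical caching proxy that sits between the pods and the server."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}

    def read_creds(self, token, path):
        # Assumption: the cache key is the (token, path) pair.
        key = (token, path)
        if key not in self.cache:
            self.cache[key] = self.upstream.read_creds(token, path)
        return self.cache[key]


def simulate(pods, use_proxy):
    """Return how many credentials the server provisions for `pods` requests."""
    vault = ToyVault()
    client = ToyCachingProxy(vault) if use_proxy else vault
    for _ in range(pods):
        # Assumption: all pods share one identity, as the proposal describes.
        client.read_creds("shared-sa-token", "database/creds/app")
    return vault.issued
```

In this model, `simulate(100, use_proxy=False)` provisions 100 credentials, while `simulate(100, use_proxy=True)` provisions one that all 100 pods share.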

Describe the solution you'd like

Preferably, the helm chart would support a k8s Deployment that pushes out a cluster (replicated or not) of Vault Agent proxies behind a k8s service.

Currently, https://github.com/hashicorp/vault-helm/pull/749 attempts to add the Vault Agent Proxy as a sidecar for the CSI Provider. This provides no benefit for the Vault Injector. A standalone proxy would help both and give operators the control they need to confidently administer Vault workflows.

Describe alternatives you've considered

I've looked at a lot of "operators" that make k8s secrets from Vault, but that introduces a lot of moving parts, and we lose the air-gapped environment Vault is aiming to provide.

Additional Context

Vault Agent Injector

https://github.com/hashicorp/vault-k8s/issues/331
- Wants to be able to configure telemetry on each agent in a pod. This leads to a bunch of low-value time series for Prometheus, etc. A central proxy would be easier to configure and provide higher-value time series.
https://github.com/hashicorp/vault-k8s/issues/49
- Vault Agent proxy provides an extra layer of redundancy that can be used on top of an HA Vault.
- HA Vault is great, but is still vulnerable to misconfigurations.

Secrets CSI Provider

https://github.com/hashicorp/vault-csi-provider/issues/150
- CSI Provider doesn't respect auth token TTLs and makes too many requests
https://github.com/hashicorp/vault-csi-provider/issues/151
- Almost a duplicate of the above, but also indicates issues with secret rotation (on top of auth renewal).
https://github.com/hashicorp/vault-csi-provider/issues/149
- I believe the issue indicates that in a deployment with 50 pods, 50 credentials are deployed, but the CSI only uses the last credential deployed. A proxy would provision only one, and the CSI would use that.

Other Technical Advantages

In general, I think there are strong reasons to treat the Vault Agent Proxy as a standalone deployment:

1. HA/DR
   + Deploy multiple instances of a cache with topology-aware scheduling to be resilient against zonal failures.
   + Simpler runbooks: scale up or restart an individual component instead of a coupled component.
2. Monitoring
   + Monitoring all Injected Agents for the Vault Injector may be untenable for overloaded Prometheus instances.
   + A central cache establishes a good "bottleneck" to monitor the aggregate and then identify the issue.
3. Improve Cache Hit Rates
   + In large clusters it may be valuable to partition Vault Proxies by application to have smaller deployments with higher cache hit rates.
4. More Generic -> More Use Cases
   + Building the Vault Agent proxy into the injector or the CSI is a good idea, but a standalone instance can support more use cases.
   + More use cases means more improvements delivered to a smaller set of files in the code base.

Freyert commented 2 years ago

I was just checking to see if I had missed something, but the StatefulSet does indeed force you to use the vault server command.

New work would be needed to allow deploying Vault Agent. Would also probably be better as a Deployment instead of a StatefulSet.

tomhjp commented 1 year ago

> The credentials are now "service account scoped" instead of "pod scoped".

Just to note on this point, to get a cache hit on Agent currently, the token used for logging in has to be the exact same token. But in modern k8s versions every pod gets its own projected service account token with a different TTL/pod owner etc. So to get cache hits from different pods, we'd either have to engineer every pod using the same token (probably not tenable), or implement a feature in Agent that allows a cache hit based on some local token validation and service account matching, or some other similar feature that relaxes the requirements for a cache hit without risking impersonation by attackers.

That's not to say it's not possible, but it's a bit more work than it looks like upfront.
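The difference between the two cache-key strategies can be sketched with a toy model. This is an illustration only: the token fields (`jti`, `service_account`, `pod`), the helper names, and the `database/creds/app` path are hypothetical. `service_account_key` stands in for the relaxed matching feature described above, which Agent does not currently have. The token validation needed to prevent impersonation is deliberately left out, and the sketch simply assumes it has already happened.

```python
class CountingUpstream:
    """Hypothetical upstream that counts how many credentials it provisions."""

    def __init__(self):
        self.calls = 0

    def fetch(self, path):
        self.calls += 1
        return f"cred-{self.calls}"


def make_pod_tokens(n, service_account):
    # Each pod gets its own projected token: same service account, unique identity.
    return [
        {"jti": f"tok-{i}", "service_account": service_account, "pod": f"pod-{i}"}
        for i in range(n)
    ]


def exact_token_key(token, path):
    # Models the current behaviour described above: only the exact same token hits.
    return (token["jti"], path)


def service_account_key(token, path):
    # Hypothetical relaxed key. ASSUMES the token was already validated;
    # without that check, this key would allow impersonation.
    return (token["service_account"], path)


def upstream_calls(tokens, key_fn, path="database/creds/app"):
    """Return how many requests miss the cache and reach the upstream."""
    upstream = CountingUpstream()
    cache = {}
    for token in tokens:
        key = key_fn(token, path)
        if key not in cache:
            cache[key] = upstream.fetch(path)
    return upstream.calls
```

With 50 tokens from `make_pod_tokens(50, "app-sa")`, keying on the exact token gives 50 upstream calls (every request misses), while the hypothetical service-account key gives one.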

hashicorp / vault-helm

Support deploying central Vault Agent HTTP Caching Proxy #756