hashicorp / vault-helm

Helm chart to install Vault and other associated components.
Mozilla Public License 2.0

Support deploying central Vault Agent HTTP Caching Proxy #756

Open Freyert opened 2 years ago

Freyert commented 2 years ago

Is your feature request related to a problem? Please describe.

Many Dynamic Secrets engines cannot support a high number of credential requests from replicated workloads. For example, if the Atlas Secrets Engine needed to provision 100 database credentials for 100 pods, this would likely lock out other vital automation in the Atlas environment, such as backups or scaling.

The solution to this issue is to run a Vault Agent as a Caching Proxy for credential requests. If all pods use a single k8s service account via the Vault Caching Proxy, then the Vault Server provisions only a single instance of the dynamic credential for all 100 pods. The credentials are now "service account scoped" instead of "pod scoped".
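The intended effect can be sketched with a toy model. This is not Vault code: `FakeSecretsEngine`, `CachingProxy`, the token string, and the `database/creds/app` path are hypothetical names used only for illustration, and the sketch assumes every pod presents the same identity to the proxy.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSecretsEngine:
    """Stand-in for a dynamic secrets engine; counts the credentials it provisions."""
    issued: int = 0

    def provision(self, path: str) -> dict:
        self.issued += 1
        return {"path": path, "username": f"v-user-{self.issued}"}


@dataclass
class CachingProxy:
    """Toy proxy: identical (token, path) requests share one cached response."""
    backend: FakeSecretsEngine
    cache: dict = field(default_factory=dict)

    def read(self, token: str, path: str) -> dict:
        key = (token, path)
        if key not in self.cache:
            # Cache miss: only now does the backend provision a credential.
            self.cache[key] = self.backend.provision(path)
        return self.cache[key]


def credentials_issued(pods: int, use_proxy: bool) -> int:
    """Return how many credentials the engine provisions for `pods` readers."""
    engine = FakeSecretsEngine()
    proxy = CachingProxy(engine)
    for _ in range(pods):
        if use_proxy:
            # Assumption: all pods share one identity at the proxy.
            proxy.read("shared-service-account-token", "database/creds/app")
        else:
            # Without a proxy, every pod triggers its own provisioning call.
            engine.provision("database/creds/app")
    return engine.issued


if __name__ == "__main__":
    print(credentials_issued(100, use_proxy=False))
    print(credentials_issued(100, use_proxy=True))
```

Under these assumptions the backend is asked for 100 credentials without the proxy and for one credential with it.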

Describe the solution you'd like

Preferably, the helm chart would support a k8s Deployment that pushes out a cluster (replicated or not) of Vault Agent proxies behind a k8s service.
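As a rough sketch of what such a chart might render, the snippet below builds a Deployment and a Service as plain dictionaries and prints them as a JSON `List` that `kubectl apply -f -` could consume. The resource name, labels, config path, ConfigMap name, and port 8200 are assumptions made for illustration; they are not taken from the chart, and the Agent configuration file itself is not shown.

```python
import json


def proxy_manifests(name: str = "vault-agent-proxy", replicas: int = 2) -> list:
    """Build a Deployment and Service for a standalone Vault Agent proxy (hypothetical names)."""
    labels = {"app.kubernetes.io/name": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Assumed: a service account with this name exists.
                    "serviceAccountName": name,
                    "containers": [{
                        "name": "vault-agent",
                        "image": "hashicorp/vault",
                        # Run the agent instead of the server command.
                        "command": ["vault"],
                        "args": ["agent", "-config=/vault/config/agent.hcl"],
                        "ports": [{"name": "http", "containerPort": 8200}],
                        "volumeMounts": [
                            {"name": "config", "mountPath": "/vault/config"}
                        ],
                    }],
                    # Assumed: a ConfigMap holding agent.hcl is created separately.
                    "volumes": [
                        {"name": "config", "configMap": {"name": f"{name}-config"}}
                    ],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [{"name": "http", "port": 8200, "targetPort": "http"}],
        },
    }
    return [deployment, service]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": proxy_manifests()}
    print(json.dumps(bundle, indent=2))
```

Using a Deployment rather than a StatefulSet fits a proxy that keeps no persistent state, and the Service gives workloads a single stable address regardless of how many replicas run.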

Currently, https://github.com/hashicorp/vault-helm/pull/749 attempts to add the Vault Agent Proxy as a sidecar for the CSI storage engine. This provides no benefit for the Vault Injector. A standalone proxy would help both and give operators the control they need to confidently administer Vault workflows.

Describe alternatives you've considered

Additional Context

Vault Agent Injector

Secrets CSI Provider

Other Technical Advantages

In general, I think there are strong reasons to treat the Vault Agent Proxy as a standalone deployment:

1. HA/DR
   + Deploy multiple instances of a cache with topology-aware scheduling to be resilient against zonal failures.
   + Simpler runbooks: scale up or restart an individual component instead of a coupled component.
2. Monitoring
   + Monitoring all injected Agents for the Vault Injector may be untenable for overloaded Prometheus instances.
   + A central cache establishes a good "bottleneck" at which to monitor the aggregate and then identify the issue.
3. Improve Cache Hit Rates
   + In large clusters it may be valuable to partition Vault Proxies by application to have smaller deployments with higher cache hit rates.
4. More Generic -> More Use Cases
   + Building the Vault Agent proxy into the injector or the CSI is a good idea, but a standalone instance can support more use cases.
   + More use cases means more improvements delivered to a smaller set of files in the code base.

Freyert commented 2 years ago

I was just checking to see if I had missed something, but the StatefulSet does indeed force you to use the vault server command.

New work would be needed to allow deploying Vault Agent. Would also probably be better as a Deployment instead of a StatefulSet.

tomhjp commented 1 year ago

> The credentials are now "service account scoped" instead of "pod scoped".

Just to note on this point: to get a cache hit on Agent currently, the token used for logging in has to be the exact same token. But in modern k8s versions, every pod gets its own projected service account token with a different TTL, pod owner, etc. So to get cache hits from different pods, we'd either have to engineer every pod to use the same token (probably not tenable), or implement a feature in Agent that allows a cache hit based on some local token validation and service account matching, or some other similar feature that relaxes the requirements for a cache hit without risking impersonation by attackers.

That's not to say it's not possible, but it's a bit more work than it looks like upfront.
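The cache-hit requirement described above can be illustrated with a small sketch. It is not Vault Agent's actual indexing code: `cache_key` and `login_key` are hypothetical helpers, and the sketch simply assumes that a hit requires a byte-identical request, including the JWT sent to the Kubernetes auth login endpoint.

```python
import hashlib
import json


def cache_key(method: str, path: str, body: dict, token: str = "") -> str:
    """Hypothetical cache index: SHA-256 over the whole request, token included."""
    canonical = json.dumps(
        {"method": method, "path": path, "body": body, "token": token},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def login_key(jwt: str, role: str = "app") -> str:
    """Key for a Kubernetes auth login request carrying a pod's projected token."""
    return cache_key(
        "POST", "/v1/auth/kubernetes/login", {"role": role, "jwt": jwt}
    )


if __name__ == "__main__":
    # Placeholder strings stand in for real projected service account tokens.
    same_pod_hits = login_key("jwt-for-pod-a") == login_key("jwt-for-pod-a")
    other_pod_hits = login_key("jwt-for-pod-a") == login_key("jwt-for-pod-b")
    print("same pod, same token:", same_pod_hits)
    print("different pods, same service account:", other_pod_hits)
```

Because each pod's projected token differs, two pods under the same service account produce different keys in this model, so the second pod's login misses the cache. A relaxed matching rule would have to compare validated claims, such as the service account name, rather than the raw token.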