hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Improve the Vault CA Provider's resiliency for leaf signing #11685

Open GordonMcKinney opened 2 years ago

GordonMcKinney commented 2 years ago

Feature Description

Add a resiliency feature to the Vault CA provider to sign leaf certificates without using Vault.

Use Case(s)

Consul Connect client agents require a new leaf certificate (auto_encrypt) prior to joining the service mesh. When Vault is unavailable, the client will not receive a leaf certificate from the server cluster's leader. As a result, service owners' scale-up events will not increase capacity, seriously impacting user experience.

Analysis:

The following documentation indicates a datacenter generates its own intermediate CA. The side effect is a highly resilient architecture.

This is not the case with the Vault CA provider. The leaf signing function has a real-time hard dependency on Vault for all leaf certificate signing.

I am proposing that the Vault CA provider behave like the built-in CA provider: generate an intermediate signing certificate (signed by the Vault root CA) and use it for all leaf signing requests. This would improve resiliency and substantially reduce latency for new client agents.

dnephin commented 2 years ago

Thank you for opening this issue!

With the Vault CA provider, the private key is stored in Vault. Consul does not have access to the private key used to sign leaf certificates, so I don't think it is possible for Consul to sign leaf certificates without access to Vault. It is my understanding that this is the primary reason to use the Vault provider: the added security of storing the private key in Vault.

I think you make a great point about requiring access to a Vault instance that may be in a different region. I opened #11159 a little while ago, which I think might address this problem. That would allow secondary DCs to use a separate Vault instance, one that is in the same region as the Consul server. I believe that would make the statement in the docs more accurate.

What do you think of that approach? Would that address the problem sufficiently?

GordonMcKinney commented 2 years ago

I like the idea proposed in #11159, but it still requires access to Vault for signing the leaf certificates. That signing process is in the critical path of a service's initialization process. We have had instances where Vault was partitioned from Consul due to FW ACL misconfiguration or AWS DX link issues.

In my mind, the root CA remains in Vault with a hidden private key. That root CA can sign CSRs, for example a server intermediate CSR. Thus, you create an in-memory intermediate key pair for signing leaf certificates. Decoupling leaf signing from Vault increases resiliency and aligns with the architectural goal outlined here.
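
To make the flow concrete, here is a minimal sketch in Go using only the standard library. It is not Consul's or Vault's actual code: the root key is generated locally as a stand-in for the key Vault would hold, and the names `buildChain`, `verifyLeaf`, `issue`, and `tmpl`, plus the subject names, are made up for illustration. The point is only that the root key is needed once (to sign the intermediate CSR), after which leaf signing uses the in-memory intermediate key alone.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// tmpl builds a short-lived certificate template (hypothetical helper).
func tmpl(serial int64, cn string, isCA bool) *x509.Certificate {
	t := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		t.KeyUsage = x509.KeyUsageCertSign | x509.KeyUsageCRLSign
	} else {
		t.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth}
	}
	return t
}

// issue signs t for pub with the parent's key and returns the parsed certificate.
func issue(t, parent *x509.Certificate, pub *ecdsa.PublicKey, parentKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	der, err := x509.CreateCertificate(rand.Reader, t, parent, pub, parentKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// buildChain creates root -> intermediate -> leaf.
func buildChain() (root, inter, leaf *x509.Certificate, err error) {
	// Assumption: this local key stands in for the root key that never leaves Vault.
	rootKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	rootT := tmpl(1, "root (stand-in for Vault)", true)
	if root, err = issue(rootT, rootT, &rootKey.PublicKey, rootKey); err != nil {
		return nil, nil, nil, err
	}

	// Server side: in-memory intermediate key pair and a CSR for it.
	interKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	csrDER, err := x509.CreateCertificateRequest(rand.Reader,
		&x509.CertificateRequest{Subject: pkix.Name{CommonName: "dc1 intermediate"}}, interKey)
	if err != nil {
		return nil, nil, nil, err
	}
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, nil, nil, err
	}
	if err = csr.CheckSignature(); err != nil {
		return nil, nil, nil, err
	}
	csrPub, ok := csr.PublicKey.(*ecdsa.PublicKey)
	if !ok {
		return nil, nil, nil, errors.New("unexpected CSR key type")
	}

	// The only step that needs the root key: signing the intermediate CSR.
	if inter, err = issue(tmpl(2, csr.Subject.CommonName, true), root, csrPub, rootKey); err != nil {
		return nil, nil, nil, err
	}

	// Leaf signing uses only the in-memory intermediate key.
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	leaf, err = issue(tmpl(3, "web (example service)", false), inter, &leafKey.PublicKey, interKey)
	return root, inter, leaf, err
}

// verifyLeaf checks that the leaf chains to the root through the intermediate.
func verifyLeaf(root, inter, leaf *x509.Certificate) error {
	roots, inters := x509.NewCertPool(), x509.NewCertPool()
	roots.AddCert(root)
	inters.AddCert(inter)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: roots, Intermediates: inters})
	return err
}

func main() {
	root, inter, leaf, err := buildChain()
	if err != nil {
		panic(err)
	}
	fmt.Println("leaf chains to root:", verifyLeaf(root, inter, leaf) == nil)
}
```

In this sketch a peer that trusts only the root can still validate the leaf, as long as the intermediate certificate is presented alongside it. The trade-off raised earlier in the thread still applies: the intermediate private key lives in server memory rather than in Vault.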

GordonMcKinney commented 2 years ago

Plus, Consul has code to do this already:

The Vault CA provider needs to send the Intermediate CSR to Vault instead of applying it to the built-in CA.