spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
SPIRE Server Scalability #533

Closed APTy closed 5 years ago

APTy commented 6 years ago

Creating an issue to track a recent convo @evan2645 and I had about "chaining" multiple SPIRE servers together. We may eventually want to spin this out into its own doc.

Description

This proposal seeks to extend the SPIRE Server in order to solve a set of scalability and availability concerns. It's easiest to illustrate with an example:

  1. A centralized SPIRE Server A (trust domain: aws.example.org) maintains/rotates a root key on an annual basis.

  2. Branched SPIRE Servers B, C, and D (also aws.example.org) may live in different availability zones, and maintain/rotate their intermediate keys on a shorter (say, weekly) cadence.

  3. Each branched SPIRE Server would have its own set of SPIRE Agents managing Workloads.
     a. If a single SPIRE Server can handle, say, 1,000 workloads, then this centralized-branching model theoretically squares the number of possible workloads to 1,000,000.
     b. This is because SPIRE Server A manages its 1,000 workloads, which are simply downstream SPIRE Servers B, C, ..., n, each of which has its own 1,000 workloads.
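
The arithmetic in step 3 can be sketched as follows. This is only a back-of-the-envelope model: the function name `max_workloads` and the assumption that every non-leaf server spends its whole capacity on downstream servers are illustrative, not something SPIRE defines.

```python
def max_workloads(per_server_capacity: int, tiers: int) -> int:
    """Upper bound on leaf workloads for a tree of SPIRE servers.

    Assumes (hypothetically) that each server can manage
    `per_server_capacity` children, and that servers above the
    bottom tier manage only downstream servers.
    """
    if per_server_capacity < 1 or tiers < 1:
        raise ValueError("capacity and tiers must be positive")
    return per_server_capacity ** tiers


# One server on its own versus root Server A plus one tier of branches.
print(max_workloads(1000, 1))  # 1000
print(max_workloads(1000, 2))  # 1000000
```

In practice the bound would be lower, since the figure of 1,000 per server is itself only an example.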

What does this solve?

Scalability: This creates a path for SPIRE Server to scale horizontally, which increases the capacity of the system.

Availability: Removing the central SPIRE Server A from the critical issuance path of most CSRs lets each branch operate independently, which improves the availability of the system.

Proposed work

  1. Create an UpstreamCA plugin that can make CSRs to an upstream SPIRE server.

  2. Update the registration entry schema to allow for a CA workload to be registered

  3. Update the CSR authorization logic to allow signing of certificates with the CA bit set

  4. Add ability to register a CA workload through the spire-server CLI
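
Items 2 and 3 could look roughly like the sketch below. It is a toy model, not SPIRE code: `RegistrationEntry`, `SigningRequest`, the `ca` flag, and `authorize` are all hypothetical names, and `wants_ca` merely stands in for the CA bit of a real CSR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistrationEntry:
    spiffe_id: str
    ca: bool = False  # hypothetical flag marking a "CA workload" (a downstream server)


@dataclass(frozen=True)
class SigningRequest:
    spiffe_id: str
    wants_ca: bool  # stands in for the CA bit in the requested certificate


def authorize(request: SigningRequest, entries: list[RegistrationEntry]) -> bool:
    """Allow a request only if a matching entry exists, and allow the
    CA bit only when that entry is registered as a CA workload."""
    for entry in entries:
        if entry.spiffe_id == request.spiffe_id:
            return entry.ca or not request.wants_ca
    return False  # unregistered identities are never signed


entries = [
    RegistrationEntry("spiffe://aws.example.org/spire-server-b", ca=True),
    RegistrationEntry("spiffe://aws.example.org/web"),
]
print(authorize(SigningRequest("spiffe://aws.example.org/spire-server-b", True), entries))
print(authorize(SigningRequest("spiffe://aws.example.org/web", True), entries))
```

The first request is allowed and the second is refused, because only the entry for the downstream server carries the CA flag.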

evan2645 commented 6 years ago

Hey @APTy

This is definitely an interesting proposal! To summarize, you'd like SPIRE servers to be arranged hierarchically within a single trust domain, correct? I wonder if this work can also realize hierarchical arrangement of different trust domains, but perhaps that is a problem better left until later.

Since these servers would be in the same trust domain, I imagine you would expect workloads being managed by Server B to be able to authenticate (by default) workloads which are managed by Server C? We currently have an upstream_bundle configurable which includes upstream CA certificates in the bundle that SPIRE server distributes, but I don't think this alone will be good enough. Workloads will need to know the chain for their SVID in order to serve the correct intermediates... The Workload API can deliver an intermediate chain along with an SVID, but this functionality is currently not implemented... so we'd probably need to add a step 5 to the proposed work: update the Workload API to deliver the correct intermediate chain when upstream_bundle is in use.

SPIRE server (technically) supports HA when backed by a distributed datastore. This approach however tightly couples the instances, and they would of course share a failure domain as it relates to the datastore layer... I assume you are looking for looser coupling than this?

One of the unsolved problems in SPIRE HA is secure bootstrap and management of auto-scaling SPIRE servers. We have mulled solutions which involve new SPIRE servers going through node attestation in order to be issued signing certificates instead of leafs. Perhaps the proposed work can address this problem as well.

myidpt commented 6 years ago

subscribe

deepak-vij commented 6 years ago

Hi folks, we are trying to use the hierarchical SPIRE server design as part of our edge cloud Identity Management & Authentication mechanism. Just as a quick background, a typical edge cloud environment may include a centralized/datacenter cloud along with 1000s of edge clouds. Each of these edge clouds may be associated with a nearby gateway server. The connectivity between edge clouds/gateways and the centralized cloud may be intermittent. In such cases, edge clouds/gateways may work in offline mode.

In order to manage “Availability” in such an environment, we are thinking of doing the following:

• Deploy the main SPIRE server along with the Root CA at the centralized cloud level.
• Each gateway server will host the intermediate SPIRE server along with the intermediate CA.
• The intermediate SPIRE is the actual SPIRE server that issues out workload identities (SVIDs).
• In order for services to communicate with each other across intermediate SPIRE servers, a trust chain is established to the Root CA. All intermediate SPIRE servers trust the Root CA.

With that as a brief background, based on my understanding by reading through this thread, the only thing that currently does not exist is the Workload API to deliver the correct intermediate trust chain when “upstream bundle” option is in use.

“Upstream bundle” configuration option already allows inclusion of upstream CA certificates (in this particular case, Root CA) in the bundle that intermediate SPIRE server distributes.

It would be great to get your feedback on all this. Thanks.

Regards, Deepak Vij

evan2645 commented 6 years ago

> • Each gateway server will host the intermediate SPIRE server along with the intermediate CA.
> • The intermediate SPIRE is the actual SPIRE server that issues out workload identities (SVIDs).
> • In order for services to communicate with each other across intermediate SPIRE servers, a trust chain is established to the Root CA. All intermediate SPIRE servers trust the Root CA.

I assume this would all be a single trust domain, correct? A workload under gateway server A will present gateway server A's intermediate, and a workload under gateway server B will present gateway server B's intermediate, such that both workloads live in the same trust domain and trust is established through use of the common root (i.e. the centralized SPIRE cluster)... is that what you're thinking?

> With that as a brief background, based on my understanding by reading through this thread, the only thing that currently does not exist is the Workload API to deliver the correct intermediate trust chain when “upstream bundle” option is in use.

Yes, this is work we will need to do... We have had more discussion about it recently, and I think we have a workable solution, we just need to get some cycles to implement it.

There is a second piece of work that would be required, which is extending SPIRE server so that it can attest other SPIRE servers, and issue them intermediate CA certs (instead of only issuing leaf certificates). I think we have some idea on how that could work, but it's a little less fleshed out than fixing the upstream chain logic.

> “Upstream bundle” configuration option already allows inclusion of upstream CA certificates (in this particular case, Root CA) in the bundle that intermediate SPIRE server distributes.

The current option unfortunately won't work the way you'd expect it to, because it lumps all the certificates (upstream root + all intermediates) into a single bundle. Instead, we need to track the intermediates appropriately, and serve them up to workloads as part of a chain.
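
To make the bundle-versus-chain distinction concrete, here is a toy model with no real cryptography: certificates are reduced to subject/issuer links, and `Cert` and `verify` are hypothetical names. It assumes the verifier trusts only the root and must be handed the intermediates by the peer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    is_ca: bool = False


def verify(leaf: Cert, presented: list[Cert], trusted_roots: list[Cert]) -> bool:
    """Follow issuer links from the leaf up to a trusted root, using only
    the intermediates the peer presented. Signatures are NOT checked."""
    intermediates = {c.subject: c for c in presented if c.is_ca}
    roots = {c.subject for c in trusted_roots}
    current, seen = leaf, set()
    while current.issuer not in roots:
        parent = intermediates.get(current.issuer)
        if parent is None or parent.subject in seen:
            return False  # missing link or loop: no path to a root
        seen.add(parent.subject)
        current = parent
    return True


root = Cert("server-a", "server-a", is_ca=True)
inter_b = Cert("server-b", "server-a", is_ca=True)
svid = Cert("spiffe://aws.example.org/web", "server-b")

print(verify(svid, [inter_b], [root]))  # True: chain served with the SVID
print(verify(svid, [], [root]))         # False: verifier only knows the root
```

A workload under Server C that trusts only the root can validate the SVID in the first call because Server B's intermediate arrives as part of the chain, and fails in the second call because it does not.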

deepak-vij commented 6 years ago

Thanks Evan for your detailed response. Yes, our deployment at this time is using a single trust domain. We plan to add support for multiple autonomous domains (federation) subsequently.

Everything you mentioned makes perfect sense. Yes, we do intend to build all this out. I will reach out to you to discuss more. Thanks.

azdagron commented 5 years ago

This feature is code complete, but the built-in plugin is not currently enabled in master. We need to do some more end-to-end testing before we turn it on.

APTy commented 5 years ago

@azdagron good to close this as resolved?

evan2645 commented 5 years ago

Woohoo, this has been done for a while now... thanks all!