connect: add SPIFFE Workload API and SPIRE-like attestations support

hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

https://www.consul.io

Other

28.45k stars 4.43k forks source link

connect: add SPIFFE Workload API and SPIRE-like attestations support #6836

Open gpburdell56 opened 5 years ago

gpburdell56 commented 5 years ago

Feature Description

Enhance Consul Connect to support SPIFFE Workload API (https://github.com/spiffe/spiffe/blob/master/standards/SPIFFE_Workload_API.md) and to support attestation processes similar to what’s implemented in SPIRE (https://spiffe.io/spire/concepts/).

I understand that Consul already support the SPIFFE SVID format today. However, Consul does not support SPIFFE Workload API, which mandates the absence of direct client authentication to assign identity to the calling workload.
I think supporting SPIFFE Workload API would simplify Consul configuration management by not requiring to specify service definitions in the config files.
For example, when an application such as Envoy launches, instead of calling Consul Connect for the private key of “app1”, the application, aka the “workload”, would ask “who am I?” to a local Consul agent running in client mode via UDS. The local Consul agent would be responsible to perform an out-of-band authenticity check to determine which identity this “workload” should be assign to and return a matched SVID. On linux, such check could be based on the agent interrogates the node’s kernel to identify the process ID of the caller. The caller and the Consul agent must run on the same node. The process ID would then be used to lookup other info such as k8s kubelet or an AWS Instance identity document. These additional info would be used to match a pre-registered service definition from the catalog.
In this model, the identity of a workload is lazily provisioned at runtime only based on where the workload is running on, and what identity that environment had be registered to.
Personally, I find this model to be compelling because a workload cannot spoof to take on a service identity if it’s running in the wrong environment. In addition, I don't need to embed every details about a service definition in the Consul agent config files.

Use Case(s)

The Workload Attestation and Node Attestation use cases described below expands the traditional role of a service registry to also track "workloads" and "nodes". A "workload" is more similar to a service definition but can also represent batch jobs and serverless functions that are not always on. A "node" maps more closely to a Unix server, a k8s service account, or AWS instance identity. Expanding the service registry scope to track these additional components helps to better managing IT resources across more distributed, dynamic environments.

Workload Attestation - when a workload application such as Envoy calls Consul Connect via the gRPC ADS interface (assuming this is accessible via an agent in client mode), the agent will use the caller's process id to determine a matching registered service definition and return the workload identity as an SPIFFE SVID.

The matching/ workload attestation process is implemented based on a pluggable framework similar to what's implemented in SPIRE. Consul agent clients will be able to use plugins specific to the hosting environment (e.g. Unix, k8s, or AWS) to gather additional information based on the process Id such as uid, gid, or k8s namespace. The information will be compared against metadata in the registered service definitions. Only Consul agent client mode is needed to participate in workload attestation. Consul server should not be needed.

Node Attestation - this process is used for a consul agent client to obtain an SVID for itself from its Consul server. In this model, the consul agent client represents a node identity that's different from the workload processes. To support this, Consul server needs to also support a set of node attention plugins.

I’m interested to work on this feature and eventually submit a pull request. Before I dive in, is this enhancement desirable for the Consul ecosystem? Thanks!

banks commented 5 years ago

Hi @gpburdell56

Thanks for the detailed proposal and offer to contribute.

We've followed the SPIRE and Workload API specs pretty closely and reviewed them a few times and decided not to implement them for a few reasons. One of the biggest ones is complexity, another huge one is that it's very bad practice and a security liability to run consul agents as root which means that the majority of "smart" things the SPIRE node attestation plugins do on unix variants are ruled out - you can't inspect PIDs and doing privileged things with unix domain sockets and similar all become much harder.

What we did choose instead as a stepping stone towards more pluggable methods of authenticating workloads was build AuthMethods into Consul 1.5.0 https://www.consul.io/docs/acl/acl-auth-methods.html. Right now only Kube is supported but there is work planned to add support for Vault and in the process any other system that can provide JWTs in a reasonable format for attesting service identity.

When that's done, it would be possible to build a custom node attestation scheme like SPIREs or anything else you imagine as a third-party utility, provided it can create a JWT that the platform can then use to authenticate with Consul.

That also opens the door to using Vault to attest identity which already has mechanisms to identify workloads based on Cloud provider metadata and many other options.

I’m interested to work on this feature and eventually submit a pull request. Before I dive in, is this enhancement desirable for the Consul ecosystem?

Perhaps but we'd prefer to have it based on an external tool that integrates via JWT/Auth Method at least initially (also pending some planned work on another AuthMethod that allows this). Do you have a need for this workflow and already use Consul?

gpburdell56 commented 5 years ago

Thank you Paul for your thoughtful response. We are interested to use Consul with Envoy, where a Consul client can be used as an Aggregated Discovery Service to configure Envoys and track which ones are actively running in our environments.

I agree we shouldn't require to run Consul agent with elevated privilege. However, our goal here is to only care about a workload app that's directly connected to a local Consul client via UDS, and we should be able to obtain the caller's pid or uid from the other end of the socket without running as root, right? (https://stackoverflow.com/questions/8104904/identify-program-that-connects-to-a-unix-domain-socket)
Thank you for sharing about the existing ACL Auth Methods capability. I was hoping to not require the workload app (e.g. Envoy) to authenticate with Consul first in order to obtain an ACL token. Our design goal is to keep our app config to a minimum such that it does not have an identity until Consul assign it one.
So should I implement a new ACL Auth Method that does not require the workload application to pass in a bearer token? it's only supported via a local UDS/gRPC connection, and the Consul client will resolve the caller's uid from the socket and combine that with additional node specific info to attest with a Consul server for matching service identities.

Thanks! Simon

banks commented 5 years ago

Hi Simon,

Apologies for the length here and any possible repetition. I'm trying to make sure the problem is well understood as I think there is a good opportunity to improve things but we all need to understand the limits of the problem the same way to do that!

...

I think that restricted use-case of authenticating by UDS is interesting. But it's an incomplete solution... more below.

... Our design goal is to keep our app config to a minimum such that it does not have an identity until Consul assign it one.

Consul can't assign identity! It can only convert one form of trust into Consul trust. Even UDS and process user ids are a form of identity distributed by config/operational tooling. More on this below...

So should I implement a new ACL Auth Method that does not require the workload application to pass in a bearer token? it's only supported via a local UDS/gRPC connection, and the Consul client will resolve the caller's uid from the socket and combine that with additional node specific info to attest with a Consul server for matching service identities.

The problem with UDS auth is that it only authenticates the process to another local process. That works in SPIRE only because the local process is already trusted and has an identity rooted in some other node attestation scheme (e.g. AWS meta data).

In Consul, clients are not blindly trusted - they only have permission to do what their ACLs allow. We could build a way to authenticate to the local agent via UDS and process ID but then the local agent would need a permissive ACL that could register any service name it wants and that would need to be stored on disk (because agents can restart) on every host in the DC which pretty much destroys Connect's threat model - a compromised host can always impersonate any identity in the mesh, effectively bypassing all Intentions. You could only assign the agent and ACL with service Identities that are specifically allowed on that agent, but then you are back to where you started - needing a unique ACL for each instance with just the right privs. You just moved it to be in agent config not in the service definition.

So whether we built UDS/process ID auth into Consul clients or into a plugin that issues JWTs for use with Auth Method, you still have the fundamental trust issue of how do you trust that client/plugin is authorised to create an ACL or assign that specific identity in the first place?

It should be pointed out that UDS+Linux User perms is not really "Consul assigning identity" any more than an ACL or JWT is - all of ACL, JWT and Unix permissions are an out-of-band distributed credential that Consul is simply converting into a Connect certificate in some way. So while it seems more magical since all process on linux have user permissions already it's not a fundamentally different pattern, and importantly it only conveys identity in a way that is unverifiable outside of the host which means you still need another form of authentication to know you can trust the process that is verifying the unix permission, as well as some external authorization for what service identities that agent is allowed to register!

JWTs and ACL tokens on the other hand can be verified by servers having passed through an agent which allows them to securely convey trust from the trusted platform that issued them to the Consul servers that actually sign certificates and so distribute identities in the Connect mesh.

How does SPIRE solve this?

SPIRE has the same fundamental limitations as Consul here. Workload attestation is only meaningful once there is a local process that has already gone through node attestation to prove its identity.

But even then if you had a blanket policy that said "any process running as user www can register in Consul as 'web' service" you'd have the same problem as outlined above. A compromised node can simply start a malicious process as whatever UID it wants and so get authenticated even if the host compromised should never have been a web server and so elevate access to anything that web servers can access. The threat model is actually even more subtle here but I won't digress.

The way I can see that SPIRE can protect against this is by allowing "workload selectors" to map a service identity only to specific node identities as well as certain unix users. In practice that means that you might not have a credential on disk to manage like with Consul, but you still have to build tooling that knows which nodes every workload will be on, and then creates the perfect whitelist of node identities in SPIRE to only allow processes on those nodes to get that identity. There might be smart integrations to do this, but it's essentially the same problem as needing a smart integration with Consul to convey the set of trusted identities for a node via a token or similar.

We could build a whole new mechanism that is more like SPIRE for this into Consul where we allow you to centrally configure a white list of agent node names that are allowed to request certs for a given service. Then the UDS scheme with agent could be used in connection with restrictive agent ACLs... but that has just moved the problem from distributing restrictive Consul ACLs to workloads to distributing the same thing to agents (in practice it's just moved which part of the Consul config file it goes in!).

What Next?

I'd still love to hear more about your situation as we are certainly still very actively searching for ways to make trust bootstrapping simpler, but it's a fundamental problem that at some point there needs to be some identity assigned by a trusted platform component that can be used to bootstrap trust in a processes identity.

Are you running on bare metal? VMs? Kube? Nomad? Are your workloads in Containers? If yes are proxies deployed in same network namespace (i.e. pod or container), are Consul clients in the same network namespace? Are there multiple workloads with different service identities on the same VM/host? Do they share a Consul agent? These would all be super useful to set some context around your proposal.

Finally, I should point out that I think we are both searching for the same goal here - the simplest possible trusted setup for services. It's definitely our intention that applications themselves shouldn't have to reason about this, but we do need some about of tooling to integrate with whatever trusted platform is providing the basis of the identity assignment and JWT seems the simplest and most universal interface to do that by for now.

gpburdell56 commented 5 years ago

Hi Paul, Wow, thanks! I agree we share the same goal to simplify trusted setup of services. I work for a large enterprise that currently runs most services on VMs in our own data centers. We are evaluating K8s, and I'm researching how to evolve our tech governance process for both on-premise and public cloud environments. Typically, the workloads on the same VM would assume the same service identity. We are still experimenting with Consul and haven't yet settled on whether each VM would have a dedicated Consul agent.

I agree with your insight that SPIRE vs Consul is like moving the trust problem from "distributing restrictive Consul ACLs to workloads" to "distributing the same thing to (node) agents". I think the key question here is whether it's worthwhile to know which nodes every workload will be on, and then define the whitelist rules to only allow processes on those nodes to get a particular workload identity. It certainly introduces another layer of node attestation complexity in addition to the existing workload attestation.

Our "nodes" are becoming more diverse. We have VM nodes, potentially K8s nodes, and eventually public cloud compute instances. I think we'll need to build tooling to track which nodes every workload will be on regardless of whether that information is used to provision workload identities. We need this information to plan upgrades and cost optimization initiatives.

Looking forward, we are considering to use Envoy as a Data Plane, but we would also need a distributed Control Plane with agents that can configure Envoy. If this agent is deployed locally to each node, then we can perhaps also use it to gather node information and to perform attestation. This is where I'm exploring whether a Consul client can be used for these purposes.

We are also interested to use something like Consul or SPIRE to assign our "enterprise service identities" across our legacy and hybrid cloud environments. I agree that using UDS+Linux User Perms to exchange for an enterprise identity is not fundamentally different from using a JWT to exchange for an enterprise identity. However, looking at https://www.consul.io/docs/acl/auth-methods/kubernetes.html, I feel uncomfortable to bake a token into a config file. What expiration would be configured for this token?

gpburdell56 commented 4 years ago

Hi, Just to update you on my thinking around this topic, I'm still interested to experiment with adding SPIRE-like attestation support to Consul as it also solves some of my enterprise governance objectives (e.g. track what are my active nodes in my hybrid cloud environment and track what workloads are running on those nodes).

The following sequence diagram illustrates what I currently have in mind. I understand this is not materially different from the existing Consul ACL capabilities. If the Consul community one day decides to support the SPIRE-like attestation pattern, is this design the right approach?

I've introduced this additional component called "Governance API" that allows me to implement some of the attestation features outside of Consul (in case my pull request gets rejected :))
I also think Consul ACL Token can still be used to help establish trust between a Consul agent and a Consul server before node attestation completes.
The workload I intent to support is Envoy.

Thanks!

attestation workflows (2)

banks commented 4 years ago

Hi @gpburdell56 thanks for the detailed responses!

Overall your proposal seems like a workflow that could work, but it's significantly different from the workflow Consul is assuming so far.

At a very high level Consul is build around a model where there is some external platform (or platforms but I'll make it singular for brevity) provisioning work. There is some external source of truth about what should run where and what it needs to access. A lot of design in Consul relies on this model in subtle ways so changing that is a bigger step than it might appear at this high level architecture diagram stage 😄. To contrast the two approaches I'll list those assumptions:

Workloads are provisioned by a platform (scheduler, config management, terraform + cloud etc) which is the source of truth for what is running where and it's "identity"
The platform needs to provide credentials to the workload that let it do it's job - that might be accessing a DB or authenticating to Consul
Ultimately service identity in the platform eventually comes back to trusting humans - which humans are allowed to define access rules or setup new services etc. All platforms have already solved this in some way.
Consul assumes you already have a big investment to secure all of this through existing tooling so we want to make it easy for those existing platforms that already know how to encode rules about who can do what into service identities for the mesh.
- Note that if you don't already have the runtime platform locked down, it's hard for us to build any meaningful security in the mesh as there is almost certainly a way to bypass that by reconfiguring something at platform level.

Your proposal has a different workflow more like:

Stuff is running in places, there is no current source of truth for if it should be there.
Gather requests for access from running stuff and get a human to vet them
If approved, grant certificates

I can understand why that workflow might be necessary if you have a huge sprawl of legacy systems that are not all provisioned by one platform etc. but it seems very problematic to me in practice:

This seems like legacy cleanup task rather than a desirable future-looking model. The whole idea behind schedulers and enabling workload portability is to remove humans from the loop so I don't see how this would work realistically for modern architectures, unless you automate the vetting process. And if you do you are basically back to the start of the discussion - how do we know what should be where and if we can trust the request? I don't think it's right to build so much machinery into Consul to help automate this kind of cleanup process - it may well be a legitimate need but it seems better suited to a dedicated tool than to treat it as a first-class workflow for building secure infra to me.
This is the big one: How can the human or system verify that the node_attestation_request is genuine? Notice that this is the same problem we started with! For example if you are trying to use this to attest unique node names that operators can understand then you need cryptographic proof from the provisioning platform that that node was assigned that name. So you've not really solved attestation - that's still needed and tied to the provisioning platform, you've only really made a workflow for humans to observe workloads and easily create rules allowing things that "look OK"
Assuming we built all the machinery described above to have the platform securely attest to the node identities and then the "Governance API" to verify that before human approval. How will a human securely vet a request? This feels very open to social engineering - sneaking in bad workloads that "look plausible" etc. How would that be detected etc?

It seems the problem you are trying to solve for "discovering what is there and building secure policies" is a valid one, but I don't think rewriting Consul's security model around it is the best solution.

For example instead you could run Consul without Connect first to "learn" where everything is, have human operators vet that list and come up with the right policies for those workloads, and then enforce them by provisioning ACL tokens or JWTs to those workloads, following the existing model.

I feel uncomfortable to bake a token into a config file. What expiration would be configured for this token?

Hmm I think it would help if you express your threat model clearly. Often credentials in files are brought up as concerns but if you work through concrete threat models it's easier to understand whether this is a valid design or not. I wrote a few paragraphs about why "on disk" is often something people react to without being able to articulate a threat model where it actually matters but I think you'd understand that.

Note though that even if you never put tokens in a file and only use Consul's API, we still have to write them to disk on every client to support restarting the client without disrupting everything - there is a mode to disable this but you have to jump through an extreme number of hoops to make that work including solving the first-secret problem yourself some other way such that a "trusted" process can retrieve the secrets and deliver them back to consul after a restart etc.

Note that Consul doesn't assume the JWT used for AuthMethod is on disk anyway - to use auth method something has to use a JWT from somewhere to get an ACL token via the API. It's up to the integration whether the JWT or eventual ACL token ever make it to disk.

The interesting part of the question though isn't the "disk" part but is around expiry. That is a great question.

In both cases JWTs and ACLs support expiry. We don't yet support rotating JWTs in Kube although it was designed and is on the roadmap to complete. The issue here is that the trusted entity that provisioned the JWT is the thing that is responsible for re-asserting that it's still valid and supplying a new one in a timely way (to support it having a short TTL), and then the integration would detect this somehow (file watch or API hook etc) and re-perform the login to get a new ACL token with the same short TTL.

Forgive me please if I've still misunderstood part of your proposal - I really appreciate your input on thinking about this.

gpburdell56 commented 4 years ago

Hi Paul, thank you again for the thoughtful feedback!

I am indeed trying to design a governance solution for an environment with platforms and systems of different flavors and generations. They do all run in a trusted zone (e.g. our own data centers today with addition of public cloud soon). However, they are owned and tracked by different teams and different tools. there's not a centralized, up-to-date view of these systems and their dependencies. And the identity mediation required to integrate them has also been a growing pain point.
- I'm exploring this idea of using Envoy and Consul to create such a centralized view.
- I'm also exploring using SPIFFE protocol to provision "enterprise identities" on top of "platform specific identities" since our platforms are becoming more heterogeneous.
From a product road map perspective, I do think Consul has a unique opportunity to help larger IT organizations identify legacy systems and across legacy + modern platforms. But I digress.
I agree with your point on "_how can the human or system verify that the node_attestationrequest is genuine". My design proposal has actually changed a lot since my last posting on 12/4.

Here's the latest version:

At configuration time, I'd like to use the existing Consul workflow to provision agent policies and agent tokens and use the agent tokens to bootstrap my runtime node attestation process. In a way, I'm treating the Consul agent token like an OAuth refresh token used to acquire access tokens (node svid)
when a Consul client starts, it'll perform node attestation and fetch both the node_svid (for itself) and workload_svids it can attest to from a Consul server.
when a workload starts, it'll go to its Consul client to fetch its workload_svid. At that time, Consul client will perform a platform specific workload attestation to select and return the correct workload_svid from its local store.

attestation workflows (3)

I have more detailed sequence diagram of how these operations can be implemented (based on my study of the Consul source code and SPIRE source code so far). However, before we get into that level of detail, does this version of the proposal look promising to be considered into future Consul identity provision workflow?

Thanks! Simon

banks commented 4 years ago

Hi Simon,

From a product road map perspective, I do think Consul has a unique opportunity to help larger IT organizations identify legacy systems and across legacy + modern platforms. But I digress.

Oh we absolutely agree there! I think the questions isn't so much if this general space is a problem Consul can help with and more a question of if it's right to solve it in the specific way you are imagining "online" through a complex and flexible identity management system built into Consul vs. solve it in a way more like my proposal where Consul helps organisations understand what's running and then implement a solution to secure it incrementally by plugging into the relevant bits of platform tooling that are already responsible for managing identity.

You're new proposal seems interesting but still has a couple of impedance mismatches with Consul's current model:

Agent and Workload identities in Consul are much more flexible than this assumes. For example, it's central to Consul's current model that agent's are the source of truth about what is registered where. This is why you register services with a local agent not with the central servers. Step one of your proposal inverts that - there is not service registry entry until the agent is up and has a service registered to it.
- Changing this would massively change many assumptions in Consul as well as the entire UX of deploying services. We've considered it but it's just not clear that it's viable without essentially rewriting Consul from scratch with a totally different set of assumptions.
  - But I don't think your proposal necessarily needs that huge model change - lets continue by assuming that the step one is creating some sort of policy in Consul that is not directly attached to existing registrations but that will be matched against later (FWIW, this is pretty much exactly what AuthMethod Rule Bindings are).
The "agent token" and "node svids" here are essentially exactly the same as out existing "auto-encrypt" feature. We use an ACL token to automate provisioning of an agent-specific SPIFFE certificate. The only thing that certificate proves though is that the requester had access to an ACL with node:write for the node name in the cert (same as your proposal though).
The relationship between agent and service is pretty much possible already - If operators choose to trust all processes that can access localhost API to have a single identity then they can issue agent tokens that encode not just node:write for the node name but also service:write for the authorized service. If they do that then anything that can speak to localhost - even without an ACL token) can obtain a "workload" i.e. Connect cert for that service and the model works with essentially the same UX and threat model you are proposing already.

I think the complications with the proposal are:

Many (most?) users currently don't have a single agent-per-workload model and so they can't just assign the agent a token that allows it to represent all necessary services and then just trust the localhost processes are well-behaved. In your proposal, this would be the same problem unless additional authentication methods are added for workload attestation as we discussed previously. This might be possible but I think it adds a considerable extra complexity in both UX and code and it's not yet super clear to me that it's better than delegating to a trusted platform for workload identity in the first place.
The workload SVIDs you mention as having both a logical service as a SPIFFE ID and a parent ID. How do you propose that works? In SPIFFE spec, SVIDs represent a single identity that must be encoded into a single URI. So workload SVIDs can either be a logical service identity only (e.g. spiffe://<trust-domain>/<namespace>/<service>) or can embed hierarchical information into them, (e.g. spiffe://<trust-domain>/<datacenter>/<node>/<namespace>/<service>). We actually currently have somewhere in between those - we encode DC but not node names. We considered node names too and could add them in if there was a strong benefit but philosophically we took the position that fundamentally identity should be about services not about nodes on which they happen to be running. It seem odd that you'd need to "change identity" if a scheduler stopped an instance and started it again elsewhere or a VM rebooted and came back with a new name/IP for example.

Maybe I could turn it on it's head, ignoring any mapping to SPIFFE stuff your proposal essentially comes down to: Consul agent name should be part of the Connect leaf certificate/identity and access rules should be based on both service name and the agent it's running on. (n.b. you didn't directly propose node-specific intentions so that may seem like a leap but it's essentially the same as having a pre-defined registry of which node/service pairs are allowed and then presenting a certificate with both encoded).

That model gets really hard to manage in a scheduler world where the node a workload is running on is dynamic - a big part of the way we think about service mesh and service identity is about breaking that way of thinking about identity so identity becomes essentially decoupled from where a workload is running. In that model for example you'd then need to either fall back to existing per-service credentials or have an integration with the scheduler such that it creates the node/service mappings in Consul as part of provisioning a workload - that's essentially the same difficulty as having a scheduler integration that provisions a service-specific ACL each time it deploys a workload, and it's strictly worse (in terms of needing centralised Consul calls during workload deployment) than one that uses JWT or some other crypto method to convey that "trust".

Of course we want our solution to work well for legacy systems where that is the practical reality of how organisations think about security, but I still think we can achieve that with as-good-if-not-better UX by solving those discovery problems and then having the right integrations to make it easy to translate business security policies into the correct credentials being distributed to the correct places.

We still don't have full solutions to many parts of this puzzle though so I really appreciate your time and effort thinking this through. Do you still see significant advantages in tying identity to node names for you workflows? If you compare the practical steps needed by your org to roll out based on your proposal and the status quo with Consul, what advantages does your proposal have? I definitely do see some to be clear but they seem relatively minor compared to the amount of work involved in building this and the ongoing education and maintenance cost of having a whole other different identity management options on top of the existing model (which we'd need to keep as your's doesn't cover all use cases and because of backwards compatibility).

My feeling is that most of the benefit here could be achieved with tooling on top of what we have which seems like a much lower risk way to experiment and prove out the UX than core changes to Consul's identity management.

gpburdell56 commented 4 years ago

Hi Paul,

But I don't think your proposal necessarily needs that huge model change - lets continue by assuming that the step one is creating some sort of policy in Consul that is not directly attached to existing registrations but that will be matched against later

Agreed. I really just wanted to create some attestation configurations in step1. I don't think these configs need to be tightly coupled with a service registration. see more in "Proposed Security Group Implementation" in diagram below, of which I basically borrowed from SPIRE's implementation.

...then they can issue agent tokens that encode not just node:write for the node name but also service:write for the authorized service. If they do that then anything that can speak to localhost - even without an ACL token) can obtain a "workload" i.e. Connect cert for that service

I agree that the "workload SVIDs" I'm trying to deliver is already supported via Connect Service Leaf Certificate. I think the key question here, that I haven't satisfactorily addressed, is what advantage does adding attestation process offer over the existing ACL token process? I think there are three.

More governable security group model

I took a step back to research more about the relationship between the agent and the service, and I think it's actually better to model this relationship based on that of a security group: a group of nodes that's is authorized to issue a particular service identity.

IMO, Consul has implicitly implemented this security group model via node acl tokens and service acl tokens (see "Security Group Implementation via ACL Tokens"). The challenge with this implementation is if I just look at my ACL policies alone, it's hard to understand which nodes are authorized to run what services.
In SPIRE's implementation of this security group model, each service identity (workload SVID) can reference a parent identity in a SPIRE registration entry. I use to think this parent identity just represents a hosting node, but it turns out a parent SPIFFE ID can represent a logical security group. An agent can be issued multiple "security group" SVIDs if it's been delegated to issue multiple kinds of service identities.
I'm now proposing to essentially add a similar "attestation configuration" to Consul to better capture this security group relationship so it can be better audited and governed centrally at the enterprise level.

Adding more nodes does not require more policy configuration
- if I add a new hosting node to my environment, would I need to generate another agent ACL token with a unique node name?
- with an attestation configuration, I think I would always have one entry per service identity, and I would just need to update the selector of that entry with the additional nodes.
Updates to a security group does not require restart
- I could be wrong about this, but if I no longer want a node to issue a particular Connect leaf certificate, would I have to update its agent ACL token and restart the node?
- with an attestation design, if I modify an attestation configuration, I could in theory trigger a rerun of the node attention process, which would redeploy (and revoke) the node SVIDS and workload SVIDS.

Many (most?) users currently don't have a single agent-per-workload model...

I don't want to be constrained to this as well. if a service identity's parent identity (in the attestation config) represents a security group, then multiple agents can be delegated to issue for that service identity as long as it can assume that parent identity.

...Consul agent name should be part of the Connect leaf certificate/identity and access rules should be based on both service name and the agent it's running on.

No. I agree we don't want this

deployment model

banks commented 4 years ago

Thanks Simon, this is a really detailed and thoughtful proposal.

I don't disagree that it has merit, but I'm still not sure about whether it makes sense for Consul - I'll raise this for discussion internally though we certainly aren't done with ACL/workload identity UX and will be making some changes.

The things I remain unsure about:

How to trust the SG assignment?

I'm still not clear in your model how agents can prove that they are authorized to represent specific security groups. Do they still have a token or secret of some kind? Do we need plugins for every platform that can do that via e.g. signed instance metadata? How can we establish trust between the trusted platform (i.e. scheduler or cloud or config management/provisioning tooling) such that we can issue security group identities securely?

Security Groups vs Agent Tokens

While it's somewhat true that you can fit the security group model onto the current agent ACL token, in practice that's not how it's really designed to be used. Agent ACL tokens should generally not allow any service registrations inherently in a Connect setting unless you consider everything on that host to be "trusted" and have the same identity (i.e. it's a single service on a single VM with possible auxiliary process like log shippers that are "trusted"). If you assign service:write via an agent token, then any process on that machine can "act" as that agent, acquire a certificate for it and so on.

This could be where SPIFFE style workload attestations come in where the agent must be able to verify the UID of the process etc. but that still feels problematic to me - many setups (e.g. Kube) make that pretty hard, even where it's possible it can require running the agent as root which we really want to avoid.

For service identity we expect every service instance to register with it's own token which is the basis of trust in that service's identity. This is very different because even if multiple services share an agent and the agent learns all of those tokens, the separate workloads still can't act as each other or obtain certificates for the other etc. It works everywhere without root agents, without platform specifics like UDS and process UID checks and it keeps Consul simpler.

Finally, the "Security groups" pattern you propose here still seems to assume a non-scheduler model where operators manually assign workloads to nodes. In a scheduler environment it seems like you'd just need to have every agent in your Kube/Nomad cluster assigned to every security group because you don't know where the workloads will end up. You could have an integration where the scheduler will dynamically provision security group entitlement to nodes as it makes allocation decisions but that is much more complicated than having it just disseminate trust via tokens or something similar directly into the workload like in the current model.

While we believe firmly that most orgs are going to have workloads in multiple runtimes for a long time to come, adopting such a complex model and workflow that basically doesn't work with schedulers seems like the wrong move to me in the current climate! I'd rather find a pattern that works great at a lower level in both places and then provide tooling to enable workflows like this in more specific environments.

Complexity

Overall, I really like your goals here of making identity management and auditing easier for Enterprises - I agree we need to do something to improve that. My feeling is that this proposal adds a lot of additional complexity to Consul though which is concerning because Consul is already extremely complex.

That's why I prefer to think about ways Consul itself can remain relatively agnostic to the process chosen to disseminate identity because every Organisation will have a different set of requirements there and a different set of platform integrations to consider. That's why we are pushing more towards AuthMethods as a way to bridge the gap - Consul provides enough hooks that other more flexible ways to authenticate workloads and provision identity can be built that do meet needs like this, while not taking on all of this extra code etc in it's core.

That said, we certainly need to improve those hooks to support wider integrations, and improve the integrations to cover more common platforms and workflows.

I'd still love to see whether the workflow you are imagining here can be achieved using AuthMethods perhaps with some modifications. I think this kind of workflow could be possible with Vault and Vault Agent which already has support to auto-attest nodes from cloud provider credentials etc and already has well established support for Audit logging and assigning policies and rules etc.

Would you be interested to help work out how to make this workflow practical via Vault? I'm sure there are edges that would need changes but I feel like that's a more promising route that building a whole new identity model into Consul to support this specific workflow.

gpburdell56 commented 4 years ago

Hey Paul, I'd be glad to help with exploring this workflow via Vault. I gotta read more about Vault first, and then I'll post another updated version of this design proposal with Vault in mind. I think it's easy to move node attestation out of Consul. I'm not sure yet if we can move workload attestation out of a local Consul agent.

Here are some responses to your previous feedback:

Complexity

I agree adding more complexity to Consul is undesirable. I think concepts like Security Group ought to be managed via a more centralized API, and I don't think we should use Consul to build Security Groups as Consul agents are meant to be distributed. I do think Consul can be used to enforce Security Groups after they are defined by a central component (like Vault?)

Security Group vs Agent Tokens

If you assign service:write via an agent token, then any process on that machine can "act" as that agent

If we do incorporate the SG concept, I would recommend to use an agent token as more of a "bootstrap token" that would just contain node:write without a service:write. this token can be used during node attestation to authenticate an agent to be granted a set of Security Group SVIDs. Conceptually, a service ACL token could be also used as bootstrap token as one of the ways to authenticate a workload during workload attestation.

For service identity we expect every service instance to register with it's own token

I think we can still adhere to this guideline by using a service ACL token as the authentication mechanism for workload attestation. this would also reduce the complexity of implementing workload attestation with platform specific authentication plugins on Day One.

the "Security groups" pattern you propose here still seems to assume a non-scheduler model

I did consider a scheduler model (as illustrated in the k8s deployment diagram), but I am also assuming that a scheduler would only schedule a pre-defined (authorized) set of workloads on a given hosting cluster. in k8s, a consul client agent would be deployed as a daemonset onto a k8s node. if the node runs different workloads (e.g multiple namespaces), then this agent will need to attest for a security group svid for each namespace. in this case, the k8s Security Account Token presented by the client agent will be associated with multiple security group ids.

How to trust the SG assignment

agents can be authenticated during node attestation by the following means. node attestation will need to support platform specific plugins to support each authentication mean. And I agree we don't have to implement node attestation within Consul to reduce its complexity.

vm host: use Consul agent ACL token with node:write as a bootstrap token
k8s node: use k8s Security Account Token assigned to the Consul agent process
aws ec2 instance: use aws instance identity document that can be obtained locally by the agent

Enhancing AuthMethods

I do think the existing AuthMethods (aka ConnectAuthorize()?) can be enhanced to support workload attestation. My main use case is to use it as part of an Envoy SDS Check() call to generate a JWT bearer token signed by a workload SVID. The diagrams below shows my current notes on changes that could be made to Consul to add workload attestation support. (blue means new stuff to be added). I have another diagram showing how Consul could be updated to integrate with node attestation, but I'll save that for another time.

consul workload attestation via ConnectAuthorize

Final Thoughts

If we use agent ACL token for node attestation, and service ACL token for workload attestation, then what's the value-add for changing anything at all?

Well, I do think incorporating Security Group modeling would better capture the implicit parent-child relationship between node ACL policies and service ACL policies. this could be useful for governance and auditing purposes. Security Group management can be implemented outside of Consul.
I could also revoke the identity of service instances from one node without affecting all nodes by updating the Security Group registration.

So is it correct to assume that you are interested to explore adding Security Group management workflows via Vault?