spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

Deferred certificate issuance #1315

Closed mcpherrinm closed 2 years ago

mcpherrinm commented 4 years ago

For some workloads, it might be undesirable to have spire-agents eagerly fetch certificates as soon as they learn about the registration.

Ideally I would like to specify when a registration entry is created via the API that issuance should be deferred until an agent has successfully attested the workload. Then the agent can fetch the certificate. The tradeoff of avoiding issuance seems worthwhile in some scenarios.

Here are two example use-cases:

We provide SPIFFE credentials to human users who log into systems, so they can perform administrative tasks by calling services or databases. Most of the time humans do not log into systems, so having certificates always ready to go is not needed. It is sufficient to provision them on demand.

We run many CI jobs in docker containers. Some small fraction of them needs to call other services, so we want to make sure they have the option of getting a SPIFFE identity. I'd like to avoid issuing certificates until they're requested. Since we use short-lived containers for doing builds, there's significant overhead of re-issuing certificates to each build.
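To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not SPIRE code: `Entry`, `DeferIssuance`, `Signer`, and `Cache` are hypothetical names made up for illustration, and the signer is a stub standing in for the agent asking the server for an SVID.

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is a simplified, hypothetical stand-in for a registration entry.
// DeferIssuance is the per-entry flag proposed in this issue; it is an
// assumption, not an existing SPIRE field.
type Entry struct {
	SpiffeID      string
	DeferIssuance bool
}

// Signer stands in for the agent requesting a signed SVID from the server.
type Signer func(spiffeID string) string

// Cache holds issued SVIDs keyed by SPIFFE ID.
type Cache struct {
	mu    sync.Mutex
	sign  Signer
	svids map[string]string
}

func NewCache(sign Signer) *Cache {
	return &Cache{sign: sign, svids: make(map[string]string)}
}

// Sync runs when the agent learns about entries. Eager entries are issued
// right away; deferred entries are skipped until a workload asks for them.
func (c *Cache) Sync(entries []Entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, e := range entries {
		if e.DeferIssuance {
			continue
		}
		if _, ok := c.svids[e.SpiffeID]; !ok {
			c.svids[e.SpiffeID] = c.sign(e.SpiffeID)
		}
	}
}

// Fetch runs after a workload has been attested. It returns the cached SVID
// if one exists and otherwise issues one on demand.
func (c *Cache) Fetch(spiffeID string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if svid, ok := c.svids[spiffeID]; ok {
		return svid
	}
	svid := c.sign(spiffeID)
	c.svids[spiffeID] = svid
	return svid
}

func main() {
	issued := 0
	cache := NewCache(func(id string) string {
		issued++
		return "svid-for-" + id
	})
	cache.Sync([]Entry{
		{SpiffeID: "spiffe://example.org/db"},
		{SpiffeID: "spiffe://example.org/ci-job", DeferIssuance: true},
	})
	fmt.Println("issued after sync:", issued)
	cache.Fetch("spiffe://example.org/ci-job")
	fmt.Println("issued after fetch:", issued)
}
```

With this shape, a CI job that never calls another service never triggers a signing request, while a job that does pays the issuance latency on its first fetch.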

APTy commented 4 years ago

This is a cool idea, particularly around the sparse access of humans to production systems.

elee commented 4 years ago

This may be compelling for our usage of SPIRE in Kubernetes, as a node compromise would be limited to only the workloads resident on the node at the time of attack, rather than all SVIDs (resident and non-resident) becoming available.

cc: @gregose @brentjo as per our call today

mcpherrinm commented 4 years ago

I'm not sure this is exactly what you'd want for that property. An attacker who wanted a non-resident workload's SVID would have to control the local SPIRE agent, and if they can do that, they could trigger the lazy issuance. Admittedly you would have an issuance log at that point, but it's not the security boundary I'd like.

How to actually get that: each node runs a SPIRE agent, and you give each agent a unique SPIFFE ID. Your Kubernetes integration registers only the workloads whose pods are actually on that node. I believe support/k8s-workload-registrar already does this, but I haven't verified. If it doesn't, I'll add it; we're looking into this soon to replace some internal integration glue code we have.
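A rough sketch of that per-node registration idea, under stated assumptions: `Pod`, `Entry`, `agentID`, and `entriesForNode` are hypothetical names, and the SPIFFE ID paths (`/agent/<node>`, `/ns/<namespace>/sa/<service account>`) are one possible layout rather than anything the registrar is confirmed to use.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical view of a Kubernetes pod.
type Pod struct {
	Namespace      string
	ServiceAccount string
	Node           string
}

// Entry is a simplified, hypothetical registration entry.
type Entry struct {
	ParentID string // SPIFFE ID of the agent allowed to fetch this SVID
	SpiffeID string // SPIFFE ID of the workload
}

// agentID builds a unique per-node agent SPIFFE ID (assumed path layout).
func agentID(trustDomain, node string) string {
	return fmt.Sprintf("spiffe://%s/agent/%s", trustDomain, node)
}

// entriesForNode returns entries only for pods scheduled on the given node,
// each parented to that node's agent. An agent on another node never sees
// these entries, so it cannot fetch their SVIDs.
func entriesForNode(trustDomain, node string, pods []Pod) []Entry {
	var entries []Entry
	for _, p := range pods {
		if p.Node != node {
			continue
		}
		entries = append(entries, Entry{
			ParentID: agentID(trustDomain, node),
			SpiffeID: fmt.Sprintf("spiffe://%s/ns/%s/sa/%s",
				trustDomain, p.Namespace, p.ServiceAccount),
		})
	}
	return entries
}

func main() {
	pods := []Pod{
		{Namespace: "prod", ServiceAccount: "api", Node: "node-a"},
		{Namespace: "prod", ServiceAccount: "db", Node: "node-b"},
	}
	for _, e := range entriesForNode("example.org", "node-a", pods) {
		fmt.Println(e.ParentID, "->", e.SpiffeID)
	}
}
```

The security boundary here comes from the parent ID, not from deferring issuance: a compromised agent on node-a is only ever entitled to the entries parented to it.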

azdagron commented 4 years ago

k8s-workload-registrar currently registers all workloads against a generic per-cluster node SPIFFE ID.

elee commented 4 years ago

That's helpful context @mcpherrinm @azdagron -- having something {de,}register workloads per Kubernetes node seems like a more feasible approach. A few things about having the Kubernetes integration register workloads that jump out at me:

  1. there may be a race condition between the integration creating these registration entries and the workloads requesting them
  2. the SPIRE agent would have to poll aggressively to prune invalid workload registration entries as they are removed or modified (I think it does this already)
  3. the availability of the agents is now tightly coupled to the availability of the kube control plane and this registration control loop

I'm not entirely sure how to mitigate (1) at a glance; the other two challenges seem like design tradeoffs.

azdagron commented 4 years ago

(1) seems possible to mitigate by plugging into the Kubernetes Scheduling Framework (https://kubernetes.io/docs/concepts/configuration/scheduling-framework). The "Reserve" and "Unreserve" integration points seem promising. On "Reserve", a registration entry could be added. On "Unreserve", it could be removed.
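A rough sketch of the shape this could take. To keep it self-contained, the `Reserve` and `Unreserve` methods below are simplified stand-ins and do not match the real scheduler framework signatures, and `Registry` is a hypothetical in-memory substitute for calls to the SPIRE registration API.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical in-memory stand-in for the SPIRE registration
// API. It maps a pod UID to the node the entry was registered against.
type Registry struct {
	mu      sync.Mutex
	entries map[string]string
}

func NewRegistry() *Registry {
	return &Registry{entries: make(map[string]string)}
}

// Reserve would be called when the scheduler reserves a node for a pod,
// which happens before the pod starts. Creating the entry here means it
// exists by the time the workload asks for its SVID.
func (r *Registry) Reserve(podUID, node string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[podUID] = node
}

// Unreserve would be called if the reservation is rolled back, so the entry
// is removed again. Removing a missing entry is a no-op.
func (r *Registry) Unreserve(podUID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, podUID)
}

// NodeFor reports which node, if any, a pod is registered against.
func (r *Registry) NodeFor(podUID string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	node, ok := r.entries[podUID]
	return node, ok
}

func main() {
	reg := NewRegistry()
	reg.Reserve("pod-123", "node-a")
	node, ok := reg.NodeFor("pod-123")
	fmt.Println(node, ok)

	reg.Unreserve("pod-123")
	_, ok = reg.NodeFor("pod-123")
	fmt.Println(ok)
}
```

This only covers the rollback path; entries for pods that run and later terminate would still need to be cleaned up by some other mechanism.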

azdagron commented 4 years ago

ReservePlugin interface https://github.com/kubernetes/kubernetes/blob/edad4bbfc824215fc254096dfbbd1b2ab8ce6781/pkg/scheduler/framework/v1alpha1/interface.go#L347

UnreservePlugin interface https://github.com/kubernetes/kubernetes/blob/edad4bbfc824215fc254096dfbbd1b2ab8ce6781/pkg/scheduler/framework/v1alpha1/interface.go#L378

evan2645 commented 4 years ago

> Ideally I would like to specify when a registration entry is created via the API that issuance should be deferred until an agent has successfully attested the workload. Then the agent can fetch the certificate.

My first impression is that it feels more natural to enable this feature on an agent-by-agent basis, e.g. disable_eager_svid_caching = true. Perhaps that inclination is due to my mental model in which an entry describes a workload and its identity... the behavior in question here is a function of agent logic rather than being anything to do with the workload or its identity itself.

Do you have cases in which exposing this feature as an agent configurable wouldn't quite cut the mustard?

> I'd like to avoid issuing certificates until they're requested. Since we use short-lived containers for doing builds, there's significant overhead of re-issuing certificates to each build.

These two statements feel conflicting? Or, is this an argument for per-entry control?

azdagron commented 2 years ago

I think this is being solved to some extent with #2593. Happy to revisit if needed.