mcpherrinm closed this issue 2 years ago.
This is a cool idea, particularly for humans, who access production systems only sparsely.
This may be compelling for our usage of SPIRE in Kubernetes: a node compromise would expose only the workloads resident on that node at the time of the attack, rather than making every SVID (resident and non-resident) available.
cc: @gregose @brentjo, as per our call today
I'm not sure this is exactly what you'd want for that property. An attacker who wanted a non-resident workload's SVID would have to control the local SPIRE agent, and if they can do that, they could simply trigger the lazy issuance. Admittedly you would have an issuance log at that point, but it's not the security boundary I'd like.
How to actually get that: each node runs a SPIRE agent, and you give each agent a unique SPIFFE ID. Your Kubernetes integration then only registers workloads for pods that are actually on the node. I believe support/k8s-workload-registrar already does this, but I haven't verified it. If it doesn't, I'll add it; we're looking into this soon to replace some internal integration glue code we have.
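A minimal Go sketch of the per-node approach, assuming hypothetical `Pod` and `Entry` types (these are not SPIRE or client-go types) and a hypothetical `spiffe://<trust-domain>/k8s-node/<node>` layout for agent IDs. The `k8s:ns`, `k8s:sa` and `k8s:pod-name` selectors follow the naming of SPIRE's k8s workload attestor; the rest is illustrative. The point is that an entry is only created for a pod scheduled on the node, and it is parented to that node's agent, so no other agent ever receives it.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down view of a Kubernetes pod.
type Pod struct {
	Name           string
	Namespace      string
	ServiceAccount string
	NodeName       string // node the scheduler placed the pod on
}

// Entry is a hypothetical, trimmed-down registration entry.
type Entry struct {
	ParentID  string   // SPIFFE ID of the only agent allowed to receive this entry
	SPIFFEID  string   // identity issued to the workload
	Selectors []string // workload attestor selectors
}

// nodeAgentID returns a unique per-node agent SPIFFE ID (hypothetical path layout).
func nodeAgentID(trustDomain, nodeName string) string {
	return fmt.Sprintf("spiffe://%s/k8s-node/%s", trustDomain, nodeName)
}

// entriesForNode builds entries only for pods actually scheduled on nodeName,
// parented to that node's agent.
func entriesForNode(trustDomain, nodeName string, pods []Pod) []Entry {
	var entries []Entry
	parent := nodeAgentID(trustDomain, nodeName)
	for _, p := range pods {
		if p.NodeName != nodeName {
			continue // pod lives on another node; this agent must not learn about it
		}
		entries = append(entries, Entry{
			ParentID: parent,
			SPIFFEID: fmt.Sprintf("spiffe://%s/ns/%s/sa/%s", trustDomain, p.Namespace, p.ServiceAccount),
			Selectors: []string{
				"k8s:ns:" + p.Namespace,
				"k8s:sa:" + p.ServiceAccount,
				"k8s:pod-name:" + p.Name,
			},
		})
	}
	return entries
}

func main() {
	pods := []Pod{
		{Name: "web-1", Namespace: "prod", ServiceAccount: "web", NodeName: "node-a"},
		{Name: "db-1", Namespace: "prod", ServiceAccount: "db", NodeName: "node-b"},
	}
	for _, e := range entriesForNode("example.org", "node-a", pods) {
		fmt.Println(e.ParentID, e.SPIFFEID, e.Selectors)
	}
}
```

Compromising node-a's agent would then yield only the `web` identity, not `db`.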
k8s-workload-registrar currently registers all workloads against a generic per-cluster node SPIFFE ID.
That's helpful context @mcpherrinm @azdagron. Having something (de)register workloads per Kubernetes node seems like a more feasible approach. A few things about having the Kubernetes integration register workloads jump out at me:
At a glance, I'm not entirely sure how to mitigate (1); the other two challenges seem like design tradeoffs.
ReservePlugin interface: https://github.com/kubernetes/kubernetes/blob/edad4bbfc824215fc254096dfbbd1b2ab8ce6781/pkg/scheduler/framework/v1alpha1/interface.go#L347
UnreservePlugin interface: https://github.com/kubernetes/kubernetes/blob/edad4bbfc824215fc254096dfbbd1b2ab8ce6781/pkg/scheduler/framework/v1alpha1/interface.go#L378
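A rough sketch of how the Reserve/Unreserve hooks could drive registration: create the entry for the chosen node when the pod is reserved, and roll it back if scheduling later fails. This does not import the real scheduler framework (its interfaces live in the linked package and take more arguments); `registry`, `Reserve`, `Unreserve` and the agent ID layout here are hypothetical stand-ins, and an in-memory map replaces the SPIRE server API. Deregistration on normal pod deletion would still need a separate watcher.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical in-memory stand-in for the SPIRE server's entry store.
type registry struct {
	mu      sync.Mutex
	entries map[string]string // pod key (namespace/name) -> parent agent SPIFFE ID
}

func newRegistry() *registry {
	return &registry{entries: make(map[string]string)}
}

// parentFor returns the hypothetical per-node agent SPIFFE ID.
func parentFor(nodeName string) string {
	return "spiffe://example.org/k8s-node/" + nodeName
}

// Reserve mirrors the idea of a ReservePlugin: once the pod is tentatively
// assigned to nodeName, register it against that node's agent only.
func (r *registry) Reserve(podKey, nodeName string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[podKey]; ok {
		return fmt.Errorf("pod %s already has an entry", podKey)
	}
	r.entries[podKey] = parentFor(nodeName)
	return nil
}

// Unreserve mirrors an UnreservePlugin: if a later scheduling phase fails,
// remove the entry so no agent holds an identity for a pod that never landed.
func (r *registry) Unreserve(podKey, nodeName string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.entries[podKey] == parentFor(nodeName) {
		delete(r.entries, podKey)
	}
}

func main() {
	r := newRegistry()
	if err := r.Reserve("prod/web-1", "node-a"); err != nil {
		fmt.Println("reserve failed:", err)
		return
	}
	// Suppose binding fails after Reserve: roll the registration back.
	r.Unreserve("prod/web-1", "node-a")
	fmt.Println("entries left:", len(r.entries))
}
```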
Ideally I would like to specify when a registration entry is created via the API that issuance should be deferred until an agent has successfully attested the workload. Then the agent can fetch the certificate.
My first impression is that it feels more natural to enable this feature on an agent-by-agent basis, e.g. disable_eager_svid_caching = true. Perhaps that inclination comes from my mental model, in which an entry describes a workload and its identity; the behavior in question here is a function of agent logic and has nothing to do with the workload or its identity itself.
Do you have cases in which exposing this feature as an agent configuration option wouldn't quite cut the mustard?
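A minimal sketch of what an agent-wide switch would change, assuming a hypothetical `disableEagerSVIDCaching` field that mirrors the `disable_eager_svid_caching` name proposed in the comment (it is not an existing SPIRE option). `mint` stands in for the agent asking the server to sign an SVID; expiry and rotation are ignored. With the switch off, every synced entry is minted up front, which is the eager behavior this issue describes; with it on, minting waits until a workload is attested and asks for its SVID.

```go
package main

import "fmt"

// mintFunc stands in for the agent requesting a signed SVID from the server (hypothetical).
type mintFunc func(spiffeID string) string

type agentCache struct {
	disableEagerSVIDCaching bool              // hypothetical agent-level switch
	mint                    mintFunc
	svids                   map[string]string // SPIFFE ID -> cached SVID
	mintCount               int               // signing requests sent to the server
}

func newAgentCache(disableEager bool, mint mintFunc) *agentCache {
	return &agentCache{disableEagerSVIDCaching: disableEager, mint: mint, svids: map[string]string{}}
}

// OnEntriesSynced runs when the agent learns about entries from the server.
func (c *agentCache) OnEntriesSynced(ids []string) {
	if c.disableEagerSVIDCaching {
		return // defer issuance until a workload actually attests
	}
	for _, id := range ids {
		c.issue(id)
	}
}

// FetchSVID is called after a workload has been attested and matched to id.
func (c *agentCache) FetchSVID(id string) string {
	if svid, ok := c.svids[id]; ok {
		return svid
	}
	return c.issue(id)
}

func (c *agentCache) issue(id string) string {
	c.mintCount++
	svid := c.mint(id)
	c.svids[id] = svid
	return svid
}

func main() {
	mint := func(id string) string { return "svid-for-" + id }
	ids := []string{"spiffe://example.org/ci/job-1", "spiffe://example.org/ci/job-2"}

	eager := newAgentCache(false, mint)
	eager.OnEntriesSynced(ids)

	lazy := newAgentCache(true, mint)
	lazy.OnEntriesSynced(ids)
	lazy.FetchSVID(ids[0]) // only job-1 ever asks for an identity

	fmt.Println("eager mints:", eager.mintCount, "lazy mints:", lazy.mintCount)
}
```

Because the switch applies to the whole agent, every entry that agent serves becomes lazy, including long-running services that would benefit from a pre-fetched SVID.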
I'd like to avoid issuing certificates until they're requested. Since we use short-lived containers for doing builds, there's significant overhead of re-issuing certificates to each build.
Don't these two statements conflict? Or is this an argument for per-entry control?
I think this is being addressed to some extent by #2593. Happy to revisit if needed.
For some workloads, it might be undesirable to have SPIRE agents eagerly fetch certificates as soon as they learn about the registration.
Ideally, when creating a registration entry via the API, I would like to be able to specify that issuance should be deferred until an agent has successfully attested the workload. The agent can then fetch the certificate. Avoiding unnecessary issuance seems worth that tradeoff in some scenarios.
Here are two example use-cases:
We provide SPIFFE credentials to human users who log into systems, so they can perform administrative tasks by calling services or databases. Most of the time humans do not log into systems, so having certificates always ready to go is unnecessary; provisioning them on demand is sufficient.
We run many CI jobs in Docker containers. A small fraction of them need to call other services, so we want to make sure they have the option of getting a SPIFFE identity. I'd like to avoid issuing certificates until they're requested. Since we use short-lived containers for builds, re-issuing certificates for every build adds significant overhead.
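A minimal sketch of the per-entry variant, assuming a hypothetical `DeferIssuance` field that would be set when the entry is created through the API (SPIRE's entry type has no such field in this sketch's assumptions; the name is illustrative). Unlike an agent-wide switch, it lets one agent pre-fetch SVIDs for long-running services while leaving human-login and CI identities, as in the two use-cases, for on-demand issuance after attestation.

```go
package main

import "fmt"

// Entry is a hypothetical registration entry with an opt-in deferral flag.
type Entry struct {
	SPIFFEID      string
	DeferIssuance bool // hypothetical: don't mint until the workload attests
}

// prefetchIDs returns the SPIFFE IDs an agent should mint as soon as it syncs
// entries; deferred entries are left for on-demand issuance.
func prefetchIDs(entries []Entry) []string {
	var ids []string
	for _, e := range entries {
		if !e.DeferIssuance {
			ids = append(ids, e.SPIFFEID)
		}
	}
	return ids
}

func main() {
	entries := []Entry{
		{SPIFFEID: "spiffe://example.org/svc/api"},                          // long-running service: eager
		{SPIFFEID: "spiffe://example.org/human/alice", DeferIssuance: true}, // rare human login
		{SPIFFEID: "spiffe://example.org/ci/build", DeferIssuance: true},    // short-lived CI job
	}
	fmt.Println(prefetchIDs(entries))
}
```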