Feature Request: Make the spire server's node selectors consistent with the real state

kongweiguo commented 1 year ago

Currently, the node attestation flow seems to be a one-shot action. After the node attestation procedure, the server-side node attestation plugin emits some selectors back to the SPIRE server. It seems those selectors are never changed or updated until the next node attestation, which is only triggered when the SPIRE agent starts.

First, from a security architecture perspective, security attestation should be continuous.

Second, the scope of the SPIRE server's workload entries should be accurate. Especially in large-scale production scenarios, the labels and taints of a node/agent may change frequently to meet the scheduling needs of business applications.

So I want to request a mechanism that keeps the node selectors in the SPIRE server in step with real-world changes. Here are some scenarios:

In k8s, when the node taints or labels change, the SPIRE server's selectors should be updated promptly.
In the classic EC2/ECS scenario, the node attestation plugin can obtain the labels/tags of the nodes from the public cloud provider to create node selectors, but after the node attributes change, those selectors should also be updated promptly.

Proposal: The SPIRE agent provides a new interface to the plugin so that the plugin can actively trigger attestation. The plugin is responsible for sensing environmental changes.
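To make the proposal concrete, here is a minimal sketch of what the plugin side could look like. It assumes a hypothetical `ReattestTrigger` callback handed to the plugin by the agent and a hypothetical `LabelWatcher` helper; neither exists in SPIRE today, and how the labels are actually fetched (Kubernetes API, cloud provider tags) is left out.

```go
package main

import "fmt"

// ReattestTrigger is a HYPOTHETICAL callback the agent would hand to a plugin.
// It is not part of SPIRE; it only illustrates the proposed interface.
type ReattestTrigger func(reason string)

// LabelWatcher is a HYPOTHETICAL plugin-side helper. It remembers the last
// observed node labels and fires the trigger when they change.
type LabelWatcher struct {
	last    map[string]string
	trigger ReattestTrigger
}

// NewLabelWatcher stores a private copy of the initial labels.
func NewLabelWatcher(initial map[string]string, trigger ReattestTrigger) *LabelWatcher {
	return &LabelWatcher{last: copyLabels(initial), trigger: trigger}
}

// Observe compares the current labels with the last snapshot. It returns true
// and calls the trigger only when the labels differ.
func (w *LabelWatcher) Observe(current map[string]string) bool {
	if sameLabels(w.last, current) {
		return false
	}
	w.last = copyLabels(current)
	w.trigger("node labels changed")
	return true
}

// sameLabels reports whether both maps hold exactly the same key/value pairs.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

// copyLabels returns an independent copy so later caller mutations are not seen.
func copyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

func main() {
	w := NewLabelWatcher(map[string]string{"pool": "blue"}, func(reason string) {
		fmt.Println("re-attest requested:", reason)
	})
	fmt.Println(w.Observe(map[string]string{"pool": "blue"}))  // labels unchanged
	fmt.Println(w.Observe(map[string]string{"pool": "green"})) // labels changed
}
```

The plugin only decides when to ask for re-attestation. Whether the request is honored would still be up to the agent and server.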

rturner3 commented 1 year ago

Hi @kongweiguo, I think this would be a great capability to have. We have discussed doing something similar for workload attestation in the past as well (#2666).

In order for this to work effectively, the server would need some way to signal to agents that they need to re-attest, as well as a way for the server to enforce that agents must re-attest at a certain point.

An important point to consider is that node attestation is not always safe to repeat automatically, depending on the node attestor plugin, because some node attestors have a "trust on first use" (TOFU) design (e.g. aws_iid, gcp_iit). We wouldn't want this automatic re-attestation performed for the TOFU node attestors, since it wouldn't be a safe operation: for example, any process on an AWS VM could fetch the local IID from the IMDS and impersonate SPIRE Agent. All of the builtin node attestors that follow this TOFU principle have the CanReattest field set to false in the NodeAttestor interface: https://github.com/search?q=repo%3Aspiffe%2Fspire%20CanReattest&type=code

This gets a little more complicated because users can provide SPIFFE ID path templates in the NodeAttestor plugins, so it's possible a re-attestation could cause the agent to receive a new SPIFFE ID based on the newly discovered node selectors. We would also want to make sure that old node selectors get cleaned up in the datastore.
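As a rough illustration of the cleanup concern, the sketch below diffs the selectors stored from the previous attestation against the ones produced by a re-attestation. The `Selector` struct and `DiffSelectors` function are hypothetical, and the selector values are made up; a real implementation would presumably replace the stored set transactionally in the datastore.

```go
package main

import "fmt"

// Selector is a simplified, HYPOTHETICAL type/value pair used for illustration.
type Selector struct {
	Type  string
	Value string
}

// DiffSelectors returns the selectors that appear only in current (added) and
// only in previous (removed). Input order is preserved and duplicates are
// reported once.
func DiffSelectors(previous, current []Selector) (added, removed []Selector) {
	prevSet := make(map[Selector]struct{}, len(previous))
	for _, s := range previous {
		prevSet[s] = struct{}{}
	}

	curSet := make(map[Selector]struct{}, len(current))
	for _, s := range current {
		if _, seen := curSet[s]; seen {
			continue // duplicate in current
		}
		curSet[s] = struct{}{}
		if _, ok := prevSet[s]; !ok {
			added = append(added, s)
		}
	}

	for _, s := range previous {
		if _, pending := prevSet[s]; !pending {
			continue // duplicate in previous, already handled
		}
		delete(prevSet, s)
		if _, ok := curSet[s]; !ok {
			removed = append(removed, s)
		}
	}
	return added, removed
}

func main() {
	// Selector values are invented for the example.
	previous := []Selector{{"example", "label:pool:blue"}, {"example", "label:zone:a"}}
	current := []Selector{{"example", "label:pool:green"}, {"example", "label:zone:a"}}

	added, removed := DiffSelectors(previous, current)
	fmt.Println("added:", added)
	fmt.Println("removed:", removed)
}
```

If an agent's SPIFFE ID is templated from any selector in the `removed` or `added` sets, the ID itself may change, which is the complication raised above.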

All that being said, I think we would want to do some more detailed design first, considering some of the points I mentioned. Is this something you're interested in working on, @kongweiguo?

kongweiguo commented 1 year ago

@rturner3 Sure, I am interested in working on this. I am glad that we both agree this is a good feature to have.

In fact, I've been trying to build some external systems to solve this problem. Because there is no mechanism we can use, it is basically hard, hacky work with too many dependencies on our other internal systems. It would be a great help for running SPIRE in production environments if SPIRE itself had such a mechanism.

Also, I really agree with you that we should start with some basic design.

kongweiguo commented 1 year ago

I think I could do some initial design and post it here. Or where do you think we should start?

kongweiguo commented 1 year ago

Perhaps the continuous attestation mentioned above is not the whole picture of the requirement. I think there are two aspects:

From a security perspective, there should be a continuous attestation mechanism to make sure the agent is always in a secure environment. However, this would not work in the TOFU scenarios.
From an operations and management perspective, we also need a feature/mechanism that makes it convenient for an outside system to adjust the node scope of a workload entry dynamically, so that workload SVIDs can be distributed to, and fetched by, the right node agent in time.

rturner3 commented 1 year ago

> From an operations and management perspective, we also need a feature/mechanism that makes it convenient for an outside system to adjust the node scope of a workload entry dynamically, so that workload SVIDs can be distributed to, and fetched by, the right node agent in time.

I think the way this could be done today would be to have a registrar service that monitors node inventory and updates registrations as needed. Just to clarify, do you see any gaps with that approach that would lead to new requirements related to this proposal?
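A registrar of the kind described here could be sketched as a reconcile step like the one below. Everything in it is hypothetical: the `Node` and `Entry` types, the label-matching rule, and the SPIFFE IDs are invented for illustration, and a real registrar would read node inventory from its own source and apply the plan through the SPIRE Server's registration API.

```go
package main

import "fmt"

// Node is a HYPOTHETICAL inventory record for an attested agent.
type Node struct {
	AgentID string
	Labels  map[string]string
}

// Entry is a simplified, HYPOTHETICAL registration entry: a workload SPIFFE ID
// parented to one agent.
type Entry struct {
	ParentID string
	SPIFFEID string
}

// Desired returns one entry per node whose labels contain key=value.
func Desired(nodes []Node, key, value, workloadID string) []Entry {
	var out []Entry
	for _, n := range nodes {
		if v, ok := n.Labels[key]; ok && v == value {
			out = append(out, Entry{ParentID: n.AgentID, SPIFFEID: workloadID})
		}
	}
	return out
}

// Plan compares existing entries with the desired ones and returns what the
// registrar would have to create and remove. Input order is preserved.
func Plan(existing, desired []Entry) (create, remove []Entry) {
	have := make(map[Entry]bool, len(existing))
	for _, e := range existing {
		have[e] = true
	}
	want := make(map[Entry]bool, len(desired))
	for _, e := range desired {
		want[e] = true
		if !have[e] {
			create = append(create, e)
		}
	}
	for _, e := range existing {
		if !want[e] {
			remove = append(remove, e)
		}
	}
	return create, remove
}

func main() {
	nodes := []Node{
		{AgentID: "spiffe://example.org/agent/node-a", Labels: map[string]string{"pool": "blue"}},
		{AgentID: "spiffe://example.org/agent/node-b", Labels: map[string]string{"pool": "green"}},
	}
	workload := "spiffe://example.org/workload/api"

	// node-b used to be in the blue pool, so an entry for it still exists.
	existing := []Entry{{ParentID: "spiffe://example.org/agent/node-b", SPIFFEID: workload}}

	create, remove := Plan(existing, Desired(nodes, "pool", "blue", workload))
	fmt.Println("create:", create)
	fmt.Println("remove:", remove)
}
```

This approach keeps node selectors untouched and moves the workload entries instead, which is why it works without re-attestation.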

Would you mind offering a brief description of the processes illustrated in your diagram? I also wasn't sure what "ROT" and "OPS System" represented.

rturner3 commented 1 year ago

Hi @kongweiguo, just checking back in to see if you were planning to revisit this issue again regarding some of the open questions in previous comments?

evan2645 commented 1 year ago

Hey @kongweiguo - this seems like a great improvement, but feels like there's quite a few things we need to think through in order to implement it.

We're going to move it to the backlog as unscoped ... please let us know if you're able to help push this work forward in the near term

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 365 days with no activity.

spiffe / spire

Feature Request: Make the spire server's node selectors consistent with the real state #4378