spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
1.78k stars 468 forks source link

Address cases of unnecessary re-signing requests by Agents #2639

Closed amoore877 closed 1 year ago

amoore877 commented 2 years ago

Original problem

In operations, instead of deleting registrations when no longer needed it can be helpful or safer to set Expiry on registrations, taking advantage SPIRE Server's pruning logic.

However in this model every time a registration is renewed, the Revision Number of the entry is incremented. This makes all Agents tracking the registration call the Server for re-signings. In the case of heavy usage of node aliasing, this can result in a stampede of calls.

This re-signing is a waste of resources, causing unnecessary load. For widely-distributed identities, this actually causes noticeable decrease in SPIRE Server performance as the Agent stampede comes in. Issued identities are not affected by the registration's expiry (which is somewhat warned about here by Please note that this is a data management feature and not a security feature).

The waste could be exacerbated into DoS if an SPIRE Admin's renewal process misbehaves and starts making numerous unnecessary expiry updates.

Proposal

There may be other cases (which we should discuss and list out in this issue) where it is not necessary to use resources to get re-signings due to a registration update.

Either:

The first suggestion could also be in the migration plan of getting to the second.

azdagron commented 2 years ago

Thanks for opening this, @amoore877. I think we want to figure out something to do here. Whatever we land on, I think maintaining a few properties is going to be important:

  1. Agent should remain agnostic as to what attributes about a registration entry influence the shape of an SVID signed over that entry.
  2. The entry needs a counter that is always incremented whenever the entry is changed. This property is important for some future looking features we've considered adding that would need the read-modify-write cycle of an entry to happen outside of a database transaction which rely on using the counter to detect and resolve conflicts on update.

Both of these properties are currently fulfilled with the revision_number field. The first proposal makes me uncomfortable even if we didn't want these properties, since even from a human operations perspective, we generally expect a revision number to change when something is updated.

The second proposal is interesting but also prone to error, especially in mixed deployments where two or more servers, running disparate versions of the software, might have a different idea about what fields would require a re-signing.

I think the difficulty in solving this is indicative of a smell in the way our registrations are currently shaped, where we have metadata about the entry intermixed with the entry itself. This isn't the first time we've run into problems related to that. Alas, acknowledging that doesn't help us fix this in the short term.

I am also curious how the signing load you describe compares to the signing load in your deployments due to new registrations, which would also have the stampeding herd effect you describe, if I understand your registration process correctly.

I probably need a little to think on this to see if there is an alternate solution that still maintains the two properties I described without being impacted by different versions of software being present in the deployment.

amoore877 commented 2 years ago

Agent should remain agnostic as to what attributes about a registration entry influence the shape of an SVID signed over that entry.

Definitely agree; agent should be concerned with dedicated metadata and not make decisions beyond that when it comes to re-signing. Server should be in charge of the dedicated signal(s) agents observe.

The entry needs a counter that is always incremented whenever the entry is changed. This property is important for some future looking features we've considered adding that would need the read-modify-write cycle of an entry to happen outside of a database transaction which rely on using the counter to detect and resolve conflicts on update.

It sounds like we need to resolve this "registration entry shape" problem sooner then. Adding more consumers of entry properties is going to make complexity and reliability of an entry re-design increase significantly.

The first proposal makes me uncomfortable even if we didn't want these properties, since even from a human operations perspective, we generally expect a revision number to change when something is updated.

At least from my human perspective, I see the revision number as only applicable to the "useful" information about an entry- the security aspects of ID, Parent, selectors, federation, etc. I would not expect revision number, at least as it exists today, to be applicable to an entry property we explicitly call out in documentation as a data management feature entirely internal to the Server's logic.

The second proposal is interesting but also prone to error, especially in mixed deployments where two or more servers, running disparate versions of the software, might have a different idea about what fields would require a re-signing.

I think that's a case that operators would need to be aware of when updating between Server versions, either:

I am also curious how the signing load you describe compares to the signing load in your deployments due to new registrations, which would also have the stampeding herd effect you describe, if I understand your registration process correctly.

I expect the stampede effect for a new registration is the same as renewing, assuming the same breadth of distribution across Agent instances. I say "I expect" because we recently enabled this mass renewal logic, and so because this was the first time making use of it essentially most registrations all came up for renewal at the same time.

Much like my example of a buggy registrant misbehaving in renewal logic potentially indirectly DoS'ing Servers, a buggy registrant could certainly also DoS Servers (and Agents) by creating numerous unnecessary registrations. However in the case of a renewal the entire re-signing is not necessary and thus we could avoid that DoS completely.

rturner3 commented 1 year ago

Maintainers discussed this and feel that it's not something we want to try to address right now unless if it's causing a significant problem due to the complexity and scope involved. In most cases, SVID TTLs should be causing more frequent signings than update of entry fields like expiry, so it would be interesting to know if there is any edge case that needs more consideration.

We would definitely consider addressing this if we ever do a larger data model refactor down the line.