spire-agent: allow maintaining attestation state per spire-server instance

sorindumitru commented 5 months ago

spire-server becoming unavailable has both immediate and longer-term effects on spire-agent. The agent can only serve SVIDs from its cache and does not learn about newly registered workloads. This affects both X.509 and JWT SVIDs, but JWT SVIDs more so, since they more often have to be requested from the server because of unique claims (e.g. audience).

We’d like to see what it would take to get to a model where the database of spire-server instances doesn’t affect the system as a whole. You can already limit the impact a bit using nested SPIRE, with each downstream HA group of spire-servers using its own database. This limits the impact of a database being unavailable to a subset of nodes.

We were wondering if it is possible to get to a state where the individual spire-server instances in a downstream group could have independent databases. This mostly works ok, as long as you have some way to share the registration entries between them (e.g. a static file with registration entries that gets applied), but fails on state related to the agent.

For each agent the server persists information about the SVID that was attested for it, including SPIFFE ID, certificate fingerprint and selectors. If an agent connects to a different instance, it won’t be able to attest itself without the information in the database and it will be told to reattest itself.

We’re wondering if it would be possible for the agent to optionally hold attestation information per server instance, to help with this use case.
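As a rough illustration of the idea, the sketch below models an agent that keeps one SVID per server instance and only attests when the instance it connects to has no matching record. It is a toy model, not SPIRE code: the `Server`, `Agent`, and `AgentSVID` types, their methods, and the serial format are all hypothetical, and attestation is reduced to a map write.

```go
package main

import "fmt"

// AgentSVID is a stand-in for the agent's X.509-SVID (hypothetical type).
type AgentSVID struct {
	SPIFFEID string
	Serial   string
}

// Server models one spire-server instance with its own independent database
// of attested agents (SPIFFE ID -> serial of the SVID it issued).
type Server struct {
	Name     string
	attested map[string]string
	next     int
}

func NewServer(name string) *Server {
	return &Server{Name: name, attested: map[string]string{}}
}

// Attest simulates full node attestation and records the result locally.
func (s *Server) Attest(spiffeID string) AgentSVID {
	s.next++
	svid := AgentSVID{SPIFFEID: spiffeID, Serial: fmt.Sprintf("%s-%d", s.Name, s.next)}
	s.attested[spiffeID] = svid.Serial
	return svid
}

// Knows reports whether this instance has a local record matching the SVID.
func (s *Server) Knows(svid AgentSVID) bool {
	serial, ok := s.attested[svid.SPIFFEID]
	return ok && serial == svid.Serial
}

// Agent holds attestation state per server instance, keyed by server name.
type Agent struct {
	SPIFFEID string
	svids    map[string]AgentSVID
}

func NewAgent(spiffeID string) *Agent {
	return &Agent{SPIFFEID: spiffeID, svids: map[string]AgentSVID{}}
}

// Connect returns the SVID to use with s and whether attestation was needed.
func (a *Agent) Connect(s *Server) (AgentSVID, bool) {
	if svid, ok := a.svids[s.Name]; ok && s.Knows(svid) {
		return svid, false // reuse the state previously established with s
	}
	svid := s.Attest(a.SPIFFEID)
	a.svids[s.Name] = svid
	return svid, true
}

func main() {
	agent := NewAgent("spiffe://example.org/spire/agent/node1")
	a, b := NewServer("A"), NewServer("B")
	for _, s := range []*Server{a, b, a, b} {
		svid, attested := agent.Connect(s)
		fmt.Printf("server=%s serial=%s attested=%v\n", s.Name, svid.Serial, attested)
	}
}
```

In this model the agent attests once per instance and afterwards moves between A and B without either server needing the other's records.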

rturner3 commented 5 months ago

@sorindumitru Thanks for creating this issue. I have a couple clarifying questions to make sure I understand your goals and use case:

Do you use a database with a single node or multiple nodes (primary + read-only replicas)?
Is your end goal to make sure that SPIRE Agents can continue to function by syncing entries and requesting SVIDs to be signed while the server's primary DB replica is unavailable? In other words, if SPIRE Server was in a "read-only" mode where it isn't accepting write requests (e.g. entry create/update/delete) or attesting new agents because the DB primary replica is unavailable, but is still signing SVIDs requested by agents, would that be an acceptable outcome for you?

edwbuck commented 4 months ago

Do you use a database with a single node or multiple nodes (primary + read-only replicas)?

Multiple nodes are being supported, with one independent database per node. These databases are effectively Read-Only to their own server instance. The entries, dns_names, bundles, federation information, and all other "configuration" items are being written to by a different component which maintains the "Write database" that manages the information in the "Read Only database".

This is related to the CQRS pattern https://miro.medium.com/v2/resize:fit:1400/format:webp/1*TaPzEj91HM06UgZoajqGwA.png but the requirements are relaxed a little. Any server can write to its own DB copy provided that the other servers do not require the write in their local database to carry out normal operations.

Is your end goal to make sure that SPIRE Agents can continue to function by syncing entries and requesting SVIDs to be signed while the server's primary DB replica is unavailable? In other words, if SPIRE Server was in a "read-only" mode where it isn't accepting write requests (e.g. entry create/update/delete) or attesting new agents because the DB primary replica is unavailable, but is still signing SVIDs requested by agents, would that be an acceptable outcome for you?

The goal is slightly broader: it also includes the ability for any server instance to attest new agents and, for servers that are dealing with agent reattestation, for the server to use only its local information in reattestation.

To clarify, consider the following scenario:

An Agent attests to Server instance A, then loses its connection and connects to Server instance B. Server instance B may accept or re-attest the Agent, depending on what the SPIRE team deems appropriate; but, Server B should not fail in any subsequent operation because it lacks information in the database written by Server A.

Subsequent operations also include certificate rotation, where B may be missing much (if not all) of the information with which the Agent's certificate was issued. A few (of the many) acceptable approaches might include:

Having Server B generate information just-in-time based on the Agent's connection details.
Having Server B force the Agent to undergo full Node Attestation, as it lacks any prior awareness of the Agent.
Having the Agent be forced into "re-attestation on certificate rotation".

sorindumitru commented 4 months ago

Do you use a database with a single node or multiple nodes (primary + read-only replicas)?

The persistent databases we use have multiple nodes: 1 primary + N read-only replicas. The setup is usually highly available, but it can have periods of downtime, e.g. due to machine issues or upgrades.

Is your end goal to make sure that SPIRE Agents can continue to function by syncing entries and requesting SVIDs to be signed while the server's primary DB replica is unavailable? In other words, if SPIRE Server was in a "read-only" mode where it isn't accepting write requests (e.g. entry create/update/delete) or attesting new agents because the DB primary replica is unavailable, but is still signing SVIDs requested by agents, would that be an acceptable outcome for you?

I’d actually like to go a bit further than that and make sure that spire-server can operate without the need of a shared database, at least for the nested deployment model. The database is problematic to us because:

It’s a single point of failure. This can be managed a bit through nested spire and partitioning the trust domain, but it kind of becomes more of a chore the more you do it. You have to spin up multiple servers for both spire-server and the database instances for each partition to be able to have high availability.
It’s not something that the teams managing a SPIRE deployment own or should reasonably own. It requires database-specific knowledge and it’s usually managed by different teams. You end up trusting something external for managing a core part of your trust domain, the trust bundle.

AFAICT, the only item from the database that is actually required to be shared among instances is the bundle (which for the nested deployment is handled by the upstream instances):

Registration entries can be created in all spire-server instances if need be
Same for federation relationships
Agents can re-attest to different instances.

I want to see if and what it would take to support such a model, where the downstream instances could have local, independent databases, e.g. sqlite or local Postgres instance. As long as you have a way to reconstruct the list of registration entries, this database does not matter anymore. You could in fact just populate it at start up. I think there might be some support for this from other users in the community, the proposal in https://spiffe.slack.com/archives/C7XDP01HB/p1711212988306819 seems to be somewhat similar.
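A minimal sketch of the "populate it at start up" idea, assuming the registration entries live in a static JSON document that every instance applies to its own local store. The `Entry` shape, the `LocalStore` type, and the JSON layout are hypothetical simplifications and do not use SPIRE's datastore or API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Entry is a simplified, hypothetical registration entry.
type Entry struct {
	SPIFFEID  string   `json:"spiffe_id"`
	ParentID  string   `json:"parent_id"`
	Selectors []string `json:"selectors"`
}

// key identifies an entry by its content so re-applying the file is idempotent.
func (e Entry) key() string {
	sel := append([]string(nil), e.Selectors...)
	sort.Strings(sel)
	return e.SPIFFEID + "|" + e.ParentID + "|" + strings.Join(sel, ",")
}

// LocalStore stands in for one instance's independent database.
type LocalStore struct{ entries map[string]Entry }

func NewLocalStore() *LocalStore { return &LocalStore{entries: map[string]Entry{}} }

// Apply makes the store match the desired list: it creates missing entries
// and deletes entries that are no longer listed.
func (s *LocalStore) Apply(desired []Entry) (created, deleted int) {
	want := make(map[string]Entry, len(desired))
	for _, e := range desired {
		want[e.key()] = e
	}
	for k, e := range want {
		if _, ok := s.entries[k]; !ok {
			s.entries[k] = e
			created++
		}
	}
	for k := range s.entries {
		if _, ok := want[k]; !ok {
			delete(s.entries, k)
			deleted++
		}
	}
	return created, deleted
}

// LoadEntries parses the contents of the static entries file.
func LoadEntries(data []byte) ([]Entry, error) {
	var doc struct {
		Entries []Entry `json:"entries"`
	}
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return doc.Entries, nil
}

// staticEntries plays the role of the shared static file.
const staticEntries = `{"entries": [
  {"spiffe_id": "spiffe://example.org/web", "parent_id": "spiffe://example.org/node", "selectors": ["unix:uid:1000"]},
  {"spiffe_id": "spiffe://example.org/db", "parent_id": "spiffe://example.org/node", "selectors": ["unix:uid:1001"]}
]}`

func main() {
	entries, err := LoadEntries([]byte(staticEntries))
	if err != nil {
		panic(err)
	}
	// Each instance reconciles its own store from the same file at startup.
	for _, name := range []string{"server-a", "server-b"} {
		store := NewLocalStore()
		created, deleted := store.Apply(entries)
		fmt.Printf("%s: created=%d deleted=%d\n", name, created, deleted)
	}
}
```

Because `Apply` reconciles against the full desired list, an instance that starts with an empty or stale local database converges to the same set of entries as its peers.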

Having an agent SVID per instance is one way to make this work at the moment, but I’m not sure it’s the only one.

rturner3 commented 4 months ago

The database is problematic to us because:

It’s a single point of failure. This can be managed a bit through nested spire and partitioning the trust domain, but it kind of becomes more of a chore the more you do it. You have to spin up multiple servers for both spire-server and the database instances for each partition to be able to have high availability.

Have you considered using a multi-primary database architecture? That is a common way to address single points of failure in the database. MySQL offers this as a native feature with group replication, and I believe there are similar solutions built around Postgres (although I have less experience with it, and the only options might be paid services from what I've seen online).

It’s not something that the teams managing a SPIRE deployment own or should reasonably own. It requires database-specific knowledge and it’s usually managed by different teams. You end up trusting something external for managing a core part of your trust domain, the trust bundle.

SPIRE wasn't really designed to be a fully stateless system. There are several features in SPIRE that are based around the assumption that SPIRE server instances in an HA deployment are sharing a database:

Node aliasing depends on agent selectors saved in the database on node attestation
Agent banning depends on attested node records saved in the database
SPIRE Server CAs and JWT signing keys are saved into the bundle in the database when an upstream authority is not used so that workloads can verify X.509-SVIDs issued by any Server in the deployment (this one sounds like it wouldn't be a concern for your particular use case since this is a child deployment in a nested architecture)
Server authorization of agents enforces the constraint that only a single agent representing a given selector set can be connected to a server deployment at a time by verifying the agent X.509-SVID serial number saved in the database on API requests (@evan2645 may have more context on this design decision)

If HA SPIRE Server instances are not sharing a database, we would need to find a separate solution in SPIRE to replicate this state between instances. This introduces a good deal of complexity to SPIRE Server and a number of new failure modes that would need to be considered. This would be a large shakeup, so I think we need to have pretty strong justification to consider going in that direction.

rturner3 commented 4 months ago

Server authorization of agents enforces the constraint that only a single agent representing a given selector set can be connected to a server deployment at a time by verifying the agent X.509-SVID serial number saved in the database on API requests (@evan2645 may have more context on this design decision)

From offline discussion with @evan2645, this serial number authorization check also prevents a threat actor who has compromised an agent SVID from being able to:

Use this SVID as a client cert to the Server without being detected by the Server (i.e. server doesn't allow multiple connections with the same SVID as the client cert because it wouldn't know which one is the "real" agent)
Renew the agent SVID indefinitely without being detected. In the current design, only one of the "real" agent and the attacker can renew the SVID successfully, but not both. If the attacker renews the agent SVID, the SPIRE operator should be able to detect this because the real agent will start to see its calls to the server fail as unauthorized, which should tip off the operator that there is a rogue agent or actor impersonating that agent.

sorindumitru commented 4 months ago

I'm away for a week and a bit, but I'll come back to you next week with some answers and more specific scenarios where this could help.

edwbuck commented 4 months ago

From updates within the team, we'd like to withdraw this request.

edwbuck commented 4 months ago

@evan2645 This is a request that, in the event the agent was already attested but the server was not aware of it (due to database restoration / lack of database replication / whatever), the server and agent would fall back into reattestation. The main idea here was that a deployment could use multiple SQLite databases which had their registration entries sync'd by an external third-party platform. For such an approach to work, one would require that each Server instance reattest the agent when it connects with a valid certificate that originated in a different server instance (and is thus unknown to the Server accepting the connection).

Eventually, the team opted to move with a more mundane shared / replicated database setup. This effort is no longer needed. Please consider closing this issue.

MarcosDY commented 4 months ago

no longer required.

spiffe / spire

spire-agent: allow maintaining attestation state per spire-server instance #4992