spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

Add failOver state for datastore operations #1517

Closed kirutthika closed 2 years ago

kirutthika commented 4 years ago

Background When read replicas are enabled for SPIRE, all reads for the GetNodeSelectors and ListRegistrationEntries datastore operations are directed to read replicas when the TolerateStale flag is set to true, with some allowed data staleness. This feature helps scale the control plane horizontally by moving read-heavy operations to read replicas.

Problem We recently conducted a few game day experiments with SPIRE, particularly around the availability of SPIRE during downtime of the SPIRE database. We observed certain availability bottlenecks with FetchX509SVID calls due to their dependency on the primary database, which could potentially be mitigated with read replicas.

A summary of the experiment is added below.

Scenario: Brief outage of SPIRE DB write cluster

[Four screenshots from 2020-04-06: FetchX509SVID success rate, database read traffic on the write instance and read replicas, and FetchAttestedNode and FetchBundle datastore operation availability during the experiment.]

The first picture shows the availability drop in FetchX509SVID success. The second picture shows the corresponding drop in database reads: the yellow line represents reads on the write DB instance, and the orange and green lines represent reads on the read replicas.

The third and fourth pictures show the availability drop of the FetchAttestedNode and FetchBundle datastore operations during this experiment.

FetchX509SVID internally calls the FetchAttestedNode and FetchBundle datastore operations, neither of which is allowed to go through read replicas. Agent SVIDs are validated as the first step of FetchX509SVID by fetching the attested node from the write DB. Hence, during the event, the availability of the FetchX509SVID API went down to zero since all calls to the write DB instance started failing.

Proposed Solution To maintain availability of a critical API such as FetchX509SVID in the event of a primary DB outage, it is preferable to tolerate some staleness in the datastore operations. To propagate this tolerated staleness across all necessary datastore operations, we could define a global boolean state flag within the datastore modules, such as readOnly/failOver (default: false), that most datastore read operations would consult in order to direct all reads to read replicas when set to true. Clients can build their own circuit-breaking mechanism on their end to switch to the failOver state in the event of a failure.

With failover logic like the above, an outage of the control plane would only impact new nodes and workloads; existing agents and workloads would see minimal impact and continue to be served by the read DB replicas.
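
To make the idea concrete, here is a minimal Go sketch of how such a plugin-wide flag might gate connection selection inside the SQL datastore plugin. The failOver flag, readConn helper, and SetFailOver function are hypothetical illustrations of the proposal, not the actual SPIRE implementation:

    package sqlstore

    import (
        "database/sql"
        "sync/atomic"
    )

    // failOver is a hypothetical plugin-wide flag. When set, read operations
    // that would normally require the primary are redirected to the read
    // replicas, accepting some data staleness.
    var failOver atomic.Bool

    // readConn is a hypothetical helper that picks the connection for a read
    // operation. Reads that already tolerate staleness always use the replica;
    // other reads use the primary unless failover is engaged.
    func readConn(primary, replica *sql.DB, tolerateStale bool) *sql.DB {
        if tolerateStale || failOver.Load() {
            return replica
        }
        return primary
    }

    // SetFailOver would be flipped by whatever circuit-breaking mechanism the
    // operator builds (RPC, config reload, flag file, etc.).
    func SetFailOver(enabled bool) {
        failOver.Store(enabled)
    }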

We could also discuss any other potential failover alternatives to improve the fault tolerance and availability of the FetchX509SVID API.

azdagron commented 4 years ago

Hi @kirutthika, thanks for putting together this detailed issue. In general, I think having a way for the datastore to switch into read-only mode to survive an outage sounds useful. I have a few questions that I'd like to float out there that I think we'll need answered before we come up with a solid plan:

  1. Should SPIRE be responsible for somehow notifying operators that they are in this specific situation (so failover can be enabled) or should that be out of scope?
  2. Should SPIRE be responsible for somehow notifying operators that the master is now operational (so failover can be disabled) or should that be out of scope?
  3. If SPIRE can do #1 and #2, would it then be appropriate for the SQL plugin to auto-detect the situation and turn on and off failover automatically?
  4. Assuming triggering failover is manual, how do we communicate the failover intent to the SQL plugin? Is this something we add a new RPC for in SPIRE server, and subsequently a DataStore plugin interface change? Is there some other ergonomic way we can keep this scoped to the plugin without having to introduce plumbing?

I'd probably lean towards manual activation versus some heuristics-based approach.

I am curious about thoughts on how the plugin is told to fail over. There are probably many approaches here. One is a specific RPC and generic datastore plugin support. Another is some generic model for pushing dynamic configuration into plugins involving a new kind of auxiliary plugin service.

azdagron commented 4 years ago

@evan2645 reminded me of something we discussed when originally introducing "TolerateStale" support (#1385). Read replicas were originally introduced to solve a scaling problem because of the entry authorization check performance. Using them to satisfy sensitive queries, when they may not be up to date or in sync, has security implications, including potential impact on authorization checks. This is what led us to scope things down to very specific resources for #1385.

Thinking about it more, even if we took this feature, it seems that it would only help agents continue to rotate workload SVIDs. Entry registration, node attestation, rotating agent X509-SVIDs, possibly server CA rotation, would all still be broken, among other things. While it would be nice to continue rotating workload SVIDs while a database master is unavailable, future feature enhancements might make that impractical (e.g. recording all X509-SVID issuance in the database during rotation for revocation purposes).

In my mind, the most important properties to maintain in the event of some sort of SPIRE dependency outage are:

  1. Agents continue serving cached SVIDs to workloads
  2. Servers and Agents recover gracefully when the dependency is restored.

Maintaining both of those properties is largely a function of the TTLs assigned to various certificates (Workload X509-SVIDs, Agent X509-SVIDs, CA certificates). Both the server and agents are relatively aggressive in renewing certificates well before they expire.

All this to say that I wonder if the complexity involved with putting in this kind of feature is really worth it? Does it move the needle far enough in perceived robustness (i.e. workloads have good X509-SVIDs) to justify the additional complexity and outweigh the possible weakening of SPIRE's security posture?

mcpherrinm commented 4 years ago

#1542 seems related: some kind of ops-driven need for a failure condition

mcpherrinm commented 4 years ago

I'd like to see a related thing too: If a particular read replica fails, I'd like to be able to have reads go to the primary (as the individual read replicas have lower availability guarantees than the primary in our topology)

mcpherrinm commented 4 years ago

One thing I'm wondering: In my infrastructure, our remediation for a DB primary failing is to promote a read replica and reorganize our topology -- which of course may not always be possible (in particular, that would be tricky in some async replication setups).

I think this sounds like a reasonable-ish feature, though I think it might be a bit complicated to implement.

kirutthika commented 4 years ago

@azdagron, Thanks for your comments and sorry for the late reply.

I understand the concerns that you and Evan are pointing out. With a state such as failOver, it is implicit that most queries will not be up to date and that functionality is limited to fewer operations. You pointed out an important property during a dependency outage: the agents continuing to serve cached SVIDs to workloads. Beyond relying on TTLs, having this feature would let workload SVIDs rotated up until the moment of failure continue to be served from the agent side without failures until the primary DB is recovered. We have alerting mechanisms in place that would treat this as a critical event, so mitigation and recovery of the primary database would take place while we maintain operational availability to the extent possible with read replicas. Having this feature would also help us gradually recover from the failOver state to the fully operational state, so that the primary DB is not hit with a surge of throughput during recovery. It would also serve as a starting point for thinking about other fault-tolerance measures for shifting the heavy operational dependency away from the primary write DB, while evaluating security implications along the way.

Let me know if you have any other ideas in this area that we could potentially consider to avoid these failures.

I'm also interested in learning more about the effort around 'recording all X509-SVID issuance in the database during rotation for revocation purposes'. Can you please elaborate or point to a GitHub issue?

As for your other comments on implementation:

For 1, 2 and 3, IMO it is better for clients to leverage their alerting mechanisms to detect this situation and take the necessary actions. This would keep the SPIRE-side implementation simple to start with. We can definitely revisit auto-detection on the SPIRE side in the future.

For 4, I am currently thinking of two ways to communicate the failOver state to the datastore plugin. Each has its own set of pros and cons. I'm listing the approaches now, but will dig deeper and get back with more data points on which seems more favorable.

a) Have a separate RPC call to communicate the failOver intent and inject the failover state into the datastore plugin. Pros: dynamic, can help automate circuit breaking with fewer interactions. Cons: controlled/secure access to this RPC is required.

b) Implement the failOver state as a datastore plugin config in the SPIRE server configuration. As part of the circuit breaker, we can create failOver flag files in a mounted directory on the spire-server hosts, populated with the relevant flags. Pros: controlled access to the failOver state. Cons: requires a spire-server restart, plus additional work to ensure spire-server initialization is not dependent on the primary DB. I will also explore whether we can dynamically inject these configurations.
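
As a rough illustration of the "dynamically inject" idea at the end of option b, an external circuit breaker could drop a flag file into a mounted directory and the plugin could poll for it. This is only a hedged sketch under that assumption; the path, poll loop, and the SetFailOver helper from the earlier sketch are hypothetical, not existing SPIRE behavior:

    package sqlstore

    import (
        "os"
        "time"
    )

    // watchFailOverFile is a hypothetical poller for option (b): an external
    // circuit breaker creates a flag file in a mounted directory, and the
    // plugin stays in failover mode while the file exists. The path and poll
    // interval are illustrative only.
    func watchFailOverFile(path string, interval time.Duration, done <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-done:
                return
            case <-ticker.C:
                _, err := os.Stat(path)
                SetFailOver(err == nil) // file present => failover engaged
            }
        }
    }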

azdagron commented 4 years ago

Regarding 'recording all X509-SVID issuance in the database': there isn't an issue for this yet, but basically, in the future we want to support active revocation of X509-SVIDs. We haven't quite defined what this means yet. It could be something as simple as dropping a trusted X.509 CA from the bundle. It could also be per-X509-SVID revocation, which may mean accurate, database-level accounting of what X509-SVIDs have been issued.

You make some fair points about slowly scaling back up write load on a primary DB after recovery that I hadn't considered. @mcpherrinm also points out a realistic feature in the opposite direction to help recover from a failed read-only replica.

I personally have some level of discomfort about introducing this feature, mostly around security posture and ergonomics around enabling/disabling the failover across a cluster. The things you bring up in option b highlight some of that.

That being said, I don't think this feature adds any true complication to the datastore implementation, and as long as it is opt-in (with a STRONG disclaimer in the documentation about potential implications), I think I'd be fine with seeing this feature added.

I think coming up with a good strategy to activate failover in the SQL plugin is tricky. You outlined many of the pros and cons; I've added a few of my own below.

Configuration via RPC:

Pros:

  1. Can be modified without server restart

Cons:

  1. Controlled/secure access (I think this could be largely mitigated by scoping access to the server unix domain socket)
  2. Need new plumbing from this RPC down to the SQL plugin, requiring changes to the DataStore interface or the addition of a new auxiliary plugin service
  3. Need way to query state so it can be reconciled
  4. Do we need to persist the failover state intent?

Configuration via config file:

Pros:

  1. Controlled access to failOver state
  2. No new plumbing required
  3. Failover intent is naturally persisted

Cons:

  1. Requires server restart
  2. Additional work to ensure spire-server initialization is not dependent on primary DB

I don't think that second con for the configuration file is realistically addressable since writes would be required for the server to create a new intermediate CA.

Allowing configuration through RPC is more or less introducing a new precedent into SPIRE: the ability to dynamically adjust configuration. This isn't something we currently allow at any level. We'd want to take care to introduce it properly and not in reaction to a single feature. That isn't to say we shouldn't take this course, just that if we do, it might take some up-front design to make sure it introduces mechanisms/paradigms that are generally wanted and useful.

I'd love others' thoughts on this matter.

APTy commented 4 years ago

This is a great discussion :100: :100: :100:

Diving deeper into the topic, it seems like there is significant complexity in supporting the capability for SPIRE to "fail over" certain db actions to a read-only replica in the event of a failure of the primary write db.

Given the cost of implementation and that the solution would involve a pretty steep learning curve for operators to understand when/why to use this and how to use it correctly, I wonder if it's a worthwhile feature in the near term for SPIRE.

The primary motivation -- for SPIRE to be available in the likely event of a master db node failure -- is largely solved by spire agent's pre-fetch of SVIDs, and a change like this may not actually provide a very large bump in the overall availability of the system.

Also, if operator intervention is required in order to accomplish the failover, I wonder if that time is better spent at the database layer, e.g. by re-routing write traffic to the read replica outside of SPIRE.

mcpherrinm commented 4 years ago

If we can get FetchX509SVID and FetchJWTSVID to hit the read-only replica paths for workload SVIDs, would that help sufficiently?

Ie, instead of having an explicit "failover" state, we just always use the read-replica?

Is there a strong reason to not do that?

azdagron commented 4 years ago

In general, there is some concern about not using the write master to read attested node information when authorizing the agent.

azdagron commented 4 years ago

A bunch of us got together to discuss this. There was some good discussion around TTLs and the agent cache as the primary focus for availability in the face of infrastructure outage.

The consensus was to see if we can figure out an alternate way that doesn't involve needing to introduce a specific failover feature.

Like @mcpherrinm mentioned, if we can come up with a way to be comfortable hitting the read-only replicas for the agent authorization check, that would be very convenient. Maybe that operation could hit the write master first and then fall back automatically to the read-only replica?

kirutthika commented 4 years ago

Agreed on the alternate approaches of falling back to the read replicas. That would be more manageable. @azdagron and I could research this area a bit more and lay out further details, including how to avoid weakening the security posture.

azdagron commented 4 years ago

I've been thinking about this off and on for a bit. The only thing I can think of doing is to, operation by operation, define the primary connection used to satisfy the query and optionally a backup connection. The operations could be divided into the following groups:

  1. Security-sensitive read operations (e.g. FetchAttestedNode)
  2. Security-insensitive read operations (e.g. those that we've added TolerateStale to in order to improve read-replica usage)
  3. Write operations

For group 1, we try the write master first and fall back to read replicas. For group 2, we try the write master first and fall back to read replicas, unless TolerateStale is true, and then we do the inverse. For group 3, we only try the write master (obviously).

By "try", I mean we attempt the query and only fall back for a non-query related error (i.e. database connection failure).

This still makes me a little nervous, but I haven't been able to think up a concrete attack vector that isn't already present and mitigated.

@evan2645, @mcpherrinm, @APTy, @amartinezfayo what do you think? Thoughts? Concerns?

kirutthika commented 4 years ago

I had a similar thinking process of differentiating between read and write operations as an alternative to a failOver state affecting all operations. The only thing I'm wondering about is whether there is a way we could somehow make all reads not security-sensitive, which in my opinion is the complex part of this problem.

Also, we could perhaps target just the read operations along the FetchX509SVID path, if not all SPIRE operations.

azdagron commented 4 years ago

I've thought up a potential race condition if FetchAttestedNodes uses the read replicas.

  1. Agent renews its SVID, causing an update to the AttestedNodes table with the NewSerialNumber
  2. Agent drops the TLS connection and re-connects with the new SVID
  3. Authorization layer calls FetchAttestedNodes to obtain the record for the agent and verify that the NewSerialNumber matches that of the incoming certificate.

If there is lag on replication between 1. and 3. such that the read replicas return the stale record, then the agent request will fail with PermissionDenied. That failure, in conjunction with work we are planning to trigger re-attestation, could cause an agent to try to re-attest for no good reason.
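
To illustrate the race, here is a hedged Go sketch of the serial-number comparison performed by the authorization layer; the struct fields and function are illustrative approximations, not SPIRE's actual code. If the attested node record comes from a lagging replica, the serial recorded in step 1 is missing and the check fails:

    package agentauth

    import (
        "crypto/x509"
        "errors"
    )

    // AttestedNode roughly mirrors the fields relevant to the race: the serial
    // numbers recorded for the agent. Field names are illustrative only.
    type AttestedNode struct {
        CertSerialNumber    string
        NewCertSerialNumber string
    }

    // checkAgentSVID sketches the authorization check described above. If the
    // AttestedNode record was read from a lagging replica, NewCertSerialNumber
    // may not yet reflect the renewal in step 1, so the freshly rotated SVID
    // presented in step 2 is rejected with a permission-denied style error.
    func checkAgentSVID(node AttestedNode, presented *x509.Certificate) error {
        serial := presented.SerialNumber.String()
        if serial != node.CertSerialNumber && serial != node.NewCertSerialNumber {
            return errors.New("permission denied: agent SVID serial does not match attested node record")
        }
        return nil
    }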

azdagron commented 4 years ago

I've dropped a patch introducing an experimental flag into https://github.com/azdagron/spire/tree/experimental-prefer-ro-conn-for-reads, which causes all reads to hit the read replicas. You configure it like so:

Datastore "sql" {
    plugin_data {
        ...
        experimental {
             prefer_ro_conn_for_reads = true
        }
    }
}

If we can try this out in a loaded system and monitor for PermissionDenied status codes returned from the authorization layer, it may help us determine whether the above race is a practical concern.

azdagron commented 4 years ago

I kind of feel like we're getting down into the weeds a little here. I feel like we are maybe focusing too deeply on a solution without understanding what impact it's going to have on the overall availability of the system, particularly from a workload point of view.

Can you remind me of the circumstances that led to #1517 being opened? In particular, what alerted you that something was wrong? Was it error logs/metrics emitted by the server? By the agents? Or was there actual downtime for existing workloads being serviced by agents?

We've mentioned this a few times on the call, but here is a small list of operations that will not work when the write master is down:

  1. Agent attestation
  2. Agent SVID rotation
  3. Server CA rotation
  4. Workload registration
  5. Fetch Workload SVIDs*

The solutions we've discussed thus far have all been laser-focused on that last item, fetching workload SVIDs. However, when in this state, there are so many other things that will not work that, from an infrastructure perspective, you will already be in hot water. In particular, until the write master problems are resolved, agent SVID rotation cannot be performed and agents will fall out of the trust domain, which will require re-attestation, which is generally a painful process.

From a workload perspective, however, the agent cache should already be hot with workload SVIDs that have at least half of their TTL remaining. Workloads should more or less be oblivious to the outage already. Agent SVIDs will also have at least half of their TTL remaining. Changing things up to allow workload SVID renewal for an additional half-agent-TTL amount of time seems like a small gain, considering the larger, looming infrastructure outage caused by agents dropping out of the trust domain if the write master is not fixed.

I'll also mention that this scenario closely matches up with that described in #1542. It seems like "break-glass" mode would be an appropriate response to the database master going down (i.e. a scenario in which SVID rotation can't happen). Instructing workloads to ignore expiration for some time until the outage is repaired seems like an appropriate mitigation strategy.

Last, we've recently discussed adding a configurable option that allows operators to tune the refresh threshold on SVIDs (#1754). This would further enable you to turn up the rotation frequency to give you more time to continue providing SVIDs to workloads while fixing the outage.

To summarize, I think we're digging in and trying to mitigate the issue at a lower level that doesn't end up buying us as much as we'd like. The ultimate goal and purpose of SPIRE is to continue serving SVIDs to workloads. The agent cache is the primary defense for continuing to accomplish that goal while surviving infrastructure outage. I think a focus on other solutions, like "break-glass" mode and a configurable rotation interval will go much farther than efforts to fail over to read replicas when the write master is broken.

rturner3 commented 2 years ago

Closing this issue due to staleness. If you encounter any other problems around database reliability, please feel free to open a new issue.