opensearch-project / opensearch-migrations

All things migrations and upgrades for OpenSearch

Apache License 2.0

34 stars 27 forks source link

[RFC] Support for dual writes (to a shadow-mode cluster) #91

Open kartg opened 1 year ago

kartg commented 1 year ago

Overview & Need

The aim of this issue is to present the conclusions from #89 as a high-level end-to-end architecture diagram and surface open questions to the community. As a reminder, the purpose of the architecture outlined below is to enable duplication of traffic from a "primary" cluster to a shadow-mode cluster. Such an approach enables validation of the shadow cluster's configuration and behavior while ensuring that the data between the two clusters is kept in sync.

Existing solutions do not adequately solve for this use-case. See https://github.com/opensearch-project/opensearch-migrations/issues/89#issuecomment-1402125011 for the reasoning for not using Logstash or Data Prepper. The need to capture responses from the primary cluster also removes Cross-Cluster Replication (CCR) as a solution to this problem.

Out of Scope

The following aspects of the diagram are out of the scope of this issue:

Setup of the shadow cluster
Transfer of data already present in the primary cluster to the shadow cluster
The design of the Validator

High-Level Architecture

migrations_20230125

Some notes on the diagram above:

Component colors
- Components in blue are expected to already exist for a user.
- Components in green are new ones that the user must set up.
- Components in orange are new ones that are necessary for validation, but otherwise decoupled from the dual-write process.
After setting up the cluster and loading existing data into it, the user would need to change their client(s) from hitting the primary cluster endpoint to the splitter endpoint.
The splitter component
- The main purpose of the splitter component is to duplicate write traffic to the two clusters. Read requests will always be served from the primary cluster.
- The splitter is also a convenient layer at which to enable validation. It can capture request+response data and submit the information to the Validator for analysis
- Note that enabling such analysis would require TLS termination at the splitter. Communication between the splitter and each of the clusters can occur over TLS but requires additional setup

Open Questions

Is the endpoint change (primary cluster to splitter) a barrier to entry for users?
Is the change in TLS termination a concern for users?

chelma commented 1 year ago

My thoughts on the design are as follows. I included the Validator in the write-up as I think it needs to be discussed at least at a high level to understand the vision I'm proposing.

The Splitter would have the following characteristics:

Low-latency, synchronous pass through of traffic to the existing cluster
Data collection for the request/response stream to the existing cluster to enable asynchronous analysis/validation
Initially, synchronous replication of traffic to the shadow cluster and data collection for its responses. Later, it may just forward the existing cluster's collected requests/responses stream to another system which asynchronously replays the stream against the shadow cluster instead of synchronously replicating traffic to the shadow cluster itself. It is not a requirement of the system for the shadow cluster to be synchronously updated alongside the existing cluster. This would open up new use-cases like being able to replay the same collected stream against multiple shadow clusters and select the best one.
In the event that TLS is enabled, the Splitter would need to offer termination/resumption (e.g. man-in-the-middle). An alternative would be to let the existing cluster's OpenSearch Nodes terminate TLS normally and instead perform the data collection/forward on those hosts, but that design-path has its own challenges.
Assuming that the user actually intends to migrate to a new cluster, they'll need to have a way to shift traffic to during the cutover and the Splitter is ideally suited to perform that function when the time comes.

The Validator would have the following characteristics:

It stores the raw requests/response streams from both the existing and shadow cluster in a durable, long-term format with a configurable retention policy. This enables re-ingestion and the ability to audit results.
It defines a clear specification for the format it expects to receive the streams in so that it can support an expanding set of use-cases, such as enabling integration testing by supplying the validator hand-crafted test streams.
It processes/compares the request/response streams to provide a query-able dataset, which itself is stored in a durable, long-term format with a configurable retention policy and has a clear format specification (for the same reasons)
It provides a default report using the query-able dataset highlighting proposed areas of interest, but makes it easy for users generate their own reports based on the underlying dataset. The default report should highlight performance differences (e.g. latency, throughput) as well as correctness differences (e.g. different search ranking results).

My thoughts on whether the endpoint change will be a barrier to entry:

An endpoint change seems unavoidable if a cutover will take place, so with that as a given, the endpoint change itself should not be a blocker for users. If the user already has something like Logstash in place, that could likely be leveraged in lieu of the proposed Splitter with appropriate guidance/configuration, and the user already accepts the caveats (latency, etc) associated with the system otherwise they wouldn’t be using it. Having a formal spec for the data ingested by the Validator decouples data generation from validation. If the user does NOT have a system like that in place, they will need something to shift traffic between their old and new clusters during cutover anyways, a role which the splitter could perform.

My thoughts on TLS Termination:

If TLS MITM at the Splitter is not viable, then it seems the traffic duplication and data collection must be performed where TLS terminates in the Cluster. Having Nodes handle those duties seems like it will impact system performance greater than a MITM approach outside the cluster. Additionally, such capabilities would likely need to be engineered directly into OpenSearch/Elasticsearch, raising questions about maintaining the feature across multiple and whether legacy Elastic Clusters will be supported. Having the Nodes do it seems like a more technically complex solution as well.

gregschohn commented 1 year ago

Thanks for writing this up! I have 3 general concerns. Before I get to them, let's ask...

Should this splitter infrastructure be long lived to assist in recurring updates or will only be setup on an as-needed basis?
What is the primary reason for the splitter - is it to preserve requests and responses so that we can validate and/or hydrate a shadow cluster or is it to keep a shadow cluster 'in-sync' with the source cluster? If we try to do both, the likelihood of failure/inconsistency increases and the complexity to rectify 3 different needs also increases.

General concerns are as follows. 1) TLS - cert management is a pain and a source of operational errors and security risks.
a) We should find out what the distribution is going to be amongst our customers for acquiring/managing another cert on something that may need to be vetted. b) Securing HAProxy against DDOS and spoofing attacks when terminating TLS may require a significant investment. I'm not sure how much of that is already handled in a satisfactory manner. 2) Splitting socket traffic between two versions of clusters quickly becomes ill-formed. If a REST contract changes in just one place, it can have dramatic consequences on the consistency of the two clusters as well as future responses. For example, if a write fails on a target cluster, future queries could be impacted. This limits the utility of the splitter mostly to when the clusters are identical versions. 3) This setup is not very fault tolerant (transient errors, or an API change as indicated above). It will probably work fine for small clusters and we might be able to boost the likelihood of success (in validating) with checkpoint reconciliations. However, as scale in load and duration increases, failures will be inevitable and we'll need additional ways to deal with them. a) We may not have a good way to signal to a user that the expensive target cluster that has been setup could now be useless. b) If the splitter were to crash between sending and receiving a mutating operation from the source cluster, but before the contents were committed to either the log and/or the source cluster, what would be the right thing to do? If we ignore it, the target isn't complete and the logs for validation won't be either. It feels like we'll need some kind of coordination with the translog to reconcile our extra information when something goes wrong.

Given the above, I'm wary about splitting traffic to a target server because of it's narrow application and fault-intolerance. I'd like to see a comparison of how this compares to state of the art, or if it is state of the art, how users deal with transient failures. Streaming the requests and responses elsewhere could decouple how those get logged and how they get committed to a target cluster (or multiple ones when evaluating variant clusters).

gregschohn commented 1 year ago

Have you considered dividing the responsibilities of the splitter. You could make the splitter reliably stream requests for the source cluster, reliably logging requests & responses using something like a sliding window transaction. When we detect that we have a hole (which would require additional monitoring from other HAProxy nodes when this is distributed), we at least find out what mutations have occurred (from interacting with the cluster) and can indicate that non-mutations may be missing entirely.

Then we can re-stream that reliable stream into another set of clusters, which can do further logging and could in turn kick off analysis if we wanted to - but we probably don’t want to, since validation analysis is going to be more batch focused. The advantage of offloading target writes means that we can do it faster if that’s cheaper. We can also spend more time retrying. It can be time shifted altogether so that there can be some supervision if things go wrong. We can also add some rewriting logic (again, supervision and performance would be different than if we were to attempt to do this on the splitter directly). Lastly, we could do all of this in real-time as well, but should still probably be asynchronously. That should give this platform the broadest applicability.

If we didn't want to make logging mutations transactional, we could also take the imperfect logs and replay them against a newly stood up source cluster and the target. That would at least setup consistent ground truths (provided that we're writing carefully, which should be easier since it wouldn't be on a production workflow).

So, all of the above could drastically reduce costs and could increase fault tolerance, which could be implemented now or later.

chelma commented 1 year ago

Thanks for the feedback, @gregschohn.

I think it's important to keep two things in mind.

The design should be iterable and expand to ever-improving states, but does not have to be perfect at the beginning
There are no magic wands and no perfect solutions. Tradeoffs are inevitable. This is doubly true when security questions enter the picture.

That said, here are my thoughts.

Should this splitter infrastructure be long lived...

Good question, and it's something we can consider as a piece of advice for users. Obviously, the longer lived the splitter is the more its setup cost is amortized.

What is the primary reason for the splitter...

INITIALLY it is proposed that the splitter both keep the primary and shadow clusters in sync as well as collect the req/res data from both streams for offload to the Analyzer/Validator. However, as I explained above, there is no reason that the shadow cluster must be updated synchronously w/ the primary cluster and I see value in merely having it relay the req/res stream from the primary to another system to replay on the shadow cluster. That said, the proposed design can easily iterate into that at a later date when we have time to invest in a more sophisticated replay mechanism.

TLS - cert management is a pain and a source of operational errors and security risks

I spent years managing hardware security modules so I completely get your point here. However - what's the alternative? There are no magic wands we can wave so we have to accept a tradeoff. If we want to support TLS but can't do MITM, then I don't think the splitter is a viable option. Now maybe that's the route we need to go down, but it's not like there aren't problems on the other side as well as I talked about above. The real question is which route offers the most overall value?

Splitting socket traffic between two versions of clusters quickly becomes ill-formed...

I don't quite understand your concern here; it seems axiomatic that different ES/OS versions would return different results when the same traffic is replayed on them. Having something to faithfully record the results would still be helpful in sussing out what the difference were.

This setup is not very fault tolerant...

First, no distributed system is perfect and they all involve tradeoffs. The question is, which tradeoffs are acceptable? It's unclear from your statement above which specific faults you're concerned about and which specific tradeoffs you think we should make to address them. I don't think we will find a solution that guarantees perfect capture of the req/res stream AND no impact to the user's production traffic, which is what it sounds like you're looking for.

It seems inevitable that if you add one or more hosts between two communicating machines, there is some scenario where issues with the intermediate hosts disrupt the communication channel. We can apply best practices (like HA) to ensure that if a splitter goes down another can take its place. I could be wrong, but I'd personally prioritize maintaining the user's existing data stream over accurate sample collection, and I accept that we will never have a perfect data stream because it feels the cost to collect one is too high.

chelma commented 1 year ago

Have you considered dividing the responsibilities of the splitter.

I think my previous response should address your concerns about splitting the responsibilities, but in short - I agree the responsibilities should be split in the long term. However, I'm trying to devise a sequence of deliverables that we can iterate on with agility and in my mind the first deliverable is synchronous traffic splitting between the primary and shadow clusters, not a replay mechanism. I'm on-board with exploring replay as a future iteration.

kartg commented 1 year ago

What is the primary reason for the splitter ....

@chelma @gregschohn I think "primary reason" may be the wrong way to look at this. Could we rephrase this question to "why is the splitter required" ? Then the answer becomes:

To capture the response from the primary cluster
To capture actual user requests for the purpose of replay/validation

As I noted, no existing solutions would solve for these two requirements. Since solving for these requirements necessitates building the splitter layer, we're choosing to also build in the following:

Keeping the clusters in sync (otherwise solveable by CCR)
Validating the clusters against each other

Should this splitter infrastructure be long lived to assist in recurring updates or will only be setup on an as-needed basis?

As @chelma said, I think we should treat the splitter layer as a "recommended-but-optional" piece (akin to the security plugin). Users can certainly deploy OpenSearch without it, but there are benefits to having it in place - particularly as we enhance the splitter over time.

In general, I agree with this:

The design should be iterable and expand to ever-improving states, but does not have to be perfect at the beginning

setiah commented 1 year ago

I suspect setting up TLS termination outside of existing setup to be a point of concern/friction for users. First they would need to test it against their application load characteristics to make sure it handles the load. It also changes their security model of their application.

Can we break down the use cases it solves as separate requirements and evaluate alternatives for each individually? This would help us know where a t-proxy like such is the only viable option. For instance live data replication, can probably work without this with other options like incremental snapshot restore, CCR or Data prepper.

nknize commented 1 year ago

The need to capture responses from the primary cluster also removes Cross-Cluster Replication (CCR) as a solution to this problem.

There are a lot of good design ideas here, and CCR has an ongoing iterative effort to redesign large components leveraging SegRep, including bringing CCR as a plugin in the core. So, could we bring these two efforts together and build primary cluster response as a configurable extension or component to CCR? This way the CCR plugin could be used to setup a "shadow cluster" (in the same manner it sets up a "follower" cluster today)? Some nice properties of CCR include the security model (document or field level) already baked in (though it will need to be refactored to a core security module), users can define replication at the field level (providing the ability to down sample a shadow cluster on migration), and the experience is already largely defined and familiar to the OpenSearch users.

wbeckler commented 1 year ago

That would be really cool to combine efforts, but a solution for the problem migrating data to a new cluster would need to work without having to significantly upgrade the cluster to enable the solution.

kartg commented 1 year ago

This way the CCR plugin could be used to setup a "shadow cluster" (in the same manner it sets up a "follower" cluster today)

@nknize I really like this idea. I've added an issue to track this in our repo - https://github.com/opensearch-project/opensearch-migrations/issues/100

That said, as @wbeckler noted, this would only be useful for migrations starting at the version where CCR is integrated into core. Is there a way to bring this functionality to older versions (of ElasticSearch or OpenSearch) ?

dblock commented 1 year ago

I'll try to avoid repeating the many good comments and concerns above.

Is the endpoint change (primary cluster to splitter) a barrier to entry for users?

Yes, I think it's a barrier, but also a question of trade offs. I believe very few people run with custom DNS endpoints. Having to change both client and server requires a lot of coordination.

I like the suggestions about leveraging CCR, but it has major limitations. In a rolling upgrade we put the cluster in a mixed state, then upgrade node-by-node. We could implement something similar, but I think a shadow cluster is not enough because we can't fail over write traffic, so you want bi-directional replication. A big limitation of this approach is that to implement such replication with failover would require new code, therefore migrating from older versions of the product is not improved. You could write plugins for older versions of the product, but that may be challenging.

With traffic splitting you need to consider authentication and authorization to the shadow cluster. If you're going to be replaying requests, the credentials in that data may be expired.

zengyan-amazon commented 1 year ago

Regarding determine shadow cluster catch up with primary cluster use case. One design I had seen in the past was:

letting the shadow to start receiving live traffic (maybe after restoring from snapshot)
after shadow start receiving live traffic, replay the transactions from translog, till the time point that shadow start receiving live traffic, when processing replay, ignore the stale data records based on doc revision (primary term & seqNo in OpenSearch)
once the translog records are processed, the shadow is caught up with primary and keep receiving live traffic

This approach made the request replay a bit complex, but it can avoid the synchronization between primary and shadow to determine if shadow is up-to-date with primary

nknize commented 1 year ago

This approach made the request replay a bit complex, but it can avoid the synchronization...

Replaying history from the translog is largely what CCR does (with some retention lease optimizations). So I'm +1 this design approach.

What I'm suggesting in my comment is we don't build a whole new axel and wheel but design the shadow cluster as an aftermarket accessory to the existing CCR. CCR is already designed to follow a cluster, but (according to this design description) the missing piece is capturing response from the primary.

Maybe that's as simple as a new DiscoveryNode that connects to primary just to capture the response and let the "shadow cluster" simply be 1.x/2.x CCR. My comment about designing for segrep is important so we don't spend additional effort perfecting something we can bake into the next segrep CCR design (and deprecate this approach sooner than later).

Work backwards from combining the efforts where possible.

Also, what's the earliest primary version you want to shadow?

chelma commented 1 year ago

Based on the (potentially limited) data I have, the majority of potential migration users are on Elasticsearch 6 or newer. If that assumption holds true, then perhaps it's reasonable to request that any users on older versions of Elastic upgrade to 6.X before proceeding with a migration. Over time, the number of such affected users should naturally approach zero.