opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[FEATURE] Use Pluggable translog for fetching the operations from leader #375

Closed saikaranam-amazon closed 2 years ago

saikaranam-amazon commented 2 years ago

What are you proposing? We're working on utilising Segment Replication for CCR and plan to make it the default choice for replication. However, we aren't planning to deprecate logical CCR as of now.

To support logical replication in the long run, we propose relying on the Pluggable translog for fetching the operations for CCR (logical). More details here. Here are the key points:

Why support logical replication?

| Local replication on leader | Translogs | Plans to be supported in near future | Source |
| --- | --- | --- | --- |
| Logical | Local | yes | primary and replica |
| Logical | Remote | no | primary |
| Logical | No-op | no | can fetch from lucene |
| Segment | Local | yes | primary |
| Segment | Remote | yes | primary's remote |
| Segment | No-op | yes | can't fetch |

How did you come up with this proposal? Follow-up from https://github.com/opensearch-project/OpenSearch/issues/1100.


What is the user experience going to be? Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.

Currently, CCR leverages logical replication to copy data from the leader to the follower index. To do so, it fetches the operations on the leader index from the translog.

Benefits of doing this change: While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog opens the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.

The ability to fetch operations directly from the translog manager allows us to make CCR agnostic to the replication mechanism used inside the cluster as we also add support for segment replication. We'll also build support on top of it for the remote translog.
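As a rough illustration, here is a minimal Java sketch of a CCR fetch path that depends only on a translog abstraction rather than Engine internals. All type names here are hypothetical; only the method name newChangesSnapshotFromTranslog comes from the proposal itself.

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-in for a single replayable write operation (in the spirit of Translog.Operation).
interface TranslogOperation {
    long seqNo();
}

// Snapshot over a seqNo range, analogous in spirit to Translog.Snapshot.
interface OperationSnapshot extends Closeable {
    TranslogOperation next() throws IOException; // null when exhausted
}

// The hypothetical pluggable abstraction CCR would depend on.
interface PluggableTranslogManager {
    OperationSnapshot newChangesSnapshotFromTranslog(long fromSeqNo, long toSeqNo) throws IOException;
}

final class CcrChangesFetcher {
    private final PluggableTranslogManager translogManager;

    CcrChangesFetcher(PluggableTranslogManager translogManager) {
        this.translogManager = translogManager;
    }

    // Leader-side handling of a follower's get-changes call: the fetch path
    // touches only the translog abstraction, never Engine internals.
    void serveChanges(long fromSeqNo, long toSeqNo) throws IOException {
        try (OperationSnapshot snapshot =
                 translogManager.newChangesSnapshotFromTranslog(fromSeqNo, toSeqNo)) {
            TranslogOperation op = snapshot.next();
            while (op != null) {
                // ship op to the follower (transport layer omitted)
                op = snapshot.next();
            }
        }
    }
}
```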


Why should it be built? Any reason not to? This needs to be built so that we can keep supporting logical CCR from 3.0 onwards. The only reason not to support this would be if we want to rely solely on Segment Replication for CCR.

What will it take to execute? The changes done in the PR should solve the problem for now. In the future, when we start relying on remote translogs, we'll need to add support for fetching the operations from the leader shard's remote store directly.

What are remaining open questions? N/A


Is your feature request related to a problem? Provide an extension point for translog fetch operations under the OS Engine:

ankitkala commented 2 years ago

We do plan to move the method newChangesSnapshotFromTranslogFile out of Engine.

Since we intend to move to Segment Replication eventually, we decided not to build extension support for translog fetch operations. We'll instead move these methods to Pluggable Translog.

Since the pluggable translog has been moved out of the 2.1 release to 3.0, we'll address this when the Pluggable Translog is ready.

ankitkala commented 2 years ago

Ideally we want to move to Segment Replication by 3.0 with most logic implemented in the core. But in case we are not able to build that support in time, or we still need logical replication, we'll utilise the PluggableTranslog to fetch the changes.

ankitkala commented 2 years ago

@nknize @dblock Let me know if you have any concerns. We'll move the method newChangesSnapshotFromTranslogFile to the pluggable translog, which will go along with the Pluggable translog in one of the 2.x releases.

nknize commented 2 years ago

:tada: No concerns from me. This is great; thanks for opening this issue. The translog had been far too coupled with the Engine because of document replication. With the move to segment replication, not only is further decoupling the right move, it will also help us take the next step of promoting the translog to a first-class mechanism so it can be optionally used based on a user's durability requirements.

dblock commented 2 years ago

cc: @mch2 @kartg @andrross

mch2 commented 2 years ago

No concerns here either. With segrep we will require the newChangesSnapshot method, which isn't built off the xlog itself, to fetch ops from the index, but not newChangesSnapshotFromTranslogFile.

ankitkala commented 2 years ago

@Bukhtawar I've listed down the combinations of local replication and translog on leader.

| Local replication on leader | Translogs | Plans to be supported in near future | Source |
| --- | --- | --- | --- |
| Logical | Local | yes | primary and replica |
| Logical | Remote | no | primary |
| Logical | No-op | no | can fetch from lucene |
| Segment | Local | yes | primary |
| Segment | Remote | yes | primary's remote |
| Segment | No-op | yes | can't fetch |

For the next release at least, the changes done in the PR should suffice.

Eventually we'll need these 2 additional handlings on the CCR side (a sketch of both checks follows):

  1. Only do load balancing if the leader has a local translog.
  2. Throw a validation failure if the leader index has opted for no durability (i.e. no translogs).
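A minimal sketch of these two checks, assuming a hypothetical TranslogType enum; how CCR would actually learn the leader's translog type is exactly the open question discussed later in this thread.

```java
// Assumed enum: how the leader's translog type might be surfaced to CCR.
enum TranslogType { LOCAL, REMOTE, NONE }

final class LeaderFetchPolicy {
    // (1) Load-balance get-changes calls across primary and replicas only when
    // the leader keeps local translogs; otherwise, always target the primary.
    static boolean canFetchFromReplicas(TranslogType leaderTranslog) {
        return leaderTranslog == TranslogType.LOCAL;
    }

    // (2) Fail at replication start if the leader index opted out of durability
    // (no translogs), since there would be nothing to fetch operations from.
    static void validateLeader(TranslogType leaderTranslog) {
        if (leaderTranslog == TranslogType.NONE) {
            throw new IllegalArgumentException(
                "Cannot start replication: leader index keeps no translogs (durability disabled)");
        }
    }
}
```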

ankitkala commented 2 years ago

Summarizing the problem statement and overall approach for clarity & visibility.

Problem statement: With OS 2.0, as part of the index.soft_deletes.enabled deprecation, the newChangesSnapshotFromTranslogFile method was removed from the core engine with the assumption that it was dead code, as OS core doesn't rely on it. This method was added for CCR operations, and its removal broke the CCR implementation. As a follow-up, valid concerns were raised around the dependency of the CCR plugin on core OS methods which are implementation details and aren't guaranteed by OS to always be present.

Action Items: Here are the 3 follow-up action items from this:

Plan for the action items: Here is our plan for the 3 action items (ordered by priority). Refer here for more details.


Proposal for dependency on Pluggable Translog: Even with Segment Replication as the default choice for CCR, we might not deprecate logical replication right away. Assuming the worst-case scenario where we want to keep supporting logical replication, we'll always have a dependency on the translogs for fetching these operations. With the proposed changes, TranslogManager would support the newChangesSnapshotFromTranslog method, which CCR will rely on for fetching the changes.

One caveat here is that CCR currently load-balances the requests between the primary and replicas on the leader for fetching the changes. This might not work with Pluggable Translogs. For example:

Proposed changes:

ankitkala commented 2 years ago

@Bukhtawar Jotting down my thoughts based on our discussion today.

Adding folks for visibility. @krishna-ggk @saikaranam-amazon @nknize @elfisher @mch2 @sachinpkale @ashking94

krishna-ggk commented 2 years ago

Thanks @ankitkala for the thorough details. Going to focus comments on pluggable translog.

The final proposal seems fair. A couple of questions:

  1. Is it fair to assume that the translog store can't be updated for an active index? If so, CCR should perform the validation in the start API itself.

In case of remote translogs, we'll not rely on the getHistoryOperationsFromTranslog method added in https://github.com/opensearch-project/OpenSearch/pull/3948 and would instead fetch from the remote store directly.

  2. How are we thinking about the security model here?

  3. Have we evaluated primary/replica leader shards proxying the call to the remote translog, which should support load-balancing of the pull requests?

Bukhtawar commented 2 years ago

Thanks @krishna-ggk

How are we thinking about the security model here?

The model could be the same as the current flow. The way I envision this happening is by decoupling the control flow and the data flow. The control flow can still be the way it exists today with the existing security model, like the follower fetching the leader's checkpoints, while the data flow would latch on directly to the remote store with a new data security model. The advantage with this is that we retain most of what exists today while not overloading the network bandwidth on the leader cluster.

Have we evaluated primary/replica leader shards proxying the call to the remote translog, which should support load-balancing of the pull requests?

I would prefer a decoupled control and data flow to fetch data directly from the remote translog store, as described above.

For additional context:
Control flow: Follower -> Leader. The leader cluster just gives metadata or checkpoints, not the actual data.
Data flow: Follower -> Remote store. The actual data transfer happens directly from the remote store.
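A rough sketch of that split, with hypothetical types throughout (none of these are existing OpenSearch or plugin APIs): the follower makes a cheap metadata call to the leader, then pulls the actual operations straight from the remote store.

```java
import java.io.IOException;
import java.util.List;

// Checkpoint metadata returned over the control flow; carries no operation payloads.
record Checkpoint(long globalCheckpoint, String remoteTranslogPath) {}

interface LeaderClient {
    // Control flow: Follower -> Leader, under the existing security model.
    Checkpoint fetchCheckpoint(String leaderIndex) throws IOException;
}

interface RemoteTranslogStore {
    // Data flow: Follower -> Remote store, under a new data security model;
    // bulk reads bypass the leader cluster's network bandwidth entirely.
    List<byte[]> readOperations(String path, long fromSeqNo, long toSeqNo) throws IOException;
}

final class FollowerSyncTask {
    private final LeaderClient leader;
    private final RemoteTranslogStore remoteStore;
    private long lastSyncedSeqNo;

    FollowerSyncTask(LeaderClient leader, RemoteTranslogStore remoteStore, long lastSyncedSeqNo) {
        this.leader = leader;
        this.remoteStore = remoteStore;
        this.lastSyncedSeqNo = lastSyncedSeqNo;
    }

    void syncOnce(String leaderIndex) throws IOException {
        Checkpoint cp = leader.fetchCheckpoint(leaderIndex);   // control flow
        if (cp.globalCheckpoint() > lastSyncedSeqNo) {
            List<byte[]> ops = remoteStore.readOperations(     // data flow
                cp.remoteTranslogPath(), lastSyncedSeqNo + 1, cp.globalCheckpoint());
            // apply ops to the follower index (omitted)
            lastSyncedSeqNo = cp.globalCheckpoint();
        }
    }
}
```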

krishna-ggk commented 2 years ago

Thanks for expanding @Bukhtawar .

Yes, agree on the benefits of directly querying the remote store. Like you pointed out, the security model for the data flow would be key. The abstraction needs to support varied stores with different permission models. This also raises back @ankitkala's question on whether we need to expose the translog type (Remote/None/Local) or whether we can find a model that works across all of them (the latter preferred, of course).

ankitkala commented 2 years ago

The way I envision this happening is by decoupling the control flow and the data flow. The control flow can still be the way it exists today with the existing security model, like the follower fetching the leader's checkpoints, while the data flow would latch on directly to the remote store with a new data security model. The advantage with this is that we retain most of what exists today while not overloading the network bandwidth on the leader cluster.

Yep. This completely aligns with how I'm also thinking about security for fetching directly from the leader's remote store (a similar approach applies for Segments as well).


This also raises back @ankitkala's question on whether we need to expose the translog type (Remote/None/Local)

I'm slightly inclined towards having the translog type exposed but want to see inputs from @Bukhtawar first. One benefit this provides is that CCR can deterministically figure out the mechanism needed to fetch the operations (sketched after the options below).

In case we don't want to expose the translog type from the TranslogManager interface, then I see these 2 potential options:

  1. The responsibility will lie with the CCR plugin to figure out (via Settings and a FeatureFlag; not preferred) whether translogs are local/remote/none and operate accordingly.
  2. Alternatively, the logic to fetch from the leader's remote store can reside in the PluggableTranslog, as CCR would just rely on Pluggable Translogs instead (unless we're thinking about extensibility in PluggableTranslog, which might be overkill IMHO).
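For contrast, a hypothetical sketch of the exposed-type direction, where CCR branches deterministically on what the translog layer reports. The TranslogType enum is the same assumption as in the earlier sketch, and getTranslogType is an assumed accessor, not part of the actual TranslogManager interface.

```java
// Assumed enum and accessor; not the actual TranslogManager API.
enum TranslogType { LOCAL, REMOTE, NONE }

interface TranslogTypeAware {
    TranslogType getTranslogType();
}

final class FetchMechanismResolver {
    // With the type exposed, CCR can pick the fetch mechanism directly,
    // without inspecting settings or feature flags on its own.
    static String resolve(TranslogTypeAware manager) {
        switch (manager.getTranslogType()) {
            case LOCAL:
                return "fetch from the leader's primary/replica translogs";
            case REMOTE:
                return "fetch directly from the leader's remote translog store";
            case NONE:
                return "reject: leader keeps no translogs to replay from";
            default:
                throw new IllegalStateException("unknown translog type");
        }
    }
}
```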
Bukhtawar commented 2 years ago

Eventually we'll need these 2 additional handlings on the CCR side: only do load balancing if the leader has a local translog; throw a validation failure if the leader index has opted for no durability (i.e. no translogs).

Can you please confirm whether we still need to fetch operations from the translog on the leader cluster if segrep is enabled? Can we totally avoid the call in that case?

rohin commented 2 years ago

What is the user experience going to be? Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.

Currently, CCR leverages logical replication to copy data from the leader to the follower index. To do so, it fetches the operations on the leader index from the translog.

While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog opens the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.

The ability to fetch operations directly from the translog manager allows us to make CCR agnostic both to the replication mechanism used inside the cluster, as we also add support for segment replication, and to the location of the translog.

ankitkala commented 2 years ago

Eventually we'll need these 2 additional handlings on the CCR side: only do load balancing if the leader has a local translog; throw a validation failure if the leader index has opted for no durability (i.e. no translogs).

Can you please confirm whether we still need to fetch operations from the translog on the leader cluster if segrep is enabled? Can we totally avoid the call in that case?

If CCR is using SegRep, we won't even fetch operations from the translog. But if CCR is on logical replication (and the leader on SegRep), we'd still fetch the operations from the translog.
By default, we are not thinking of giving this as an option to the customer. But the intention is to keep the support for logical CCR intact even if SegRep becomes the only option for local replication.

ankitkala commented 2 years ago

We've gone ahead with the changes with an assertion that we'd not fetch the changes from the pluggable translog if the leader is using SegRep or Remote Store. This should help in reducing the combinations that we'll be supporting. If we later want to allow a particular combination, we can enable it explicitly.
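A minimal sketch of that guard, assuming hypothetical flags for how CCR learns the leader's configuration (the flag plumbing is not actual plugin code):

```java
// Illustrative guard only; the boolean flags are assumptions, not real settings.
final class TranslogFetchGuard {
    static void ensureTranslogFetchAllowed(boolean leaderUsesSegRep, boolean leaderUsesRemoteStore) {
        // Mirror of the assertion described above: never serve a translog-based
        // changes snapshot when the leader uses SegRep or a remote store.
        assert !leaderUsesSegRep && !leaderUsesRemoteStore
            : "Changes must not be fetched from the pluggable translog when the "
              + "leader index uses segment replication or a remote store";
    }
}
```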