opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[FEATURE] Use Pluggable translog for fetching the operations from leader #375

Closed saikaranam-amazon closed 2 years ago

saikaranam-amazon commented 2 years ago

What are you proposing? We're working on utilising Segment Replication for CCR and plan to make it the default choice for replication. However, we aren't planning to deprecate logical CCR as of now.

To support logical replication in the long run, we propose relying on the Pluggable translog for fetching the operations for CCR (logical). More details here. Here are the key points:

Why support logical replication?

| Local replication on leader | Translogs | Plans to be supported in near future | Source |
| --- | --- | --- | --- |
| Logical | Local | yes | primary and replica |
| Logical | Remote | no | primary |
| Logical | No-op | no | can fetch from lucene |
| Segment | Local | yes | primary |
| Segment | Remote | yes | primary's remote |
| Segment | No-op | yes | can't fetch |

How did you come up with this proposal? Follow-up from https://github.com/opensearch-project/OpenSearch/issues/1100.


What is the user experience going to be? Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.

Currently, CCR leverages logical replication to copy data from the leader to the follower index. To do so, it fetches the operations on the leader index from the translog.

Benefits of doing this change: While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog opens the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.

The ability to fetch operations directly from the translog manager allows us to make CCR agnostic to the replication mechanism used inside the cluster as we also add support for segment replication. We'll also build support on top of it for the remote translog.
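As a rough illustration, here is a minimal Java sketch of a CCR fetch path that depends only on a translog abstraction rather than Engine internals. All type names here are hypothetical; only the method name newChangesSnapshotFromTranslog comes from the proposal itself.

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-in for a single replayable write operation (in the spirit of Translog.Operation).
interface TranslogOperation {
    long seqNo();
}

// Snapshot over a seqNo range, analogous in spirit to Translog.Snapshot.
interface OperationSnapshot extends Closeable {
    TranslogOperation next() throws IOException; // null when exhausted
}

// The hypothetical pluggable abstraction CCR would depend on.
interface PluggableTranslogManager {
    OperationSnapshot newChangesSnapshotFromTranslog(long fromSeqNo, long toSeqNo) throws IOException;
}

final class CcrChangesFetcher {
    private final PluggableTranslogManager translogManager;

    CcrChangesFetcher(PluggableTranslogManager translogManager) {
        this.translogManager = translogManager;
    }

    // Leader-side handling of a follower's get-changes call: the fetch path
    // touches only the translog abstraction, never Engine internals.
    void serveChanges(long fromSeqNo, long toSeqNo) throws IOException {
        try (OperationSnapshot snapshot =
                 translogManager.newChangesSnapshotFromTranslog(fromSeqNo, toSeqNo)) {
            TranslogOperation op = snapshot.next();
            while (op != null) {
                // ship op to the follower (transport layer omitted)
                op = snapshot.next();
            }
        }
    }
}
```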


Why should it be built? Any reason not to? This needs to be built so that we can keep supporting logical CCR from 3.0 onwards. The only reason not to support this would be if we want to rely solely on Segment Replication for CCR.

What will it take to execute? The changes done in the PR should solve the problem for now. In the future, when we start relying on remote translogs, we'll need to add support for fetching the operations from the leader shard's remote store directly.

What are remaining open questions? N/A


Is your feature request related to a problem? Provide an extension point for translog fetch operations under the OS Engine:

ankitkala commented 2 years ago

We do plan to move the method newChangesSnapshotFromTranslogFile out of Engine.

Since we intend to move to Segment Replication eventually, we decided not to build extension support for translog fetch operations. We'll instead move these methods to Pluggable Translog.

Since the pluggable translog has been moved out of the 2.1 release to 3.0, we'll address this when the Pluggable Translog is ready.

ankitkala commented 2 years ago

Ideally we want to move to Segment Replication by 3.0 with most logic implemented in the core. But in case we are not able to build that support in time, or we still need logical replication, we'll utilise the PluggableTranslog to fetch the changes.

ankitkala commented 2 years ago

@nknize @dblock Let me know if you have any concerns. We'll move the method newChangesSnapshotFromTranslogFile to the pluggable translog, which will go along with the Pluggable translog in one of the 2.x releases.

nknize commented 2 years ago

:tada: No concerns from me. This is great; thanks for opening this issue. The translog had been far too coupled with the Engine because of document replication. With the move to segment replication, not only is further decoupling the right move, it will also help us take the next step of promoting the translog to a first-class mechanism so it can be optionally used based on a user's durability requirements.

dblock commented 2 years ago

cc: @mch2 @kartg @andrross

mch2 commented 2 years ago

No concerns here either. With segrep we will require the newChangesSnapshot method, which isn't built off the xlog itself, to fetch ops from the index, but not newChangesSnapshotFromTranslogFile.

ankitkala commented 2 years ago

@Bukhtawar I've listed down the combinations of local replication and translog on leader.

| Local replication on leader | Translogs | Plans to be supported in near future | Source |
| --- | --- | --- | --- |
| Logical | Local | yes | primary and replica |
| Logical | Remote | no | primary |
| Logical | No-op | no | can fetch from lucene |
| Segment | Local | yes | primary |
| Segment | Remote | yes | primary's remote |
| Segment | No-op | yes | can't fetch |

For the next release at least, the changes done in the PR should suffice.

Eventually we'll need these 2 additional handlings on the CCR side (a sketch of both checks follows):

  1. Only do load balancing if the leader has a local translog.
  2. Throw a validation failure if the leader index has opted for no durability (i.e. no translogs).
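A minimal sketch of these two checks, assuming a hypothetical TranslogType enum; how CCR would actually learn the leader's translog type is exactly the open question discussed later in this thread.

```java
// Assumed enum: how the leader's translog type might be surfaced to CCR.
enum TranslogType { LOCAL, REMOTE, NONE }

final class LeaderFetchPolicy {
    // (1) Load-balance get-changes calls across primary and replicas only when
    // the leader keeps local translogs; otherwise, always target the primary.
    static boolean canFetchFromReplicas(TranslogType leaderTranslog) {
        return leaderTranslog == TranslogType.LOCAL;
    }

    // (2) Fail at replication start if the leader index opted out of durability
    // (no translogs), since there would be nothing to fetch operations from.
    static void validateLeader(TranslogType leaderTranslog) {
        if (leaderTranslog == TranslogType.NONE) {
            throw new IllegalArgumentException(
                "Cannot start replication: leader index keeps no translogs (durability disabled)");
        }
    }
}
```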

ankitkala commented 2 years ago

Summarizing the problem statement and overall approach for clarity & visibility.

Problem statement: With OS 2.0, as part of the index.soft_deletes.enabled deprecation, the newChangesSnapshotFromTranslogFile method was removed from the core engine with the assumption that it was dead code, as OS core doesn't rely on it. This method was added for CCR operations, and its removal broke the CCR implementation. As a follow-up, valid concerns were raised around the dependency of the CCR plugin on core OS methods which are implementation details and aren't guaranteed by OS to always be present.

Action Items: Here are the 3 follow-up action items from this:

Plan for the action items: Here is our plan for the 3 action items (ordered by priority). Refer here for more details.


Proposal for dependency on Pluggable Translog: Even with Segment Replication as the default choice for CCR, we might not deprecate logical replication right away. Assuming the worst-case scenario where we want to keep supporting logical replication, we'll always have a dependency on the translogs for fetching these operations. With the proposed changes, TranslogManager would support the newChangesSnapshotFromTranslog method, which CCR will rely on for fetching the changes.

One caveat here is that CCR currently load-balances the requests between the primary and replicas on the leader for fetching the changes. This might not work with Pluggable Translogs. For example:

Proposed changes:

ankitkala commented 2 years ago

@Bukhtawar Jotting down my thoughts based on our discussion today.

Adding folks for visibility. @krishna-ggk @saikaranam-amazon @nknize @elfisher @mch2 @sachinpkale @ashking94

krishna-ggk commented 2 years ago

Thanks @ankitkala for the thorough details. Going to focus comments on pluggable translog.

The final proposal seems fair. A couple of questions:

  1. Is it fair to assume that the translog store can't be updated for an active index? If so, CCR should perform the validation in the start API itself.

In case of remote translogs, we'll not rely on the getHistoryOperationsFromTranslog method added in https://github.com/opensearch-project/OpenSearch/pull/3948 and would instead fetch from the remote store directly.

  2. How are we thinking about the security model here?

  3. Have we evaluated primary/replica leader shards proxying the call to the remote translog, which should support load-balancing of the pull requests?

Bukhtawar commented 2 years ago

Thanks @krishna-ggk

How are we thinking about the security model here?

The model could be the same as the current flow. The way I envision this happening is by decoupling the control flow and the data flow. The control flow can still be the way it exists today with the existing security model, like the follower fetching the leader's checkpoints, while the data flow would latch on directly to the remote store with a new data security model. The advantage with this is that we retain most of what exists today while not overloading the network bandwidth on the leader cluster.

Have we evaluated primary/replica leader shards proxying the call to the remote translog, which should support load-balancing of the pull requests?

I would prefer a decoupled control and data flow to fetch data directly from the remote translog store, as described above.

For additional context:
Control flow: Follower -> Leader. The leader cluster just gives metadata or checkpoints, not the actual data.
Data flow: Follower -> Remote store. The actual data transfer happens directly from the remote store.
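A rough sketch of that split, with hypothetical types throughout (none of these are existing OpenSearch or plugin APIs): the follower makes a cheap metadata call to the leader, then pulls the actual operations straight from the remote store.

```java
import java.io.IOException;
import java.util.List;

// Checkpoint metadata returned over the control flow; carries no operation payloads.
record Checkpoint(long globalCheckpoint, String remoteTranslogPath) {}

interface LeaderClient {
    // Control flow: Follower -> Leader, under the existing security model.
    Checkpoint fetchCheckpoint(String leaderIndex) throws IOException;
}

interface RemoteTranslogStore {
    // Data flow: Follower -> Remote store, under a new data security model;
    // bulk reads bypass the leader cluster's network bandwidth entirely.
    List<byte[]> readOperations(String path, long fromSeqNo, long toSeqNo) throws IOException;
}

final class FollowerSyncTask {
    private final LeaderClient leader;
    private final RemoteTranslogStore remoteStore;
    private long lastSyncedSeqNo;

    FollowerSyncTask(LeaderClient leader, RemoteTranslogStore remoteStore, long lastSyncedSeqNo) {
        this.leader = leader;
        this.remoteStore = remoteStore;
        this.lastSyncedSeqNo = lastSyncedSeqNo;
    }

    void syncOnce(String leaderIndex) throws IOException {
        Checkpoint cp = leader.fetchCheckpoint(leaderIndex);   // control flow
        if (cp.globalCheckpoint() > lastSyncedSeqNo) {
            List<byte[]> ops = remoteStore.readOperations(     // data flow
                cp.remoteTranslogPath(), lastSyncedSeqNo + 1, cp.globalCheckpoint());
            // apply ops to the follower index (omitted)
            lastSyncedSeqNo = cp.globalCheckpoint();
        }
    }
}
```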

krishna-ggk commented 2 years ago

Thanks for expanding @Bukhtawar .

Yes, agree on the benefits of directly querying the remote store. Like you pointed out, the security model for the data flow would be key. The abstraction needs to support varied stores with different permission models. This also raises back @ankitkala's question on whether we need to expose the translog type (Remote/None/Local) or whether we can find a model that works across all of them (the latter preferred, of course).

ankitkala commented 2 years ago

The way I envision this happening is by decoupling the control flow and the data flow. The control flow can still be the way it exists today with the existing security model, like the follower fetching the leader's checkpoints, while the data flow would latch on directly to the remote store with a new data security model. The advantage with this is that we retain most of what exists today while not overloading the network bandwidth on the leader cluster.

Yep. This completely aligns with how I'm also thinking about security for fetching directly from the leader's remote store (a similar approach applies for Segments as well).


This also raises back @ankitkala's question on whether we need to expose the translog type (Remote/None/Local)

I'm slightly inclined towards having the translog type exposed but want to see inputs from @Bukhtawar first. One benefit this provides is that CCR can deterministically figure out the mechanism needed to fetch the operations (sketched after the options below).

In case we don't want to expose the translog type from the TranslogManager interface, then I see these 2 potential options:

  1. The responsibility will lie with the CCR plugin to figure out (via Settings and a FeatureFlag; not preferred) whether translogs are local/remote/none and operate accordingly.
  2. Alternatively, the logic to fetch from the leader's remote store can reside in the PluggableTranslog, as CCR would just rely on Pluggable Translogs instead (unless we're thinking about extensibility in PluggableTranslog, which might be overkill IMHO).
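For contrast, a hypothetical sketch of the exposed-type direction, where CCR branches deterministically on what the translog layer reports. The TranslogType enum is the same assumption as in the earlier sketch, and getTranslogType is an assumed accessor, not part of the actual TranslogManager interface.

```java
// Assumed enum and accessor; not the actual TranslogManager API.
enum TranslogType { LOCAL, REMOTE, NONE }

interface TranslogTypeAware {
    TranslogType getTranslogType();
}

final class FetchMechanismResolver {
    // With the type exposed, CCR can pick the fetch mechanism directly,
    // without inspecting settings or feature flags on its own.
    static String resolve(TranslogTypeAware manager) {
        switch (manager.getTranslogType()) {
            case LOCAL:
                return "fetch from the leader's primary/replica translogs";
            case REMOTE:
                return "fetch directly from the leader's remote translog store";
            case NONE:
                return "reject: leader keeps no translogs to replay from";
            default:
                throw new IllegalStateException("unknown translog type");
        }
    }
}
```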
Bukhtawar commented 2 years ago

Eventually we'll need these 2 additional handlings on the CCR side: only do load balancing if the leader has a local translog; throw a validation failure if the leader index has opted for no durability (i.e. no translogs).

Can you please confirm whether we still need to fetch operations from the translog on the leader cluster if segrep is enabled? Can we totally avoid the call in that case?

rohin commented 2 years ago

What is the user experience going to be? Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.

Currently, CCR leverages logical replication to copy data from the leader to the follower index. To do so, it fetches the operations on the leader index from the translog.

While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog opens the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.

The ability to fetch operations directly from the translog manager allows us to make CCR agnostic both to the replication mechanism used inside the cluster, as we also add support for segment replication, and to the location of the translog.

ankitkala commented 2 years ago

Eventually we'll need these 2 additional handlings on the CCR side: only do load balancing if the leader has a local translog; throw a validation failure if the leader index has opted for no durability (i.e. no translogs).

Can you please confirm whether we still need to fetch operations from the translog on the leader cluster if segrep is enabled? Can we totally avoid the call in that case?

If CCR is using SegRep, we won't even fetch operations from the translog. But if CCR is on logical replication (and the leader on SegRep), we'd still fetch the operations from the translog.
By default, we are not thinking of giving this as an option to the customer. But the intention is to keep the support for logical CCR intact even if SegRep becomes the only option for local replication.

ankitkala commented 2 years ago

We've gone ahead with the changes with an assertion that we'd not fetch the changes from the pluggable translog if the leader is using SegRep or Remote Store. This should help in reducing the combinations that we'll be supporting. If we later want to allow a particular combination, we can enable it explicitly.
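A minimal sketch of that guard, assuming hypothetical flags for how CCR learns the leader's configuration (the flag plumbing is not actual plugin code):

```java
// Illustrative guard only; the boolean flags are assumptions, not real settings.
final class TranslogFetchGuard {
    static void ensureTranslogFetchAllowed(boolean leaderUsesSegRep, boolean leaderUsesRemoteStore) {
        // Mirror of the assertion described above: never serve a translog-based
        // changes snapshot when the leader uses SegRep or a remote store.
        assert !leaderUsesSegRep && !leaderUsesRemoteStore
            : "Changes must not be fetched from the pluggable translog when the "
              + "leader index uses segment replication or a remote store";
    }
}
```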