Use pull based model for Segment Replication

ankitkala commented 1 year ago

Pull based model for Segment Replication We're working on using the existing implementation of Segment replication for cross cluster replication. We've identified 2 major areas in the existing Segment Replication design which doesn't work well for cross cluster replication. Technically we can get around these by overriding those part of implementation just for CCR, but wanted to check if it is possible to maintain a single solution that works for both cases.

Main crux of the issue is that current approach relies on a 2 way communication where primary can talk to replica and the vice versa. This doesn't work well for cross cluster use cases as we establish uni-directional connections where follower polls the leader and get all the data. We did explore whether we can create bi-directional cross cluster connections but decided to go against it.

Why not bi-directional connection for CCR? Pros:

Can re-use Segment Replication without additional changes.
Can allow follower to react to changes on leader (instead of polling)(e.g. listen to leader index events like close). But might put additional overhead on leader to communicate with followers (specially with multiple followers).

Cons:

Breaking change from the existing connection mechanism which is currently used for cross-cluster search as well as replication.
Bidirectional connection makes version upgrades tricky.
- With current unidirectional connection, we recommend the users to upgrade follower first to ensure that CCR doesn't break during upgrade. With introduction of bi-directional connections, both clusters can be leader for different indices which will result in cyclic dependencies. Thus user can't upgrade the clusters without halting the replication.
From CCR long term perspective, it doesn't add substantial value for us to justify a major overhaul of the current design. IMHO, it'd be much easier to modify the Segment Replication design unless there is a strong reason not to do so.

These are the 2 changes that would align the existing Segment Replication implementation with CCR. Kudos to the folks working on the original design who ensured such implementation swap can be done easily without modifying other other sub-components.

1. Use a pull mechanism for downloading the segments (instead of push) Current implementation for Segment Replication relies on the peer recovery constructs where it pushes the segments from primary's node to replica's node using the MultiChunkTransfer by invoking write chunk on the replica's node. We can change the direction here where replica requests each chunk from leader and stores locally. This might require some additional book-keeping on the replica side but should be do-able. We're already doing something similar for CCR where we fetch the segment from leader cluster by treating it as a snapshot repository here Also, this align with the remote store integration where replica would need to pull the segments from primary's remote store.

2. Use a polling mechanism on replica instead of listeners on primary refresh Instead of listening to refreshes on Primary, we can model it as a periodic polling where each replica polls for the latest checkpoint and trigger the replication event. The change in itself should be simple to do but we might want to do a benchmark comparison to ensure that there is no regression.

mch2 commented 1 year ago

@ankitkala Thanks for raising this. Imo the right idea long-term is to let users configure the behavior based on need & add support for a pull model. However, with segrep having two sources of copy in the near future (primary & remote store), is it worth adding a pull model with the primary only implementation of CCR now or should we start with CCR+segrep support only with a remote store?

Bukhtawar commented 1 year ago

Thanks @mch2 I think we should also think about how the interactions would look like if we just had remote store integration with primary(even outside CCR use case). Wouldn't that inherently be a pull based mechanism, where replicas poll for checkpoints and download segments diffs. So I think pull based mechanism will inherently be used even in local cluster with remote store. Now the only place we would be doing a push is in-cluster segrep with a local store. That's something we need to analyse if there are any known implications of changing that to pull. Let me know if we've done a similar analysis to conclude one way or the other

gbbafna commented 1 year ago

How are we envisioning OpenSearch users to use the SegRep feature :

in-cluster segrep with a local store.
in-cluster segrep with remote store

If the ~first~second one is the way forward for most of them , we might not need to solve for problems like https://github.com/opensearch-project/OpenSearch/issues/4245

ankitkala commented 1 year ago

is it worth adding a pull model with the primary only implementation of CCR now or should we start with CCR+segrep support only with a remote store

@mch2 For CCR, SegRep with remote store is definitely the first choice for us. Whether we want to support only one combination or both needs to be decided(will bring this up during feature brief). But if we want to keep the option to support segrep without remote store open, it might be worthwhile to do these refactors early on.

For local SegRep, are there any pros of not relying on remote store? The only reason i can think of is the additional effort required to setup the repository which might affect the adoption.

ankitkala commented 1 year ago

How are we envisioning OpenSearch users to use the SegRep feature :

in-cluster segrep with a local store.

in-cluster segrep with remote store

If the first one is the way forward for most of them , we might not need to solve for problems like #4245

Isn't it other way round? With remote store, we won't have to solve for the network bandwidth.

gbbafna commented 1 year ago

@ankitkala : Thanks , corrected my comment.

For local SegRep, are there any pros of not relying on remote store? The only reason i can think of is the additional effort required to setup the repository which might affect the adoption.

Most of the users already use snapshot for durability. So setting up repository shouldn't be a hindrance IMO. Cost might be a reason not to rely on remote store . @mch2 @Bukhtawar would know better on this .

I prefer that there are few modes for OpenSearch to operate for ease of operations and maintenance.

ankitkala commented 1 year ago

Reviving this discussion again since now we've been thinking towards segments and remote store integration. To disambiguate the discussion, the directionality(push or pull) can be thought either in terms of data flow or control flow.

Control flow Control flow would determine the trigger point for replicas to fetch the changes.

Pull based A pull based approach would rely on a polling mechanism where replica keeps polling(leader node or remote store) for new changes. Pros:

Complete reader write separation.

Cons:

Excessive Polling can be an issue if there are large number of shards. Similarly, we'd also be polling redundantly for primaries which aren't taking any active write traffic.

Push based A push based approach would imply using event driven mechanism where primary shard would notify the replica after the new segments are available(i.e. refresh incase of local segrep and segemnt uploaded to remote store for segrep with remote store).

Pros:

Simplified and deterministic behaviour for when the replica fetches the new changes.
Not a one way door. We can easily flip to a pull if required in future.

Cons:

Replica can fall behind if it can't process(or misses) a particular notification. Though not ideal, this should automatically resolve with next refresh in most cases. Another variant of such failure is what was observed here with relocation and wait_until requests.
Not complete separation of reader/writer. The interaction between primary(writer) and reader(replica) is minimal though(with remote store) and definitely can be fixed eventually (for example with a messaging queue).

Data flow: For data flow,

Pull based model aligns with segment replication using remote store where replica directly downloads the segments from remote store.
For local segrep cases, the current approach is push based which is how the peer recovery is implemented as well. It can be moved to a pull based approach but there aren't any benefits IMO.

Compatibility with CCR: Talking about the primary use case which is using segment replication with remote store, CCR can keep parity with the approach for local segrep with remote store, where primary sends a notification to the replica/follower and then replica/follower downloads the new segments from remote store.

One major blocking issue can the the bi-directional nature of the communication between leader and follower. As of now, we've not been thinking about CCR without remote store, so the data flow of local segrep shouldn't be an issue.

Just to conclude, we should still keep the control flow a push based mechanism. We'll need additional handling for cases like wait_until though. Data flow for segrep with remote store will be a pull based mechanism. For local segrep cases, we can explore whether we want to align it with segrep + remote implementation but making it pull, but i don't see enough ROI and is definitely not urgent to picked up.

anasalkouz commented 1 year ago

Closing this, we don't see a need for this at the moment.

Bukhtawar commented 2 weeks ago

Re-opening this as we want to evaluate and streamline mechanisms across local and remote replicas

opensearch-project / OpenSearch

Use pull based model for Segment Replication #4577