opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature] Cross Cluster Replication based on segment replication #3020

Open krishna-ggk opened 2 years ago

krishna-ggk commented 2 years ago

Is your feature request related to a problem? Currently CCR implements logical document replication, wherein operations from the translog are replayed onto the follower cluster. However, with segment replication coming into OpenSearch core, basing CCR on it could bring significant benefits such as faster replication and reduced memory/CPU usage on the follower.
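To make the distinction concrete, here is a minimal sketch (illustrative only, not OpenSearch code) contrasting the two models: logical document replication replays each translog operation on the follower and pays the indexing cost twice, while segment replication copies the leader's immutable segment files past a checkpoint with no re-indexing. All class and field names here are hypothetical.

```python
# Illustrative model only: not actual OpenSearch/CCR classes.

class DocRepFollower:
    """Logical replication: re-execute each translog operation on the follower."""
    def __init__(self):
        self.docs = {}

    def replay(self, translog_ops):
        for op, doc_id, source in translog_ops:
            if op == "index":
                self.docs[doc_id] = source   # re-parse and re-index on the follower
            elif op == "delete":
                self.docs.pop(doc_id, None)


class SegRepFollower:
    """Physical replication: copy already-built segment files, no re-indexing."""
    def __init__(self):
        self.segments = {}

    def sync(self, leader_segments, local_checkpoint):
        # Only fetch segment generations created after the follower's checkpoint.
        for gen, files in leader_segments.items():
            if gen > local_checkpoint:
                self.segments[gen] = files
        return max(self.segments, default=local_checkpoint)
```

In the segrep model the follower's CPU cost is file transfer plus a checkpoint comparison, which is where the hoped-for memory/CPU savings come from.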

What solution would you like? Implement CCR based on segment replication.

What alternatives have you considered? Haven't really considered any, but there is an RFC for a pluggable translog. It would be good to investigate whether there are strong reasons to keep logical replication (instead of physical segment replication) and, if so, whether basing CCR on the pluggable translog has merit.

Do you have any additional context? We need to ensure the RFC considers the following:

  1. Can it support filtered replication? (It seems we may be able to leverage FilterLeafReader.) Is it required?
  2. Can an external durable pluggable store like S3 be used instead of directly replicating?
  3. How would mapping changes etc be propagated?
  4. Would there be a migration path from previous implementation?

There is also a ton of useful discussion on this topic in #2482

cc: @nknize @mikemccand

krishna-ggk commented 2 years ago

Continuing feedback from OpenSearch-2482

Ability to filter out documents to be replicated

Just dropping some thoughts out loud on this one:

This sounds like a good use-case fit for the streaming index api (future home for the xlog and user defined durability policies). One idea is to have CCR filters in the streaming index api route copies of indexing requests to the follower cluster if the filter is matched. Another idea: there could be co-located IndexWriters in the streaming index api of the leader cluster that filter the documents and flush follower segments, to be replicated by segrep of the follower cluster only.

In the former case you're using a form of docrep for inter-cluster replication but paying the penalty for additional network traffic (and possible failures). In the latter you're parallelizing the writes on the leader node and using segrep for inter and intra-cluster replication. Incidentally this also provides the "fail early" protection we have today by indexing in memory and only replicating when flushing.
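The first of the two ideas above can be sketched as a filter sitting in the indexing path that forwards a copy of each matching request to the follower cluster. This is a hypothetical illustration of the routing shape and its per-document network cost; `CcrFilterRouter` and `forward` are invented names, not real APIs.

```python
# Hypothetical sketch of docrep-style filtered CCR: a filter in the leader's
# indexing path forwards copies of matching requests to the follower cluster.

class CcrFilterRouter:
    def __init__(self, filter_predicate, follower_client):
        self.matches = filter_predicate
        self.follower = follower_client

    def on_index_request(self, doc):
        if self.matches(doc):
            # One extra network hop (and possible failure) per matching
            # document: the cost nknize points out for this approach.
            self.follower.forward(doc)


class FakeFollower:
    """Stand-in for a remote follower cluster client."""
    def __init__(self):
        self.received = []

    def forward(self, doc):
        self.received.append(doc)
```

The co-located IndexWriter alternative would instead pay that cost once per flush, shipping whole filtered segments rather than individual documents.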

Interesting thought @nknize. Docrep seems to keep the implementation simple, while segrep will likely require maintaining two flavors of segments (one for the leader's own use, the other for the follower). There could be even more flavors of segments to maintain on the leader if there are more followers, plus the corresponding bookkeeping. Essentially, building filtering support with segrep could trade off storage and complexity.

Lucene has a powerful FilterLeafReader that works really well for this.

This is true! I wonder then, w/ segment replication is there even a need for CCR? Couldn't we just snapshot in the leader cluster and restore in the follower and use FLR on the follower to filter unneeded docs? Is there a use case I'm missing where CCR is needed at all if we're using a segment level replication design? Seems more appropriate for document level.
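The read-time filtering idea can be modeled like this: replicate all segments as-is and wrap the follower's reader so that disallowed documents are suppressed at search time, in the spirit of Lucene's `FilterLeafReader`. This is a toy Python model of the concept, not Lucene's actual API; the class names are illustrative.

```python
# Toy model of follower-side read-time filtering, in the spirit of Lucene's
# FilterLeafReader: segments arrive unmodified, and a wrapping reader hides
# documents the follower is not allowed to see.

class SegmentReader:
    """Stand-in for a reader over a replicated segment."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        return [d for d in self.docs if query(d)]


class FilteringReader:
    """Wraps a reader; suppresses docs failing the visibility predicate."""
    def __init__(self, reader, visible):
        self.reader = reader
        self.visible = visible

    def search(self, query):
        return [d for d in self.reader.search(query) if self.visible(d)]
```

Note the tradeoff this implies: the follower still stores (and transfers) the filtered-out documents, which is what distinguishes read-time filtering from producing filtered segments on the leader.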

+1 to explore this.

Let the cloud storage layer (S3 or whatever) handle the geo-distribution of the bits. Snapshot in one geo and restore in the other.

@nknize @mikemccand

Leveraging cloud storage to durably replicate at scale is definitely a great option to evaluate. In fact, I don't see this as all-or-nothing. The underlying implementation can leverage cloud storage instead of directly replicating segments from nodes.

The reason we would still need a CCR layer is mainly to have a hot standby cluster ready to take over with guaranteed assurances when disaster strikes, and to simplify CCR management via APIs. Underneath, cloud storage can be an intermediary to avoid a direct dependency. Further, CCR also takes care of replicating metadata and aliases, and provides APIs that expose stats, facilitate index-level replication, etc.

It would be interesting to investigate a continuously replicated "cold replica", wherein cloud storage has everything ready in the follower's geography up to the point in time the disaster struck. This could address cost concerns at the cost of increased time to recovery.
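The cold-replica idea above can be sketched as a two-step protocol: the leader continuously uploads segment files plus a checkpoint marker to object storage local to the follower's geography, and the follower restores everything up to the last uploaded checkpoint only at failover time. The `ObjectStore` here is a hypothetical in-memory stand-in for S3 or similar; no real storage API is implied.

```python
# Sketch of a continuously replicated "cold replica" via an object store.
# ObjectStore is an in-memory stand-in for S3 or a similar service.

class ObjectStore:
    def __init__(self):
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data

    def list(self, prefix):
        return {k: v for k, v in self.blobs.items() if k.startswith(prefix)}


def leader_upload(store, index, checkpoint, segment_files):
    """Continuously invoked on the leader: push new segments, then the marker."""
    for name, data in segment_files.items():
        store.put(f"{index}/{checkpoint}/{name}", data)
    # Written last so a restore never sees a half-uploaded checkpoint.
    store.put(f"{index}/checkpoint", checkpoint)


def follower_restore(store, index):
    """Invoked only at failover: recover up to the last completed checkpoint."""
    checkpoint = store.blobs[f"{index}/checkpoint"]
    return checkpoint, store.list(f"{index}/{checkpoint}/")
```

Because the follower does no work until failover, the recurring cost is storage and cross-geo transfer only, with the recovery time shifted entirely to the restore step, which is exactly the cost/RTO tradeoff described above.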

ankitkala commented 2 years ago

Just to add, we'll also track the changes for CCR as core component as part of this issue: https://github.com/opensearch-project/OpenSearch/issues/2872