opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Segment Replication #1123

Closed anasalkouz closed 2 years ago

anasalkouz commented 3 years ago

Propose an improved shard replication strategy (segment replication) and outline the advantages it offers over the current method (document replication).

muralikpbhat commented 3 years ago

Thanks for creating this. I would like to start with initial requirements/benefits of segment level replication.

OpenSearch today uses document level replication, which synchronously writes each document to the translog on the replica node. While this is needed for durability and high write availability, it roughly halves the indexing throughput of the cluster and is expensive (in cost and performance) for customers who don't need the second copy for querying. The synchronous communication between primary and replica also adds failure points; we have seen a single slow replica node cause cluster-wide impact.

Lucene supports segment replication (its NRT replication framework), but OpenSearch has not adopted it. OpenSearch provides transaction support above Lucene, which currently relies on the translog for durability, and that translog has to be written synchronously to the replica node. One approach to the throughput problem is to copy segments asynchronously to the replica node and later drop the synchronous translog write; the translog would then be used only for recovery in case of primary failure. However, this does not remove the failure mode between the primary and replica nodes, since the translog still has to be written. There are also concerns about data being copied twice (translog + segments), and about segments being copied again after merges, although for many clusters the network is not the bottleneck during indexing. There are further considerations around shard allocation (currently primary and replica get the same weight) and a refresh tradeoff when using segment replication.
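
To make the segment-copy flow described above concrete, here is a minimal sketch of a replica pulling new segment files from its primary after a refresh. All of the types below (SegmentCheckpoint, PrimaryShardClient, ReplicaStore) are hypothetical stand-ins, not actual OpenSearch or Lucene APIs.

```java
// Hypothetical sketch of the replica-side segment copy step; none of these
// types exist in OpenSearch, they only illustrate the flow described above.
import java.util.List;

public final class SegmentReplicationSketch {

    /** Metadata the primary publishes after each refresh/commit. */
    record SegmentCheckpoint(long generation, List<String> segmentFiles) {}

    interface PrimaryShardClient {
        SegmentCheckpoint latestCheckpoint();
        byte[] fetchSegmentFile(String fileName);
    }

    interface ReplicaStore {
        boolean hasFile(String fileName);
        void writeFile(String fileName, byte[] bytes);
        void openReaderAt(long generation); // make the new point-in-time searchable
    }

    /** Replica-side step: pull only the files it is missing, then refresh. */
    static void syncOnce(PrimaryShardClient primary, ReplicaStore store, long lastSeenGeneration) {
        SegmentCheckpoint checkpoint = primary.latestCheckpoint();
        if (checkpoint.generation() <= lastSeenGeneration) {
            return; // nothing new on the primary
        }
        for (String file : checkpoint.segmentFiles()) {
            if (!store.hasFile(file)) {
                store.writeFile(file, primary.fetchSegmentFile(file));
            }
        }
        store.openReaderAt(checkpoint.generation());
    }
}
```

The key property is that the replica never re-indexes documents; it only copies segment files that already exist on the primary.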

Some systems built above Lucene use a different approach: instead of a translog between the nodes, they keep the documents in an external queue for durability and delete them only after the segments are persisted on both nodes. This avoids the need for inter-node resiliency in the synchronous path, but introduces the requirement of an external queue. In such systems the documents are not durable inside the search system itself, and clients have to re-drive from previous checkpoints, which is cumbersome. Another approach is to store the segments and translog from the primary in durable storage outside of OpenSearch, from which the replica nodes can read.

Just listing the top-level requirements that come to mind:

  1. Double the indexing throughput of the cluster (by avoiding document indexing on the replica node)
  2. No impact to the ingestion latency
  3. Automatic failover in case of primary failure, with an MTTR of 30 seconds
  4. Durability even in case of multiple node losses or replica=0 cases (a nice feature if we have remote storage)
  5. Peer recovery from remote storage to avoid the overload on peer
  6. Point in time restore support from remote storage instead of periodic snapshot
  7. Cross cluster replication to leverage this segment replication
  8. Avoid synchronous node to node communication in the ingestion critical path
  9. Isolation/correctness guarantees (if using shared storage)

itiyamas commented 3 years ago

Why is "Avoid synchronous node to node communication in the ingestion critical path" a requirement/benefit of segment level replication?

For durability and high availability, the system has to write data to multiple nodes; otherwise, if a node goes down, the data is lost. Even in the approaches where the log is written to a durable store outside of the OpenSearch node (whether before the request reaches the node or later), that durable store will internally write to multiple nodes in the synchronous path. We should be clear that we are offloading that problem to another system instead of solving it in OpenSearch. Once durability is offloaded to a different system, an OpenSearch indexing node can either build the index independently on each replica (analogous to document level replication) or build it in one place and copy the segments to the other nodes (segment replication).

I think the problem you are referring to is that a single bad node results in slow ingestion. Why don't other durable systems have this problem? Because they use quorum-based writes instead of synchronously writing to all replicas; even if a single node is bad, it is naturally left out of the quorum. If we implemented quorum-based writes, the slow-node issue would go away irrespective of the replication mechanism. Implementing this in OpenSearch is not trivial and should be an independent change.
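
As a purely illustrative sketch (not how OpenSearch works today), the quorum idea above amounts to acknowledging a write once W out of N replicas respond, so a single slow node stays off the acknowledgement path. ReplicaChannel is a hypothetical interface.

```java
// Illustrative only: acknowledge a write once `writeQuorum` replicas respond,
// so one slow or bad replica does not stall ingestion.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class QuorumWriteSketch {

    interface ReplicaChannel {
        CompletableFuture<Void> replicate(byte[] document);
    }

    static CompletableFuture<Void> write(List<ReplicaChannel> replicas, byte[] doc, int writeQuorum) {
        CompletableFuture<Void> quorumReached = new CompletableFuture<>();
        AtomicInteger acks = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        int maxFailures = replicas.size() - writeQuorum; // more than this and the quorum is unreachable

        for (ReplicaChannel replica : replicas) {
            replica.replicate(doc).whenComplete((ignored, error) -> {
                if (error == null) {
                    if (acks.incrementAndGet() >= writeQuorum) {
                        quorumReached.complete(null); // quorum met; slow replicas finish in the background
                    }
                } else if (failures.incrementAndGet() > maxFailures) {
                    quorumReached.completeExceptionally(error);
                }
            });
        }
        return quorumReached;
    }
}
```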

Another requirement we should add: should segment replication be supported with the same set of indexing APIs, or do we want to introduce new asynchronous APIs that are better suited to this replication model?

In the approaches where durability is pushed to an external queue, the system becomes asynchronous and clients cannot use the same synchronous API integrations they use with document replication. For example, validating whether a document exists, or whether an update can be applied to it, happens in the synchronous path of the bulk APIs. Similarly, a checkpoint and replay mechanism has to be built on top of the durable store to re-drive missed transactions. Settling this requirement will make the decision on the transaction log implementation easier.
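
A hedged sketch of the checkpoint-and-replay mechanism mentioned above, assuming a durable external log addressed by sequence number; the DurableLog and Indexer interfaces are hypothetical.

```java
// Hypothetical sketch: re-drive every log entry the indexer has not yet
// durably persisted as segments. Not an existing OpenSearch API.
import java.util.Iterator;

final class CheckpointReplaySketch {

    record LogEntry(long seqNo, byte[] document) {}

    interface DurableLog {
        Iterator<LogEntry> readAfter(long seqNo); // entries strictly after the checkpoint
    }

    interface Indexer {
        void index(byte[] document);
        long lastPersistedSeqNo(); // highest seqNo whose segments are durable
    }

    static void replay(DurableLog log, Indexer indexer) {
        long checkpoint = indexer.lastPersistedSeqNo();
        Iterator<LogEntry> missed = log.readAfter(checkpoint);
        while (missed.hasNext()) {
            indexer.index(missed.next().document());
        }
    }
}
```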

That being said, I like the approach where the log and the indexer are decoupled, as it helps with high write availability. The log can be replicated on multiple machines (via an external store). A single indexer node reads the log from the external store, builds the index, and replicates it to other replicas (or to a remote store). When the indexer goes down, clients can continue writing to the log until the indexer node comes back up. In approaches where a single node manages both the log and the Lucene index, client writes are impacted when that indexing machine goes down, because the other copies need to rebuild the index via translog replay before accepting more writes, degrading the failover experience compared to document level replication.

Another thing I like about pushing the log to an external store is that the log read path is continuously exercised by the indexer, as opposed to approaches where the recovery path is exercised only when a node goes down, which introduces bimodal behaviour in the system.

Rebalance considerations need to be worked out in detail.

In the document replication model, a replica continues to consume the same resources after failover, whereas in the segment replication model a replica's resource consumption increases abruptly when it becomes primary. In an ideal world with optimal resource usage, a replica will be placed on a node that budgets fewer resources for it than it needs after it switches over. When failover happens, the replica suddenly becomes primary and starts consuming more resources, with no reduction in the resource consumption of the other shards on the same node. The node then receives much more load than it can handle and starts rejecting requests. This is unavoidable in segment replication because of the abrupt change in resource consumption and the reduced availability of system resources during node drops. Logical (document) replication, in contrast, continues to operate with the same performance characteristics when nodes drop. Even after capacity becomes available again, it may take a lot of data movement to rebalance shards.

nknize commented 3 years ago

There is a companion issue to discuss a "pluggable durability framework" so we can refactor the translog implementation into a module. This would enable using alternative journals / write-ahead logs based on existing solutions such as Kafka, rather than saddling OpenSearch with the translog implementation. Here are some additional goals / requirements I'd like to see (a rough interface sketch follows the list):

  1. user selectable recovery options (translog, kafka, etc)
  2. extendable interfaces for community to implement new recovery mechanisms
  3. lean API surface area (keep the number of "knobs" at bay)
  4. separable search / index processes for "append only" type use cases (e.g., log analytics)
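
A rough sketch of what such a pluggable journal interface could look like; the name WriteAheadJournal and its methods are illustrative only, not an actual OpenSearch extension point.

```java
// Illustrative extension point: the existing translog and, say, a Kafka-backed
// journal would both implement this, selected per index (goals 1 and 2 above).
import java.io.Closeable;
import java.io.IOException;
import java.util.stream.Stream;

interface WriteAheadJournal extends Closeable {

    /** Durably append one operation and return its sequence number. */
    long append(byte[] operation) throws IOException;

    /** Stream operations after the given sequence number, for recovery. */
    Stream<byte[]> replayAfter(long seqNo) throws IOException;

    /** Discard operations at or below the given sequence number once the
     *  corresponding segments are safely persisted or replicated. */
    void trim(long seqNo) throws IOException;
}
```
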
muralikpbhat commented 3 years ago

The major challenge with segment level replication is how to retain the durability that the translog provides today. If we keep the current translog implementation, it doesn't solve all the requirements discussed above.

Current model: Client -> Coordinator -> Primary -> Replica (sync)

Below are the three options we thought through (there may be more), each with pros and cons:

  1. Replace the current translog with a durable translog: Client -> Coordinator -> Primary -> Shared Persistent Translog. The replica copies segments from the primary and recovers from the shared translog in case of primary failure (see the sketch after this list).
     a. Doesn't change the current API behaviour and guarantees.
     b. Primary failover will be slow since the log has to be replayed, which impacts write availability for update or lookup cases.

  2. Introduce a persistent stream in front of the primary: Client -> Coordinator -> Shared Persistent Stream -> Primary. The primary pulls docs and creates the segments. The replica copies segments from the primary and replays from the persistent stream in case of primary failure.
     a. Decouples customer writes from indexer failures; ingestion can continue (no impact to availability).
     b. API behaviour changes since the docs might still be in flight.
     c. Updates are hard since the previous doc might still be in the stream. The update API needs to be asynchronous since validations, OCC, etc. cannot be done synchronously.

  3. Push the durability responsibility to the client side: Client -> Persistent Stream -> Coordinator -> Primary. The replica copies segments from the primary. The client uses another API to check whether docs are durable and re-drives in case they are not.
     a. The bulk API remains the same, but more APIs are needed for replay and checkpoint functionality.
     b. Checkpointing has to happen in the client system, which needs to know when and from what point to replay.
     c. Updates have to be handled asynchronously.
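
To illustrate option 1 specifically, here is a minimal sketch of its write and failover paths, assuming a shared durable translog addressed by sequence number; DurableTranslog and LuceneWriter are hypothetical names.

```java
// Hypothetical sketch of option 1: the primary appends to a shared durable
// translog before indexing locally; replicas copy segments asynchronously and
// a promoted replica replays the shared log on failover (hence the slower
// failover noted in 1b).
final class SharedTranslogOptionSketch {

    interface DurableTranslog {
        long append(byte[] operation);            // durable outside the primary node
        Iterable<byte[]> replayAfter(long seqNo); // used only on failover
    }

    interface LuceneWriter {
        void index(byte[] operation);
    }

    /** Write path: durable log first, then local indexing; no synchronous replica hop. */
    static long onWrite(DurableTranslog translog, LuceneWriter writer, byte[] op) {
        long seqNo = translog.append(op);
        writer.index(op); // segments produced here are copied to replicas asynchronously
        return seqNo;
    }

    /** Failover path: the promoted replica replays operations it has not yet seen as segments. */
    static void onFailover(DurableTranslog translog, LuceneWriter writer, long lastCommittedSeqNo) {
        for (byte[] op : translog.replayAfter(lastCommittedSeqNo)) {
            writer.index(op);
        }
    }
}
```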

muralikpbhat commented 3 years ago

Irrespective of which option we decide on, we will have the following work tracks in common (additional tracks will be added based on the decision):

  1. Segment replication - informing the replica nodes of changes on the primary, and the replica nodes pulling the segments
  2. Extension point for configuring the source of segments (primary / any peer vs. remote storage); a rough interface sketch follows this list
  3. Peer recovery - changes to ensure it doesn't rely on translog
  4. Handle segment merges - a merge should trigger the segment copy again and the corresponding deletion of older segments.
  5. Translog abstraction - option to disable, or configure in-cluster or outside cluster translog [#1277]
  6. Shard rebalancing/allocation - the primary does more work than the replica, so the current balancing that OpenSearch does based on shard count alone has to change.
  7. CCR replication to use this segment replication [Not a prerequisite]
  8. Snapshot PITR [Not a prerequisite]
  9. Cloud specific extensions for translog [Not a prerequisite]
  10. Cloud specific extensions for remote replication [Not a prerequisite]
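
For track 2, a speculative sketch of the segment-source extension point (names are illustrative only, not an actual OpenSearch interface):

```java
// Illustrative only: lets the replica fetch segment files from a configurable
// source. Possible implementations: copy from the primary, from any peer, or
// from a remote store.
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

interface SegmentSource {

    /** Files that make up the latest replication checkpoint. */
    List<String> listFiles(long generation) throws IOException;

    /** Open one segment file for copying onto the local shard. */
    InputStream openFile(String fileName) throws IOException;
}
```
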
kartg commented 2 years ago

RFC published as #1694