opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Introduce PULL replicas with segrep #11819

Open Bukhtawar opened 8 months ago

Bukhtawar commented 8 months ago

Is your feature request related to a problem? Please describe

With a large cluster setup and a replication group containing many read replicas, no-op replication can become an overhead. Typically we need replica shards for write availability and pull-replica shards for read availability. So essentially we can have these variants:

  1. Primary shard - Uploads translogs/segments to S3 and publishes the replication checkpoint
  2. Replica shard - Undergoes no-op replication and can be promoted to primary on failover, while continuing to receive replication checkpoints and download segments
  3. Pull-replica shard - Doesn't undergo no-op replication and cannot be promoted to primary on failover, but continues to receive replication checkpoints and download segments
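The three variants above can be summarized as a small role table. The sketch below is illustrative only (the class and method names are invented for this discussion, not OpenSearch code):

```python
from dataclasses import dataclass
from enum import Enum

class ShardRole(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"        # failover-eligible, in the per-request no-op path
    PULL_REPLICA = "pull"      # segrep download only, never promoted

@dataclass
class ShardCopy:
    node: str
    role: ShardRole

    def acks_writes(self) -> bool:
        # Only the primary and failover replicas participate in per-request
        # no-op replication.
        return self.role in (ShardRole.PRIMARY, ShardRole.REPLICA)

    def promotable(self) -> bool:
        # Pull replicas are excluded from failover.
        return self.role == ShardRole.REPLICA

    def downloads_segments(self) -> bool:
        # All non-primary copies consume published checkpoints and pull segments.
        return self.role != ShardRole.PRIMARY
```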

Describe the solution you'd like

We don't need to perform no-op replication for operation-level durability on all copies; a limited subset of failover copies is enough to guarantee viable failover primaries.

Related component

Indexing:Replication

Describe alternatives you've considered

Separating readers and writers is one option, but it would require significant infra changes.

Additional context

No response

backslasht commented 8 months ago

Would the read replicas go out of sync in the absence of no-op replication? Any thoughts on how to keep them eventually in sync (with some delay)?

Bukhtawar commented 8 months ago

> Would the read replicas go out of sync in the absence of no-op replication? Any thoughts on how to keep them eventually in sync (with some delay)?

No, they won't. Publication of checkpoints would still go through on refreshes/segment creation, and with segrep the PULL replicas will download the new segments. If for some reason a copy isn't able to and its replication lag increases, we would fail the shard copy, same as the status quo.
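The "fail the shard copy when lag grows" behavior described above could look roughly like the following. This is a sketch under stated assumptions: the threshold constant, function, and callback names are hypothetical, not actual OpenSearch settings or APIs.

```python
MAX_REPLICATION_LAG_MILLIS = 300_000  # hypothetical threshold, not a real setting

def check_pull_replica(current_checkpoint: int, replica_checkpoint: int,
                       lag_millis: int, fail_shard) -> bool:
    """Return True if the copy is healthy; otherwise fail it (status quo with segrep)."""
    if replica_checkpoint >= current_checkpoint:
        return True  # fully caught up with the published checkpoint
    if lag_millis > MAX_REPLICATION_LAG_MILLIS:
        fail_shard("replication lag exceeded threshold")
        return False
    return True  # behind, but still within the allowed lag window
```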

backslasht commented 8 months ago

Sounds good. This is a good optimization that comes into play when the number of replicas is high; we should pursue it.

andrross commented 8 months ago

> no-op replication might become an overhead

What is the overhead here? My understanding is that this is a super lightweight message, with its only goal to ensure that the node sending the message is in fact still the primary. Could we achieve this optimization without introducing a new type of replica shard? I'm thinking of something like revisiting the "ensure I'm still the primary" check with something like "send no-op replication message to max(3, num_replicas) replicas" or "broadcast no-op replication to all replicas but continue once a quorum of responses have been received". These are off-the-top-of-my-head suggestions and may be very bad, but my broader point is whether we should focus the optimization on the no-op replication logic as opposed to introducing a new replica shard type.
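The "broadcast to all replicas but continue once a quorum of responses have been received" idea could be sketched like this. It is illustrative only; real OpenSearch replication is not structured this way, and `broadcast_noop`/`send_noop` are invented names:

```python
import concurrent.futures

def broadcast_noop(replicas, send_noop, quorum: int) -> bool:
    """Send the no-op to every replica, returning success once `quorum` acks arrive.

    Note: the executor still drains the remaining in-flight calls on exit;
    the point is only that the *caller* stops waiting at quorum.
    """
    acks = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(send_noop, r) for r in replicas]
        for fut in concurrent.futures.as_completed(futures):
            if fut.result():  # True means the replica acknowledged
                acks += 1
                if acks >= quorum:
                    return True
    return False
```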

linuxpi commented 7 months ago

Feels like a smart tradeoff to solve the search freshness lag with segrep. I am guessing we would be introducing a new search preference strategy as well, to ensure PULL replicas are preferred for search?
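Such a preference strategy might route searches to PULL replicas first and fall back to other copies. A minimal sketch, with names invented for illustration (this is not an existing OpenSearch search preference):

```python
def select_search_copies(copies):
    """Prefer PULL replicas for search; fall back to write replicas, then primary."""
    pull = [c for c in copies if c["role"] == "pull"]
    if pull:
        return pull
    replicas = [c for c in copies if c["role"] == "replica"]
    return replicas or [c for c in copies if c["role"] == "primary"]
```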

Bukhtawar commented 7 months ago

> What is the overhead here?

Every indexing request has to touch all copies, which couples write and read availability. With a large number of replicas, bulk latencies increase when any one copy goes slow. To improve tail latencies and reduce chattiness, we need a mechanism to distinguish PULL replicas from failover-eligible replicas, so that only the latter can get promoted to primary. Both the indexing and failover flows need to understand this protocol. I am all in for a simplified way to accomplish the same.
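The coupling can be seen with a toy model: if a bulk request must be acked by every copy, its latency is the maximum across copies, so a single slow copy dictates the whole request (illustrative numbers only):

```python
def bulk_latency(copy_latencies_ms):
    # The request completes only when the slowest synchronously-replicated
    # copy has acknowledged, so latency is the max over all acked copies.
    return max(copy_latencies_ms)

# Four healthy copies vs the same copies plus one slow outlier:
fast = [5, 6, 5, 7]
with_slow_copy = fast + [250]
```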