[RFC] Supporting existing indices migration to SegRep and Remote Store

gbbafna commented 1 year ago

This feature proposal is WIP. We will continue to add details to Sections that are marked with ToDo.

Goal

OpenSearch will be launching the remote store feature and has already GA'd Segment Replication (SegRep). However, this replication method and durability enhancement are only available for newer indices. The next step is to support migrating existing indices to SegRep with Remote Store enabled.

Requirements

Functional

Data integrity - No data loss due to the migration process itself.
High availability - Index should be available for writes and reads during the migration.
High durability - We should not regress on durability guarantees provided by existing configuration.
Failovers & Recovery - Continue as normal.

Non-Functional

Minimal disruptions to write - As we reload engines in place, we will need to hold off writes. We should strive to keep these pauses minimal.
No extra capacity requirement - We shouldn't need extra nodes/space for the migration.
Disruptions to read - Minimal disruptions to read capacity.

Non-Requirements

We are assuming the end state to be SegRep and Remote Store enabled, and not just SegRep enablement. This is just to reduce the modes we need to support to start with. We can provide migration to just SegRep as an incremental feature, which would reuse most of the components designed here.
SegRep to DocRep Migration - This is also not covered as part of this feature. However, it is also an incremental feature which could be considered later on if needed.

Potential Approaches

[Recommended]

Rolling restarts of replica copies

Here we restart replicas one by one. The challenge here is to make the primary understand both SegRep and DocRep. We will also need to store each replica's property durably. The primary will send checkpoint updates to SegRep-based indices and documents to DocRep-based indices.
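
A minimal sketch of that dispatch, assuming the primary keeps a per-replica replication mode. `ReplicationMode`, `Primary`, and the message tuples are hypothetical names for illustration, not OpenSearch APIs, and the sketch follows the wording above literally (documents only to DocRep copies, checkpoints only to SegRep copies):

```python
from dataclasses import dataclass, field
from enum import Enum


class ReplicationMode(Enum):
    DOCREP = "docrep"
    SEGREP = "segrep"


@dataclass
class Primary:
    # Hypothetical per-replica mode map; the proposal says this property
    # would have to be stored durably, which this toy model does not do.
    replica_modes: dict = field(default_factory=dict)
    checkpoint: int = 0

    def index(self, doc):
        """Return the messages an indexing operation sends to DocRep replicas."""
        return [(replica, "document", doc)
                for replica, mode in self.replica_modes.items()
                if mode is ReplicationMode.DOCREP]

    def refresh(self):
        """Advance the checkpoint and return the messages sent to SegRep replicas."""
        self.checkpoint += 1
        return [(replica, "checkpoint", self.checkpoint)
                for replica, mode in self.replica_modes.items()
                if mode is ReplicationMode.SEGREP]


# Mid-migration: replica-0 has already been restarted as SegRep, replica-1 has not.
primary = Primary({"replica-0": ReplicationMode.SEGREP,
                   "replica-1": ReplicationMode.DOCREP})
doc_messages = primary.index({"id": 1})
checkpoint_messages = primary.refresh()
```

The point of the sketch is only that the primary needs a reliable view of each copy's mode while the rolling restart is in progress.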

ToDo: Exploration of this approach is still ongoing.

Enabling Remote Store & Remote Translog followed by SegRep Enablement

We would support remote segment store and remote translog for DocRep indices. This will give us the ability to store data durably even when writing to only one copy of the data. The proposed migration steps would be executed in an FSM. Below are the proposed high-level details. More details will be covered in a separate issue.

Enable Remote Store and Remote Translog for DocRep:
1. Seed Remote Store first - This ensures a subsequent refresh doesn't time out.
2. Take all the permits on all primary shards.
3. Enable remote store and remote translog on the primary.
4. Release the permits to enable writes on all primary shards.
Enable SegRep:
1. Decrease replica count to 0.
2. Take all the permits on the primary.
3. Reload the primary engine as InternalEngine with SegRep and the Remote Store integration enabled.
4. Enable the writes.
5. Increase replica count to the previous value.
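
The two phases above can be sketched as a linear state machine. This is a toy model under stated assumptions, not the proposed implementation: `Step`, `Shard`, and `run_migration` are hypothetical names, and "taking all the permits" is modelled as a context manager that blocks writes and always releases them.

```python
from contextlib import contextmanager
from enum import Enum, auto


class Step(Enum):
    SEED_REMOTE_STORE = auto()
    ENABLE_REMOTE_STORE_AND_TRANSLOG = auto()
    DROP_REPLICAS = auto()
    RELOAD_PRIMARY_WITH_SEGREP = auto()
    RESTORE_REPLICAS = auto()
    DONE = auto()


class Shard:
    """Hypothetical stand-in for a primary shard; tracks only what the steps change."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.writes_blocked = False
        self.remote_store = False
        self.segrep = False
        self.log = []

    @contextmanager
    def all_permits(self):
        # Holding all permits blocks writes; they are released even on failure.
        self.writes_blocked = True
        try:
            yield
        finally:
            self.writes_blocked = False


def run_migration(shard):
    previous_replicas = shard.replicas
    for step in Step:  # Enum members iterate in definition order
        if step is Step.ENABLE_REMOTE_STORE_AND_TRANSLOG:
            with shard.all_permits():
                shard.remote_store = True
        elif step is Step.DROP_REPLICAS:
            shard.replicas = 0
        elif step is Step.RELOAD_PRIMARY_WITH_SEGREP:
            with shard.all_permits():
                shard.segrep = True
        elif step is Step.RESTORE_REPLICAS:
            shard.replicas = previous_replicas
        # SEED_REMOTE_STORE and DONE change no state in this toy model.
        shard.log.append(step.name)
    return shard
```

Writes are blocked only inside the two `all_permits()` sections, which is where the "minimal disruptions to write" requirement would apply.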

Alternative Approaches

Bringing new replica copies w/o remote store

The following steps could help migrate an index:

Set replica count to 0.
Take all the permits on the primary.
Reload the primary engine to do SegRep.
Release the permits to enable writes.
Set replica count back to the previous value.
Enable Remote Translog Store, followed by Remote Segment Store.

The con of this approach is a regression in durability and availability guarantees. While the new replicas are coming up, shards are left with only 1 copy.
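
To make the regression concrete, here is a small sketch that counts copies after each step. The step names and the model are assumptions for illustration only; in particular the model restores replicas instantly, whereas in practice a new replica takes time to come up, which makes the exposed window longer.

```python
def copies_per_step(steps, replicas):
    """Return (step, local_copies, has_remote_copy) after each step."""
    local, remote, saved = 1 + replicas, False, replicas
    timeline = []
    for step in steps:
        if step == "set_replicas_to_zero":
            local = 1
        elif step == "restore_replicas":
            local = 1 + saved
        elif step == "enable_remote_store":
            remote = True
        timeline.append((step, local, remote))
    return timeline


def single_copy_steps(timeline):
    """Steps after which the only copy of the data is the local primary."""
    return [step for step, local, remote in timeline if local == 1 and not remote]


# Ordering of this alternative: remote store is enabled last.
without_remote_first = ["set_replicas_to_zero", "reload_primary_as_segrep",
                        "restore_replicas", "enable_remote_store"]
# Ordering of the remote-store-first approach, for comparison.
remote_first = ["enable_remote_store", "set_replicas_to_zero",
                "reload_primary_as_segrep", "restore_replicas"]

exposed = single_copy_steps(copies_per_step(without_remote_first, replicas=1))
protected = single_copy_steps(copies_per_step(remote_first, replicas=1))
```

In this model the alternative ordering has steps where the primary is the only copy, while enabling remote store first leaves none.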

Using Remote Store for Async durability

Seed Remote Store first - This ensures a subsequent refresh doesn't time out.
Enable Remote Store for DocRep:
1. Take all the permits on all primary shards.
2. Enable remote store on the primary by reloading the engine.
3. Release the permits to enable writes on all primary shards.
4. Wait for all shards to.
Enable SegRep:
1. Decrease replica count to 0.
2. Stop the writes.
3. Reload the primary engine as InternalEngine with SegRep enabled.
4. Enable the writes.
5. Increase replica count to the previous value.
Enable Remote Translog.

Using Remote Translog Store for durability

We can’t just use Remote Translog for durability. It needs to be supplemented with Remote Segment Store. Hence this is not feasible.

Comparison

ToDo

Potential Issues

ToDo

Next Steps

POC to check feasibility of enabling Remote Translog and Remote Segment Store on DocRep based indices.
POC to create FSM based migration of indices while acquiring permits.
Deeper exploration of rolling restarts of replica copies.

mch2 commented 1 year ago

@gbbafna Thanks for writing this up! Couple thoughts:

We are assuming the end state to be SegRep and Remote Store enabled and not just SegRep enablement. This is just to reduce the modes we need to support to start with. We can provide migration to just SegRep as an incremental feature, which would reuse most of the components designed here.

It makes sense to start with node-node, but with the lower level components abstracting away the source of replication I think the complexity is mostly in configuration. How are you envisioning the conversion being initiated? We would likely need a new API here to go from DocRep -> SegRep w/ remote storage to properly update all settings.

SegRep to DocRep Migration

Until remote store + DocRep is supported as a standalone feature, I think it's reasonable that conversion from SegRep with remote store back to DocRep would remove remote store capabilities? With that said, I think it would be wise to support this first. If a user switches to SegRep and wishes to revert for whatever reason, the only option would be a reindex. Also, complexity-wise I think this would actually be a fairly trivial engine swap on replicas.

The challenge here is to make primary understand both SegRep and DocRep. We will also need to store replica's property durably. Primary will send checkpoint update to SegRep based indices and documents to DocRep based indices.

Currently we are sending all docs to SegRep based indices for durability. Are you referring to remote translog case?

In general, for DocRep -> SegRep I think the approach of rolling restarts of replica engines is the right one. I'd imagine we would need a full recovery here so that the shard is not serving stale reads until it catches up. It would be great to do this without triggering any reallocation or failing the shard, but I don't think that is something that exists today. An alternative here is to fetch the required segments from the primary's latest cp and write them to a separate directory, but this would likely not be feasible with disk constraints.

gbbafna commented 1 year ago

Thanks @mch2 for the review and feedback.

It makes sense to start with node-node, but with the lower level components abstracting away the source of replication I think the complexity is mostly in configuration. How are you envisioning the conversion being initiated? We would likely need a new API here to go from DocRep -> SegRep w/ remote storage to properly update all settings

Yes, the initial idea was an API which would trigger an FSM, and it might need to store the migration details in cluster state as well.
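
As a rough sketch of why persisting the details matters, assume a plain dict stands in for cluster state and the step names are placeholders. `start_migration` and `advance` are hypothetical functions for illustration, not an existing or proposed OpenSearch API:

```python
STEPS = ["seed_remote_store", "enable_remote_store", "enable_segrep"]


def start_migration(cluster_state, index):
    """Hypothetical API entry point: record the migration before doing any work."""
    cluster_state.setdefault("migrations", {})[index] = {"completed": []}


def advance(cluster_state, index, execute):
    """Run the next pending step; progress is recorded only after it succeeds."""
    done = cluster_state["migrations"][index]["completed"]
    pending = [step for step in STEPS if step not in done]
    if not pending:
        return None
    execute(pending[0])  # may raise; in that case nothing is recorded
    done.append(pending[0])
    return pending[0]


def failing(step):
    raise RuntimeError("node left the cluster")


state = {}
start_migration(state, "logs-2023")
advance(state, "logs-2023", execute=lambda step: None)

try:
    advance(state, "logs-2023", execute=failing)
except RuntimeError:
    pass  # the failed step stays pending

# A retry picks up at the step that failed instead of starting over.
resumed = advance(state, "logs-2023", execute=lambda step: None)
```

Because the completed steps live in the (simulated) cluster state, whichever node drives the FSM next can resume from the failed step.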

Until remote store + DocRep is supported as a standalone feature, I think it's reasonable that conversion from SegRep with remote store back to DocRep would remove remote store capabilities?

Yes.

With that said, I think it would be wise to support this first. If a user switches to SegRep and wishes to revert for whatever reason the only option would be a reindex. Also, complexity wise I think this would actually be a fairly trivial engine swap on replicas.

Agreed. Once we have all the details hashed out and the POC done, we might do this in the first phase as well.

Currently we are sending all docs to SegRep based indices for durability. Are you referring to remote translog case?

I am referring to the case where we are hydrating the replica from the primary's segments. Since this is a full recovery, it is going to take a good amount of time, so the solution is not durable for indices with 1 replica.

An alternative here is to fetch the required segments from primary's latest cp and write to a separate directory, but this would likely not be feasible with disk constraints.

This is what we explored as well. But due to disk constraints, we didn't list it out here.
