gbbafna opened 1 year ago
@gbbafna Thanks for writing this up! Couple thoughts:
> We are assuming the end state to be SegRep with Remote Store enabled, not just SegRep enablement. This is to reduce the number of modes we need to support to start with. We can provide migration to SegRep alone as an incremental feature, which would reuse most of the components designed here.
It makes sense to start with node-node, but with the lower-level components abstracting away the source of replication, I think the complexity is mostly in configuration. How are you envisioning the conversion being initiated? We would likely need a new API here to go from DocRep -> SegRep w/ remote storage to properly update all settings.
> SegRep to DocRep Migration
Until remote store + DocRep is supported as a standalone feature, I think it's reasonable that conversion from SegRep with remote store back to DocRep would remove remote store capabilities? With that said, I think it would be wise to support this first. If a user switches to SegRep and wishes to revert for whatever reason, the only option would be a reindex. Also, complexity-wise I think this would actually be a fairly trivial engine swap on replicas.
> The challenge here is to make the primary understand both SegRep and DocRep. We will also need to store each replica's replication mode durably. The primary will send checkpoint updates to SegRep-based replicas and documents to DocRep-based replicas.
> Currently we are sending all docs to SegRep-based indices for durability.

Are you referring to the remote translog case?
In general, for DocRep -> SegRep I think the approach of rolling restarts of replica engines is the right one. I'd imagine we would need a full recovery here so that the shard is not serving stale reads until it catches up. It would be great to do this without triggering any reallocation or failing the shard, but I don't think that exists today. An alternative here is to fetch the required segments from the primary's latest checkpoint and write them to a separate directory, but this would likely not be feasible with disk constraints.
Thanks @mch2 for the review and feedback.
> It makes sense to start with node-node, but with the lower-level components abstracting away the source of replication, I think the complexity is mostly in configuration. How are you envisioning the conversion being initiated? We would likely need a new API here to go from DocRep -> SegRep w/ remote storage to properly update all settings.
Yes, the initial idea was an API which would trigger an FSM; we might need to store the migration details in cluster state as well.
> Until remote store + DocRep is supported as a standalone feature, I think it's reasonable that conversion from SegRep with remote store back to DocRep would remove remote store capabilities?
Yes.
> With that said, I think it would be wise to support this first. If a user switches to SegRep and wishes to revert for whatever reason, the only option would be a reindex. Also, complexity-wise I think this would actually be a fairly trivial engine swap on replicas.
Agreed. Once we have all the details hashed out and a POC done, we might do this in the first phase as well.
> Currently we are sending all docs to SegRep-based indices for durability. Are you referring to the remote translog case?
I am referring to the case where we are hydrating the replica from the primary's segments. Since this is a full recovery and will take a good amount of time, the solution is not durable for 1-replica indices.
> An alternative here is to fetch the required segments from the primary's latest checkpoint and write them to a separate directory, but this would likely not be feasible with disk constraints.
We explored this as well, but due to disk constraints we didn't list it here.
Goal
OpenSearch will be launching the remote store feature and has already GA'd Segment Replication. However, this replication method and durability enhancement are only available for newly created indices. The next step is to support migrating existing indices to SegRep with Remote Store enabled.
Requirements
Functional
Non-Functional
Non-Requirements
Potential Approaches
[Recommended]
Rolling restarts of replica copies
Here we restart replicas one by one. The challenge here is to make the primary understand both SegRep and DocRep. We will also need to store each replica's replication mode durably. The primary will send checkpoint updates to SegRep-based replicas and documents to DocRep-based replicas.
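The rolling-restart approach implies a primary that can serve both kinds of replicas at once. A minimal sketch of that per-replica dispatch decision follows; the class and method names are illustrative for this sketch, not OpenSearch's actual internals:

```java
import java.util.List;
import java.util.Map;

// Sketch: during migration the primary looks up each replica's durable
// replication mode and decides what to ship on every indexing operation.
class MixedModePrimary {
    enum ReplicationMode { DOCUMENT, SEGMENT }

    // In a real implementation this mapping must be stored durably
    // (e.g. in cluster state) so it survives primary failover.
    private final Map<String, ReplicationMode> replicaModes;

    MixedModePrimary(Map<String, ReplicationMode> replicaModes) {
        this.replicaModes = replicaModes;
    }

    // DocRep copies receive the full document; SegRep copies receive only
    // a checkpoint notification and then sync segments.
    String payloadFor(String replicaId) {
        return replicaModes.get(replicaId) == ReplicationMode.DOCUMENT
                ? "document"
                : "checkpoint";
    }

    public static void main(String[] args) {
        MixedModePrimary primary = new MixedModePrimary(Map.of(
                "replica-0", ReplicationMode.DOCUMENT,  // not yet migrated
                "replica-1", ReplicationMode.SEGMENT)); // restarted as SegRep
        for (String replica : List.of("replica-0", "replica-1")) {
            System.out.println(replica + " <- " + primary.payloadFor(replica));
        }
    }
}
```

As replicas are restarted one by one, their entries flip from `DOCUMENT` to `SEGMENT` until the whole replication group is SegRep-based.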
ToDo: Exploration is still ongoing on this approach.
Enabling Remote Store & Remote Translog followed by SegRep Enablement
We would support remote segment store and remote translog for DocRep indices. This would give us the ability to store data durably even while writing to a single copy of the data. The proposed migration steps would be executed in an FSM. Below are the proposed high-level details; more details will be covered in a separate issue.
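As a rough illustration of what such an FSM could look like, here is a sketch in Java. The state names and their order are assumptions for this sketch, not the finalized design:

```java
// Illustrative migration FSM: only forward, single-step transitions are
// allowed; any unexpected transition fails the migration so the operator
// can retry from a known state.
class MigrationFsm {
    enum State {
        DOC_REP,                 // start: document replication, no remote store
        REMOTE_SEGMENTS_ENABLED, // segments uploaded to remote segment store
        REMOTE_TRANSLOG_ENABLED, // translog backed by remote translog store
        SEG_REP_ENABLED,         // replicas swapped to SegRep engines
        COMPLETED, FAILED
    }

    private State state = State.DOC_REP;

    State advance() {
        switch (state) {
            case DOC_REP:                 state = State.REMOTE_SEGMENTS_ENABLED; break;
            case REMOTE_SEGMENTS_ENABLED: state = State.REMOTE_TRANSLOG_ENABLED; break;
            case REMOTE_TRANSLOG_ENABLED: state = State.SEG_REP_ENABLED;         break;
            case SEG_REP_ENABLED:         state = State.COMPLETED;               break;
            default:                      state = State.FAILED;
        }
        return state;
    }

    public static void main(String[] args) {
        MigrationFsm fsm = new MigrationFsm();
        for (int i = 0; i < 4; i++) {
            System.out.println(fsm.advance());
        }
    }
}
```

Persisting the current state in cluster state, as discussed in the comments above, would let the FSM resume after a cluster manager failover instead of restarting the migration from scratch.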
Alternative Approaches
Bringing new replica copies w/o remote store
The following steps could help migrate an index:
The con of this approach is a regression in durability and availability guarantees: while a new replica is coming up, the shard is left with only 1 copy.
Using Remote Store for Async durability
Using Remote Translog Store for durability
We can’t just use Remote Translog for durability: the translog only covers operations not yet durably persisted in segments, so it needs to be supplemented with Remote Segment Store. Hence this is not feasible on its own.
Comparison
ToDo
Potential Issues
ToDo
Next Steps