zalseryani commented 6 months ago

Describing The Case.

Insert 1000 messages (documents) into the leader index on production.
Configure replication between production and DR.
Verify that replication has fully completed (leader and follower are in sync).
Take a snapshot of the leader index.
Insert another 1000 messages into the leader index.
Restore the snapshot taken at 1000 messages (not 2000) onto the same index, after closing it. This simulates a point-in-time recovery or an upgrade failure: the documentation states that if an upgrade fails, an OpenSearch cluster cannot be downgraded, and the only option is to create a new cluster and restore from the latest snapshot taken before the upgrade.
In this case, replication between production and DR has no way of knowing that the production index was restored from a backup, so the DR follower index ends up with a follower_checkpoint higher than the leader_checkpoint.
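The divergence described above can be checked from the replication status of the follower index. The following is a minimal sketch, assuming the response of `GET /_plugins/_replication/<follower-index>/_status` still includes a `syncing_details` object with `leader_checkpoint` and `follower_checkpoint`. The helper name `follower_ahead`, the index name, the connection alias, and the checkpoint values in the sample payload are hypothetical and only illustrate the scenario.

```python
def follower_ahead(status: dict) -> bool:
    """Return True when the follower checkpoint is ahead of the leader checkpoint."""
    details = status.get("syncing_details") or {}
    leader = details.get("leader_checkpoint")
    follower = details.get("follower_checkpoint")
    # Without both checkpoints there is nothing to compare.
    if leader is None or follower is None:
        return False
    return follower > leader


# Illustrative payload only: the names and numbers are assumptions, not real output.
sample_status = {
    "status": "SYNCING",
    "leader_alias": "prod-connection",
    "leader_index": "messages",
    "follower_index": "messages",
    "syncing_details": {"leader_checkpoint": 999, "follower_checkpoint": 1999},
}

if follower_ahead(sample_status):
    print("Follower is ahead of the leader; the leader was probably restored.")
```

A healthy follower should report a checkpoint less than or equal to the leader's, so a `True` result here is a hint that the leader index was rolled back underneath the replication.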

Is there a solution for this? I noticed that Elasticsearch has a _ccr API that allows you to update the follower index checkpoint. Or do I need to stop replication between prod and DR, delete the follower index, and start index replication again between the leader (in the prod cluster) and the follower index (in the DR cluster)? What would the impact be if the prod/leader index holds a large amount of data?

If there is another best-practice way to handle such a case, which might occur in production, kindly advise. Much respect.

Related component

Indexing:Replication

To Reproduce

Already covered in the description above.

Expected behavior

Not sure whether the expected behavior would be physical replication and, if so, how long it would take to develop.

Additional Details

OpenSearch version 2.11.1. The OpenSearch prod and DR sites are deployed on Kubernetes.

mgodwan commented 6 months ago

@opensearch-project/engineering-effectiveness Can you help move this issue to github.com/opensearch-project/cross-cluster-replication?

ankitkala commented 6 months ago

Hi, restoring a snapshot breaks data consistency between the leader and follower domains. The only way to mitigate this is to stop replication, delete the follower index, and start replication again.
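The three steps above can be sketched as an ordered list of REST calls to send to the follower (DR) cluster. This is only a sketch: it builds the requests without sending them, and the function name `rebuild_follower_plan`, the index names, and the connection alias are hypothetical. The paths are the replication plugin's stop and start endpoints; `use_roles` is only needed when the security plugin is enabled, so it is left optional here.

```python
import json
from typing import Optional


def rebuild_follower_plan(
    follower_index: str,
    leader_index: str,
    leader_alias: str,
    use_roles: Optional[dict] = None,
) -> list:
    """Build the (method, path, body) calls that rebuild a follower index from scratch."""
    start_body = {"leader_alias": leader_alias, "leader_index": leader_index}
    if use_roles is not None:
        start_body["use_roles"] = use_roles
    return [
        # 1. Stop replication; the follower becomes a regular index.
        ("POST", f"/_plugins/_replication/{follower_index}/_stop", {}),
        # 2. Delete the follower index, which has diverged from the restored leader.
        ("DELETE", f"/{follower_index}", None),
        # 3. Start replication again; the follower is bootstrapped from the leader.
        ("PUT", f"/_plugins/_replication/{follower_index}/_start", start_body),
    ]


# Hypothetical names, for illustration only.
plan = rebuild_follower_plan("messages", "messages", "prod-connection")
for method, path, body in plan:
    print(method, path, "" if body is None else json.dumps(body))
```

Because the follower is bootstrapped again from the leader, the full index is copied across clusters, so for a large leader index the rebuild time depends on index size and the bandwidth between the prod and DR sites.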

opensearch-project / cross-cluster-replication

[BUG] Cross Site Replication Fails to Replicate Restored Data on Production #1336
