opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[BUG] Stop replication fails when UltraWarm migration puts the `index.blocks.ultrawarm_allow_delete` setting on the follower index #1337

Open nisgoel-amazon opened 6 months ago

nisgoel-amazon commented 6 months ago

What is the bug? Stop replication can fail when it is delayed on the follower index because an automated snapshot is in progress.

An ISM policy is defined in the leader domain to roll over the index and migrate it to UltraWarm.

When the index in the leader domain rolls over and UltraWarm migration starts, the follower domain is notified and tries to stop replication. But if an automated snapshot is in progress, stopping replication fails with SnapshotInProgressException.

Meanwhile, the leader index moves to an UltraWarm node, and the following settings are applied to the leader index: `{"index.auto_expand_replicas":"false","index.blocks.ultrawarm_allow_delete":"true","index.refresh_interval":"-1"}`
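To make the propagation step concrete, here is a toy sketch of how a settings diff would pick up all three values. The `settings_to_sync` helper and the follower's starting settings are hypothetical and only illustrate the idea; the report does not show the plugin's real sync logic.

```python
import json

# Settings payload quoted in the report, applied to the leader index
# during UltraWarm migration.
LEADER_SETTINGS_JSON = (
    '{"index.auto_expand_replicas":"false",'
    '"index.blocks.ultrawarm_allow_delete":"true",'
    '"index.refresh_interval":"-1"}'
)


def settings_to_sync(leader, follower):
    """Hypothetical diff: leader settings whose values differ on the follower."""
    return {key: value for key, value in leader.items() if follower.get(key) != value}


leader_settings = json.loads(LEADER_SETTINGS_JSON)
# Assumed follower state before the migration; not taken from the report.
follower_settings = {"index.refresh_interval": "1s"}

delta = settings_to_sync(leader_settings, follower_settings)
for key in sorted(delta):
    print(key, "=", delta[key])
```

Under these assumptions, every setting in the payload differs from the follower's values, so all three, including `index.blocks.ultrawarm_allow_delete`, would be candidates for replication.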

Because replication is still running, these settings are also applied to the follower index. Later, when the snapshot completes, the follower index is closed in order to stop replication, but because the settings were already applied, the stop fails and the index remains closed.
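The sequence of events can be sketched as a small simulation. This is a simplified model and not the plugin's code: `FollowerIndex`, `SnapshotInProgressError`, and `run_scenario` are hypothetical names. It assumes that the `index.blocks.ultrawarm_allow_delete` setting is what makes the stop fail after the index is closed, as the issue title suggests, and that a successful stop would reopen the index.

```python
class SnapshotInProgressError(Exception):
    """Stand-in for the SnapshotInProgressException named in the report."""


class FollowerIndex:
    """Toy model of the follower index state described in the report."""

    def __init__(self):
        self.state = "open"
        self.replicating = True
        self.snapshot_running = False
        self.settings = {}

    def sync_settings(self, leader_settings):
        # Leader settings keep flowing to the follower while replication runs.
        if self.replicating:
            self.settings.update(leader_settings)

    def stop_replication(self):
        if self.snapshot_running:
            raise SnapshotInProgressError("snapshot in progress")
        self.state = "closed"  # the index is closed as part of the stop
        # Assumption: the UltraWarm block setting makes the remaining steps fail.
        if self.settings.get("index.blocks.ultrawarm_allow_delete") == "true":
            raise RuntimeError("stop replication failed; index left closed")
        self.replicating = False
        self.state = "open"  # assumption: a successful stop reopens the index


def run_scenario():
    follower = FollowerIndex()
    follower.snapshot_running = True
    try:
        follower.stop_replication()  # first attempt, blocked by the snapshot
    except SnapshotInProgressError:
        pass
    # The leader migrates to UltraWarm while replication is still active.
    follower.sync_settings({
        "index.auto_expand_replicas": "false",
        "index.blocks.ultrawarm_allow_delete": "true",
        "index.refresh_interval": "-1",
    })
    follower.snapshot_running = False
    try:
        follower.stop_replication()  # second attempt, after the snapshot
    except RuntimeError:
        pass
    return follower


follower = run_scenario()
print(follower.state, follower.replicating)
```

In this model the follower ends up closed with replication still marked as running, which matches the stuck state described above.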

What is the expected behavior? The stop replication call should succeed once the snapshot completes, and the index should not be left in a closed state.

What is your host/environment?