opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[BUG] Auto follow error #1309

Closed ivan32rus closed 7 months ago

ivan32rus commented 8 months ago

What is the bug? Cross-cluster replication with auto-follow (replication starts automatically)

Leader (connection-cluster-TEST01) installation: 6 VMs, 3 master and 3 data
Follower (connection-cluster-TEST02) installation: 6 VMs, 3 master and 3 data

After replication fails, it cannot be resumed. The attempt fails with ResourceAlreadyExistsException[task with id {replication:index:TEST-INDEX} already exist]

How can one reproduce the bug? Steps to reproduce the behavior:

  1. On the Leader:
     POST /_plugins/_replication/_autofollow?pretty
     { "leader_alias" : "connection-cluster-TEST01", "name": "index-TEST01", "pattern": "index-001-v1-*", "use_roles":{ "leader_cluster_role": "all_access", "follower_cluster_role": "all_access" } }
  2. Replication fails (external factor: loss of connectivity between clusters).
  3. Replication stops and does not recover, even after some time.
  4. I re-create replication, first deleting the old data:
     4.1 DELETE /_plugins/_replication/_autofollow?pretty
         { "leader_alias" : "connection-cluster-TEST01", "name": "index-TEST01" }
     4.2 POST /_plugins/_replication/index-001-v1-2024.01/_stop
         {}
     4.3 DELETE index-001-v1-2024.01
  5. Again on the Leader:
     POST /_plugins/_replication/_autofollow?pretty
     { "leader_alias" : "connection-cluster-TEST01", "name": "index-TEST01", "pattern": "index-001-v1-*", "use_roles":{ "leader_cluster_role": "all_access", "follower_cluster_role": "all_access" } }
  6. I get an error:
     [2024-01-18T14:14:21,815][ERROR][o.o.r.a.i.TransportReplicateIndexClusterManagerNodeAction] [node-dn03] Failed to trigger replication for index-001-v1-2024.01 - ResourceAlreadyExistsException[task with id {replication:index:index-001-v1-2024.01} already exist]
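
For anyone replaying the delete, stop, delete, re-create sequence from the steps above, it can help to write the calls down as data first. The following is a minimal sketch, assuming Python. `build_recreate_sequence` is a hypothetical helper, not part of the replication plugin. It only assembles the method, path, and body of each call from the endpoints and values quoted in the report, and it sends nothing.

```python
import json

AUTOFOLLOW_PATH = "/_plugins/_replication/_autofollow"


def build_recreate_sequence(leader_alias, rule_name, pattern, follower_index,
                            role="all_access"):
    """Return the (method, path, body) calls from the report, in order.

    Hypothetical helper for illustration only: it builds the requests
    but does not send them to any cluster.
    """
    # Identifies the auto-follow rule; used as the DELETE body.
    rule_ref = {"leader_alias": leader_alias, "name": rule_name}
    # Full rule definition; used as the POST body.
    create_body = dict(
        rule_ref,
        pattern=pattern,
        use_roles={
            "leader_cluster_role": role,
            "follower_cluster_role": role,
        },
    )
    return [
        ("DELETE", AUTOFOLLOW_PATH, rule_ref),            # remove the rule
        ("POST", f"/_plugins/_replication/{follower_index}/_stop", {}),
        ("DELETE", f"/{follower_index}", None),           # drop the index
        ("POST", AUTOFOLLOW_PATH, create_body),           # re-create the rule
    ]


if __name__ == "__main__":
    calls = build_recreate_sequence(
        "connection-cluster-TEST01",
        "index-TEST01",
        "index-001-v1-*",
        "index-001-v1-2024.01",
    )
    for method, path, body in calls:
        print(method, path, "" if body is None else json.dumps(body))
```

Keeping the sequence as plain tuples makes it easy to log exactly what was sent when reporting a failure like the one above.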

But there is no such index! And I can't find the task in order to delete it! This problem recurs with different indexes every time connectivity between the clusters is disrupted!

What is your host/environment?
OS: Linux
OpenSearch distribution version: 2.3.0
Plugin version: 2.3.0.0

P.S. These issues may be related, but the solutions in them don't help me:
https://github.com/opensearch-project/cross-cluster-replication/issues/840
https://github.com/opensearch-project/cross-cluster-replication/issues/202

nisgoel-amazon commented 7 months ago

Hi @ivan32rus, can you check the cluster state after stopping the replication to see whether the task is still present in persistent_tasks? Also check the value of the assignment key in the persistent task object for the replication:index:index-001-v1-2024.01 task. Running this command returns the persistent tasks in the cluster:

curl 'localhost:9200/_cluster/state?filter_path=**.persistent_tasks'
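
Reading the assignment of each replication task out of that response by eye gets tedious on a busy cluster. Here is a minimal sketch, assuming Python, of how the output could be filtered. It assumes the response nests the tasks under `metadata.persistent_tasks.tasks` and that each task carries an `id` and an `assignment` object with an `executor_node` field; verify this against your own output. The function names are hypothetical.

```python
import json


def replication_task_assignments(cluster_state):
    """Map each replication persistent task id to its executor node.

    Assumption: tasks live under metadata.persistent_tasks.tasks and
    replication task ids start with "replication:". A value of None
    means the task has no executor node assigned.
    """
    tasks = (
        cluster_state.get("metadata", {})
        .get("persistent_tasks", {})
        .get("tasks", [])
    )
    return {
        t["id"]: (t.get("assignment") or {}).get("executor_node")
        for t in tasks
        if t.get("id", "").startswith("replication:")
    }


def unassigned(assignments):
    """Return the sorted ids of tasks with no executor node."""
    return sorted(tid for tid, node in assignments.items() if node is None)


if __name__ == "__main__":
    import sys

    # Usage: pipe the output of the curl command above into this script.
    state = json.load(sys.stdin)
    for task_id in unassigned(replication_task_assignments(state)):
        print("unassigned:", task_id)
```

A task that shows up as unassigned here is still registered in the cluster state even though no node is running it.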

ivan32rus commented 7 months ago

Hi @nisgoel-amazon! I see this entry:

"assignment" : { "executor_node" : null

Restarting replication does not help.

So the question is: is there any way to remove the hung persistent_tasks associated with CCR?

As I understand it, there is a solution here? https://github.com/opensearch-project/cross-cluster-replication/pull/905

Related question https://github.com/opensearch-project/cross-cluster-replication/issues/840

We are really looking forward to the solution in "Remove all stale replication tasks from cluster state #905".

Thanks!

nisgoel-amazon commented 7 months ago

Hi @ivan32rus @monusingh-1, your question has already been answered in https://github.com/opensearch-project/cross-cluster-replication/issues/840. Can we close this ticket?

ivan32rus commented 7 months ago

Thank you, yes of course!!!