opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[Issue] Cross cluster replication is too slow between Opensearch clusters deployed on two separate kubernetes clusters. #1385

Open VinDataR opened 3 months ago

VinDataR commented 3 months ago

Describe the bug

Environment information:

VERSION: opensearch:2.9.0-release-4.14.0-29.12.2023

We have two opensearch clusters deployed on two different Kubernetes clusters in Azure.

We have enabled cross-cluster replication between these two OpenSearch clusters deployed on two separate Kubernetes clusters.

Whenever we push documents into indices on the leader cluster, it takes much longer than expected for them to be replicated. We have tried with indices that are only a few kilobytes in size as well as larger indices that are several gigabytes.

In both cases, there is a definite lag between the time the documents are pushed on the leader site and the time the same documents become available on the follower site.

We pushed an index 'projecttask' on the leader site. These are the stats from the leader site:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open projecttask atOBXJVWT_6wp7cwV_Y0Mw 3 2 4657988 1786411 51.3gb 17.1gb

While monitoring the remote site, the replication status on the follower site showed BOOTSTRAPPING for a long time. The status now shows SYNCING, but the overall replication took a long time.

Below is the output of the replication status API captured at several points yesterday. As the bytes_percent and files_percent values show, the bootstrap progressed very gradually. Seeking help on how to troubleshoot this further and speed up the replication.

curl -XGET -k -u ':' 'http://localhost:9200/_plugins/_replication/projecttask/_status?pretty'

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 209247
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 62150110,
    "bytes_percent" : 0.50800186,
    "files_total" : 419,
    "files_recovered" : 12,
    "files_percent" : 2.8643699,
    "start_time" : 1714390992205,
    "running_time" : 229621
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 1846339
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 166715937,
    "bytes_percent" : 1.361734,
    "files_total" : 419,
    "files_recovered" : 29,
    "files_percent" : 6.9193783,
    "start_time" : 1714390992205,
    "running_time" : 2521097
  }
}

{
  "status" : "BOOTSTRAPPING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 2161975,
    "follower_checkpoint" : 2161975,
    "seq_no" : 2161975
  },
  "bootstrap_details" : {
    "bytes_total" : 12246723560,
    "bytes_recovered" : 282059954,
    "bytes_percent" : 2.3038514,
    "files_total" : 419,
    "files_recovered" : 31,
    "files_percent" : 7.391076,
    "start_time" : 1714390992205,
    "running_time" : 2800792
  }
}

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask",
  "follower_index" : "projecttask",
  "syncing_details" : {
    "leader_checkpoint" : 6474917,
    "follower_checkpoint" : 6474917,
    "seq_no" : 6474917
  }
} 

curl -XGET -vkgu admin:admin "http://localhost:9200/_cluster/settings?pretty"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_cluster/settings?pretty HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 471
<
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "my-connection-test" : {
          "mode" : "proxy",
          "proxy_address" : "*.*.*.100:9300"
        }
      }
    },
    "plugins" : {
      "index_state_management" : {
        "template_migration" : {
          "control" : "-1"
        }
      }
    },
    "logger" : {
      "org" : {
        "opensearch" : {
          "transport" : "DEBUG"
        }
      }
    }
  },
  "transient" : { }
}

curl -XGET -vkgu admin:admin "http://localhost:9200/_plugins/_replication/autofollow_stats"

{
  "num_success_start_replication" : 2,
  "num_failed_start_replication" : 0,
  "num_failed_leader_calls" : 0,
  "failed_indices" : [ ],
  "autofollow_stats" : [
    {
      "name" : "replicate-all-rule",
      "pattern" : "*",
      "num_success_start_replication" : 2,
      "num_failed_start_replication" : 0,
      "num_failed_leader_calls" : 0,
      "failed_indices" : [ ],
      "last_execution_time" : 1714468771677
    }
  ]
}

curl -XGET -vkgu admin:admin "http://localhost:9200/_plugins/_replication/follower_stats"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 127.0.0.1:9200...
* Connected to localhost (127.0.0.1) port 9200
* Server auth using Basic with user 'admin'
> GET /_plugins/_replication/follower_stats HTTP/1.1
> Host: localhost:9200
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 1844
<
{
"num_syncing_indices" : 4,
"num_bootstrapping_indices" : 0,
"num_paused_indices" : 2,
"num_failed_indices" : 1,
"num_shard_tasks" : 6,
"num_index_tasks" : 4,
"operations_written" : 0,
"operations_read" : 0,
"failed_read_requests" : 60,
"throttled_read_requests" : 0,
"failed_write_requests" : 0,
"throttled_write_requests" : 0,
"follower_checkpoint" : 6475327,
"leader_checkpoint" : 6475327,
"total_write_time_millis" : 0,
"index_stats" : {
  "security-auditlog-2024.04.08" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 11,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 18,
    "leader_checkpoint" : 18,
    "total_write_time_millis" : 0
  },
  "projecttask" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 8,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 6474917,
    "leader_checkpoint" : 6474917,
    "total_write_time_millis" : 0
  },
  "test-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 9,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 392,
    "leader_checkpoint" : 392,
    "total_write_time_millis" : 0
  },
  "auto-01" : {
    "operations_written" : 0,
    "operations_read" : 0,
    "failed_read_requests" : 32,
    "throttled_read_requests" : 0,
    "failed_write_requests" : 0,
    "throttled_write_requests" : 0,
    "follower_checkpoint" : 0,
    "leader_checkpoint" : 0,
    "total_write_time_millis" : 0
  }
  }
}

Related component

Plugins

To Reproduce

  1. Enable auto-follow on the follower site (the rule used is sketched after this list).
  2. Create a new index on the leader site.
  3. Check that the index is replicated on the follower site.
  4. Check the replication status, which stays in BOOTSTRAPPING for the test index; it takes a long time for all of the data to be replicated.
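
For reference, the auto-follow rule on the follower site was created roughly as follows (a sketch; the all_access roles are an assumption and may differ in your security configuration):

curl -XPOST -k -H 'Content-Type: application/json' -u '<user>:<password>' 'http://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
  "leader_alias" : "my-connection-test",
  "name" : "replicate-all-rule",
  "pattern" : "*",
  "use_roles" : {
    "leader_cluster_role" : "all_access",
    "follower_cluster_role" : "all_access"
  }
}'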

Expected behavior

Replicating the leader index to the follower site should not take hours.


VinDataR commented 3 months ago

We deleted the follower index again and it was recreated, but it took several hours to replicate 14 GB of index data to the secondary site. The replication status is now showing as below:

{
  "status" : "PAUSED",
  "reason" : "AutoPaused:  + [[projecttask-clone][0] - org.opensearch.index.seqno.RetentionLeaseNotFoundException - \"retention lease with ID [replication:dev2-opensearch2x-cluster:uWToVkHzTomO2pkkqmbMSg:[projecttask-clone][0]] not found\"], ",
  "leader_alias" : "my-connection-test",
  "leader_index" : "projecttask-clone",
  "follower_index" : "projecttask-clone"
}
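
(For completeness: once the underlying issue is addressed, an auto-paused index can be resumed with the plugin's resume API; a minimal sketch below, though it may pause again if the retention lease on the leader is still missing.)

curl -XPOST -k -H 'Content-Type: application/json' -u '<user>:<password>' 'http://localhost:9200/_plugins/_replication/projecttask-clone/_resume?pretty' -d '{}'
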
VinDataR commented 3 months ago

Can someone please acknowledge and help with this issue?

ankitkala commented 3 months ago

There isn't an easy way to figure out why performance is slow here. My suspicion is that either:

  1. the leader cluster is slow and taking too long to respond, or
  2. the network is slow and the file copy is taking too long.
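
One way to narrow this down is to sample the status API during bootstrap and compute the effective copy rate from the bytes_recovered deltas; a rough sketch (index name and credentials are placeholders):

# Print the approximate bootstrap copy rate (KB/s) once a minute,
# based on the bytes_recovered field of the replication status API.
prev=""
while true; do
  cur=$(curl -s -k -u '<user>:<password>' \
    'http://localhost:9200/_plugins/_replication/projecttask/_status?pretty' \
    | sed -n 's/.*"bytes_recovered" *: *\([0-9][0-9]*\).*/\1/p')
  if [ -n "$cur" ] && [ -n "$prev" ]; then
    echo "$(date +%T) approx $(( (cur - prev) / 60 / 1024 )) KB/s"
  fi
  prev="$cur"
  sleep 60
done

If the rate is far below what the link between the two Kubernetes clusters should sustain, the bottleneck is more likely the network path (or a connection limit) than the leader cluster itself.
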
VinDataR commented 3 months ago

@ankitkala ,

We were able to trace the source of the issue to the number of socket connections to the remote cluster.

 $ curl -XGET -k -H 'Content-Type: application/json' -u '****:******' 'http://localhost:9200/_remote/info?pretty'

{
  "Dev2-to-uat" : {
    "connected" : true,
    "mode" : "proxy",
    "proxy_address" : "*.*.*.*:9300",
    "server_name" : "",
    "num_proxy_sockets_connected" : 18,
    "max_proxy_socket_connections" : 18,
    "initial_connect_timeout" : "3m",
    "skip_unavailable" : false
  }
}

Here we could see that num_proxy_sockets_connected had reached max_proxy_socket_connections, which is 18. We could also see the replication tasks stuck at this number:

$ curl -XGET -u ****:***** 'http://localhost:9200/_cat/tasks?pretty'
cluster:admin/plugins/replication/autofollow[c] VvoF6xL_RaiFJ9lxCS0aDg:507516  cluster:1                      persistent 1715341742736 11:49:02 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            VvoF6xL_RaiFJ9lxCS0aDg:507533  cluster:2                      persistent 1715341742993 11:49:02 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:507579  cluster:3                      persistent 1715341743980 11:49:03 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            L-gJPsFhSLGaUig2CaNp6w:478218  cluster:4                      persistent 1715342823679 12:07:03 2.9d        *.*.9.14   platform-opensearch2x-data-4
cluster:indices/shards/replication[c]           aP9XS6j7TBm0o34pqp5zMg:472346  cluster:5                      persistent 1715343888989 12:24:48 2.9d        *.*.11.35  platform-opensearch2x-data-0
cluster:indices/admin/replication[c]            aP9XS6j7TBm0o34pqp5zMg:1196554 cluster:6                      persistent 1715581542551 06:25:42 5h          *.*.11.35  platform-opensearch2x-data-0
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:1326777 cluster:7                      persistent 1715581543309 06:25:43 5h          *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            GQ_QGxxEQBizEIv8VCbeng:1177761 cluster:9                      persistent 1715586495397 07:48:15 3.6h        *.*.16.104 platform-opensearch2x-data-2
cluster:indices/shards/replication[c]           3_Ir3geFTpCrV1AirO8g_A:1271639 cluster:13                     persistent 1715586686679 07:51:26 3.6h        *.*.9.112  platform-opensearch2x-data-3
cluster:indices/admin/replication[c]            3_Ir3geFTpCrV1AirO8g_A:1290991 cluster:14                     persistent 1715592123245 09:22:03 2h          *.*.9.112  platform-opensearch2x-data-3
cluster:indices/admin/replication[c]            L-gJPsFhSLGaUig2CaNp6w:1312434 cluster:15                     persistent 1715592123638 09:22:03 2h          *.*.9.14   platform-opensearch2x-data-4
cluster:indices/shards/replication[c]           GQ_QGxxEQBizEIv8VCbeng:1201388 cluster:16                     persistent 1715592124436 09:22:04 2h          *.*.16.104 platform-opensearch2x-data-2
cluster:indices/shards/replication[c]           aP9XS6j7TBm0o34pqp5zMg:1227312 cluster:17                     persistent 1715592184310 09:23:04 2h          *.*.11.35  platform-opensearch2x-data-0
cluster:indices/admin/replication[c]            VvoF6xL_RaiFJ9lxCS0aDg:1377296 cluster:18                     persistent 1715598672724 11:11:12 16.6m       *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            aP9XS6j7TBm0o34pqp5zMg:1246481 cluster:19                     persistent 1715598673055 11:11:13 16.6m       *.*.11.35  platform-opensearch2x-data-0
cluster:monitor/tasks/lists                     FqZ867VYQfyWO6a6ABrFYA:1112689 -                              transport  1715599674584 11:27:54 984.4micros *.*.10.214 platform-opensearch2x-master-0

We first increased proxy_socket_connections to 100, and all the pending indices were replicated immediately. (NOTE: the cluster settings API did not accept max_proxy_socket_connections as a setting name but did accept proxy_socket_connections.)

{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "Dev2-to-uat" : {
          "mode" : "proxy",
          "proxy_address" : "*.*.*.*:9300",
          "max_proxy_socket_connections" : "100"
        }
      }
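
The full update was applied roughly as follows (a sketch of the request; the proxy address is redacted and proxy_socket_connections is assumed to be the accepted setting key, per the note above):

curl -XPUT -k -H 'Content-Type: application/json' -u '<user>:<password>' 'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "Dev2-to-uat" : {
          "mode" : "proxy",
          "proxy_address" : "<proxy-host>:9300",
          "proxy_socket_connections" : "100"
        }
      }
    }
  }
}'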

We could see new replication tasks getting created and the pending indices being replicated successfully.

$ curl -XGET -u admin:admin 'http://localhost:9200/_cat/tasks?pretty'
cluster:admin/plugins/replication/autofollow[c] VvoF6xL_RaiFJ9lxCS0aDg:507516  cluster:1                      persistent 1715341742736 11:49:02 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            VvoF6xL_RaiFJ9lxCS0aDg:507533  cluster:2                      persistent 1715341742993 11:49:02 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:507579  cluster:3                      persistent 1715341743980 11:49:03 2.9d        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            L-gJPsFhSLGaUig2CaNp6w:478218  cluster:4                      persistent 1715342823679 12:07:03 2.9d        *.*.9.14   platform-opensearch2x-data-4
cluster:indices/shards/replication[c]           aP9XS6j7TBm0o34pqp5zMg:472346  cluster:5                      persistent 1715343888989 12:24:48 2.9d        *.*.11.35  platform-opensearch2x-data-0
cluster:indices/admin/replication[c]            aP9XS6j7TBm0o34pqp5zMg:1196554 cluster:6                      persistent 1715581542551 06:25:42 5.3h        *.*.11.35  platform-opensearch2x-data-0
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:1326777 cluster:7                      persistent 1715581543309 06:25:43 5.3h        *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            GQ_QGxxEQBizEIv8VCbeng:1177761 cluster:9                      persistent 1715586495397 07:48:15 3.9h        *.*.16.104 platform-opensearch2x-data-2
cluster:indices/shards/replication[c]           3_Ir3geFTpCrV1AirO8g_A:1271639 cluster:13                     persistent 1715586686679 07:51:26 3.8h        *.*.9.112  platform-opensearch2x-data-3
cluster:indices/admin/replication[c]            3_Ir3geFTpCrV1AirO8g_A:1290991 cluster:14                     persistent 1715592123245 09:22:03 2.3h        *.*.9.112  platform-opensearch2x-data-3
cluster:indices/admin/replication[c]            L-gJPsFhSLGaUig2CaNp6w:1312434 cluster:15                     persistent 1715592123638 09:22:03 2.3h        *.*.9.14   platform-opensearch2x-data-4
cluster:indices/shards/replication[c]           GQ_QGxxEQBizEIv8VCbeng:1201388 cluster:16                     persistent 1715592124436 09:22:04 2.3h        *.*.16.104 platform-opensearch2x-data-2
cluster:indices/shards/replication[c]           aP9XS6j7TBm0o34pqp5zMg:1227312 cluster:17                     persistent 1715592184310 09:23:04 2.3h        *.*.11.35  platform-opensearch2x-data-0
cluster:indices/admin/replication[c]            VvoF6xL_RaiFJ9lxCS0aDg:1377296 cluster:18                     persistent 1715598672724 11:11:12 33.3m       *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            aP9XS6j7TBm0o34pqp5zMg:1246481 cluster:19                     persistent 1715598673055 11:11:13 33.3m       *.*.11.35  platform-opensearch2x-data-0
cluster:indices/admin/replication[c]            GQ_QGxxEQBizEIv8VCbeng:1227566 cluster:20                     persistent 1715600571948 11:42:51 1.7m        *.*.16.104 platform-opensearch2x-data-2
cluster:indices/admin/replication[c]            3_Ir3geFTpCrV1AirO8g_A:1321408 cluster:21                     persistent 1715600630643 11:43:50 43.3s       *.*.9.112  platform-opensearch2x-data-3
cluster:indices/shards/replication[c]           L-gJPsFhSLGaUig2CaNp6w:1347842 cluster:22                     persistent 1715600631184 11:43:51 42.8s       *.*.9.14   platform-opensearch2x-data-4
cluster:indices/shards/replication[c]           aP9XS6j7TBm0o34pqp5zMg:1252379 cluster:23                     persistent 1715600631643 11:43:51 42.3s       *.*.11.35  platform-opensearch2x-data-0
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:1383053 cluster:24                     persistent 1715600633473 11:43:53 40.5s       *.*.14.202 platform-opensearch2x-data-1
cluster:indices/admin/replication[c]            VvoF6xL_RaiFJ9lxCS0aDg:1383188 cluster:25                     persistent 1715600661004 11:44:21 13s         *.*.14.202 platform-opensearch2x-data-1
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:1383233 cluster:26                     persistent 1715600661641 11:44:21 12.3s       *.*.14.202 platform-opensearch2x-data-1
cluster:indices/shards/replication[c]           VvoF6xL_RaiFJ9lxCS0aDg:1383299 cluster:27                     persistent 1715600670570 11:44:30 3.4s        *.*.14.202 platform-opensearch2x-data-1
cluster:monitor/tasks/lists                     FqZ867VYQfyWO6a6ABrFYA:1115330 -                              transport  1715600674028 11:44:34 1ms         *.*.10.214 platform-opensearch2x-master-0

However, when we now check the _remote/info API, num_proxy_sockets_connected and max_proxy_socket_connections show the same number.

$ curl -XGET -k -H 'Content-Type: application/json' -u '****:*****' 'http://localhost:9200/_remote/info?pretty'
{
  "Dev2-to-uat" : {
    "connected" : true,
    "mode" : "proxy",
    "proxy_address" : "*.*.*.*:9300",
    "server_name" : "",
    "max_proxy_socket_connections" : 100,
    "max_proxy_socket_connections" : 100,
    "initial_connect_timeout" : "3m",
    "skip_unavailable" : false
  }
}

Now, when I try to create a new index on the leader, it does not get replicated until I increase proxy_socket_connections to a higher value, thereby allowing the follower to create new replication connections. I tried digging around but couldn't find much documentation on proxy mode for remote connections in OpenSearch.

Is there any way we can set just max_proxy_socket_connections without affecting num_proxy_sockets_connected?

dblock commented 2 months ago

Catch All Triage - 1 2 3 4 5 6