opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/

[Remote Store] _cat/recovery APIs provides inconsistent results #12047

Open Bukhtawar opened 7 months ago

Bukhtawar commented 7 months ago

### Describe the bug

  1. When compared with the total initializing/relocating shard counts in the _cluster/health API, _cat/recovery?active_only shows an inconsistent count of recoveries in progress.
    
    curl localhost:9200/_cluster/health?pretty
    {
      "cluster_name" : ":test-poc",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 200,
      "number_of_data_nodes" : 194,
      "discovered_master" : true,
      "discovered_cluster_manager" : true,
      "active_primary_shards" : 356,
      "active_shards" : 1045,
      "relocating_shards" : 6,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0
    }

% curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 41 35.5m peer translog xx.xx.26.41 9402b974b33058c1ae02a4b5661dda2e 172.16.33.240 c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%


  2. The translog download step doesn't populate the translog recovery stats.

curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 4 6.1m peer translog xx.xx.xx.xx 05c8acd9758c7d833fc7abd77ed74727 xx.xx.xx.xx de189fe355a94cec9e526de75d404767 n/a n/a 192 192 100.0% 209 3361023615 3361023615 100.0% 3697263527 41654806 0 0.0%
opensearch-client-ingest-2024-03-04t11 41 6.1m peer translog xx.xx.xx.xx 9402b974b33058c1ae02a4b5661dda2e xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%
opensearch-client-ingest-2024-03-04t11 82 6.1m peer translog xx.xx.xx.xx fb81f7c57a903d39d463446784b6b4f7 xx.xx.xx.xx 36c64f30e810ee73e01f8b27f914112a n/a n/a 218 218 100.0% 218 3718572999 3718572999 100.0% 3718572999 41340427 0 0.0%
opensearch-client-ingest-2024-03-04t11 88 6.1m peer translog xx.xx.xx.xx 6ccbb5a732ebb55a5bb0cb7c68ba7fa7 xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 169 169 100.0% 186 3426163926 3426163926 100.0% 3707056645 41419312 0 0.0%
opensearch-client-ingest-2024-03-04t11 97 6.1m peer translog xx.xx.xx.xx 2ccb6b6c969be5ad2ba792fcba818c88 xx.xx.xx.xx 704ce836066d1b071d681a17814a37d7 n/a n/a 214 214 100.0% 231 3549799591 3549799591 100.0% 3966964286 39712007 0 0.0%
opensearch-client-ingest-2024-03-04t11 98 6.1m peer translog xx.xx.xx.xx 64cf55465863ce20799f8885ea335347 xx.xx.xx.xx 1eef064b01e3004c2363feb81e648a81 n/a n/a 158 158 100.0% 161 4744590420 4744590420 100.0% 4997318586 32743206 0 0.0%
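For clarity, the translog recovery stats in question are the translog_ops, translog_ops_recovered, and translog_ops_percent columns of _cat/recovery; a narrowed query such as the following (an illustrative sketch, not part of the original report) makes them easier to inspect:

curl 'localhost:9200/_cat/recovery?active_only&v&h=index,shard,stage,translog_ops,translog_ops_recovered,translog_ops_percent'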



### Related component

Storage:Remote

### Expected behavior

Consistent API results
peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5 6 7 8] @Bukhtawar Thanks for filing this issue; this is a very confusing experience and it would be good to address.

lukas-vlcek commented 7 months ago

Hi, I would like to take this ticket. Is there a way to reproduce the bug?

While conducting my investigation, I welcome your insights and recommendations on specific areas to focus on.

Bukhtawar commented 7 months ago

Yes, I believe this should be reproducible on a multi-node setup hosting shards of a few hundred MBs: exclude the IP of one node so that the shards on that node start relocating, then compare the initializing/relocating counts from _cluster/health with the output of _cat/recovery?active_only to see the discrepancy.

lukas-vlcek commented 7 months ago

@Bukhtawar Thanks! Could you elaborate a bit on "exclude IP of one node"? Do you mean excluding the node from shard allocation?

Would the following scenario be a good candidate?

Imagine a two-node cluster, with Node 1 holding three primary shards and Node 2 being empty.

flowchart TB
    Primary_A
    Primary_B
    Primary_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    end

Next, we exclude Node 1 from shard allocation:

PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.exclude._ip" : "_Node 1 IP_"
  }
}

This should (if I am not mistaken) trigger replication of all shards from Node 1 to Node 2.

flowchart TB
    Primary_A--"Replicating"-->Replica_A
    Primary_B--"Replicating"-->Replica_B
    Primary_C--"Replicating"-->Replica_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    Replica_A
    Replica_B
    Replica_C
    end

Now, while shards are being replicated, we can request _cluster/health and _cat/recovery?active_only (as discussed previously) and that should give us inconsistent counts, correct?

I assume we need the shards to be of a "larger size" only to make sure the replication activity takes some time (enough time for us to request the counts and compare them). How about if we instead throttle the data transfer rate for replication (see the settings sketch below)? Shards could then be quite small but would still take some time to replicate. Do you think this would also lead to reproducing the issue?

The point is that if using throttling is possible then we should be able to implement a regular unit test.
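For illustration, a rough sketch of the kind of throttle I have in mind, assuming the dynamic indices.recovery.max_bytes_per_sec setting (the value here is arbitrary):

PUT _cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "512kb"
  }
}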

lukas-vlcek commented 6 months ago

Hi @Bukhtawar

I was looking at this and I found that the following integration test is already testing something very similar:

./server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexRecoveryIT.java

For example, it has a test called testRerouteRecovery() that uses the following scenario:

  1. It starts a cluster with a single node (A).
  2. It creates a new index with a single shard and no replicas.
  3. Then it adds a new node (B) to the cluster.
  4. Then it slows down recoveries (using RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING).
  5. Then it forces relocation of the shard using an "admin.cluster.reroute" request (a slightly different strategy than discussed above, but it still triggers the recovery process; see the REST sketch after this list).
  6. It checks the count of active (i.e. stage != DONE) shard recoveries, etc.
  7. ...
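For reference, a REST-level sketch of that reroute step (the index and node names here are placeholders, not values taken from the test):

POST _cluster/reroute
{
  "commands" : [
    {
      "move" : {
        "index" : "test-index",
        "shard" : 0,
        "from_node" : "node_A",
        "to_node" : "node_B"
      }
    }
  ]
}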

I experimented with modifying some of these tests, adding an "admin.cluster.health" request to get the initializing and relocating shard counts, and so far I have not been able to spot/replicate the count discrepancy.

Do you think it could be because the size of the index in the test is quite small (just a couple of hundred KBs)? Though the test explicitly makes sure the counts are obtained while the recovery process is throttled and the shard recovery stage is not DONE (in other words, the counts are compared while the recovery is still running).

However, there is still another question I wanted to ask. Did you have anything specific in mind when you said:

The translog download step doesn't populate the translog recovery stats

Can you elaborate on this please?

I will push a modification of the test tomorrow so that you can see what I mean.

lukas-vlcek commented 6 months ago

@Bukhtawar

Please see https://github.com/opensearch-project/OpenSearch/pull/12792. I believe it is a very detailed attempt to reproduce the issue. Unfortunately, it does not currently reproduce it (the test passes, which means the issue does not materialize).

Can you think of some hints about what to change in order to recreate the issue?

For example, do you think the shard recovery stage has to be Stage.TRANSLOG? Notice that in the IT the stage is currently Stage.INDEX.

Bukhtawar commented 6 months ago

Adding @sachinpkale for his thoughts as well. Will take a look shortly

shourya035 commented 2 weeks ago

@sachinpkale @Bukhtawar This PR is waiting on your inputs. Can you bring this to closure?