opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.73k stars 1.8k forks

[BUG] Close index leads to temporary red cluster until shard has started #16016

Open ashking94 opened 1 month ago

ashking94 commented 1 month ago

Describe the bug

As of today, closing an index turns the cluster red temporarily, until the shard has started again. I can see this issue on both conventional document replication clusters and remote store enabled clusters.

Logs on a remote store enabled cluster

opensearch-master1  | [2024-09-20T08:55:03,100][INFO ][o.o.p.PluginsService     ] [opensearch-master1] PluginService:onIndexModule index:[index1/h71N_-WHQcWNEjqbFctFJQ]
opensearch-master1  | [2024-09-20T08:55:03,138][INFO ][o.o.c.m.MetadataCreateIndexService] [opensearch-master1] [index1] creating index, cause [api], templates [], shards [1]/[0]
opensearch-master1  | [2024-09-20T08:55:03,141][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T08:55:03,148][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-node1    | [2024-09-20T08:55:03,219][INFO ][o.o.p.PluginsService     ] [opensearch-node1] PluginService:onIndexModule index:[index1/h71N_-WHQcWNEjqbFctFJQ]
opensearch-node1    | [2024-09-20T08:55:03,357][INFO ][o.o.i.t.RemoteFsTranslog ] [opensearch-node1] [index1][0] Downloaded data from remote translog till maxSeqNo = -1
opensearch-node1    | [2024-09-20T08:55:03,381][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=RECOVERING engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-node1    | [2024-09-20T08:55:03,381][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=RECOVERING engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-node1    | [2024-09-20T08:55:03,382][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=RECOVERING engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-node1    | [2024-09-20T08:55:03,382][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=RECOVERING engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-master1  | [2024-09-20T08:55:03,385][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index1][0]]]).
opensearch-master1  | [2024-09-20T08:55:03,388][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-node1    | [2024-09-20T08:55:03,457][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=STARTED engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-node1    | [2024-09-20T08:55:03,457][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Skipped syncing segments with primaryMode=false indexShardState=STARTED engineType=InternalEngine recoverySourceType=EMPTY_STORE primary=true
opensearch-node1    | [2024-09-20T08:55:03,458][INFO ][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] [index1][0] Scheduled retry with didRefresh=true
opensearch-master1  | [2024-09-20T08:55:03,482][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T08:55:08,198][INFO ][o.o.p.PluginsService     ] [opensearch-master1] PluginService:onIndexModule index:[index1/h71N_-WHQcWNEjqbFctFJQ]
opensearch-master1  | [2024-09-20T08:55:08,229][INFO ][o.o.c.m.MetadataMappingService] [opensearch-master1] [index1/h71N_-WHQcWNEjqbFctFJQ] create_mapping
opensearch-master1  | [2024-09-20T08:55:08,230][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-20T08:55:22,939][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] closing indices [index1/h71N_-WHQcWNEjqbFctFJQ]
opensearch-master1  | [2024-09-20T08:55:22,940][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-20T08:55:23,006][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] completed closing of indices [index1]
opensearch-master1  | [2024-09-20T08:55:23,007][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T08:55:23,010][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-20T08:55:23,073][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T08:55:23,075][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-node1    | [2024-09-20T08:55:23,140][INFO ][o.o.p.PluginsService     ] [opensearch-node1] PluginService:onIndexModule index:[index1/h71N_-WHQcWNEjqbFctFJQ]
opensearch-node1    | [2024-09-20T08:55:23,183][INFO ][o.o.i.s.IndexShard       ] [opensearch-node1] [index1][0] Downloaded translog and checkpoint files from=8 to=10
opensearch-node1    | [2024-09-20T08:55:23,207][INFO ][o.o.i.t.RemoteFsTranslog ] [opensearch-node1] [index1][0] Downloaded translog and checkpoint files from=8 to=10
opensearch-node1    | [2024-09-20T08:55:23,209][INFO ][o.o.i.t.RemoteFsTranslog ] [opensearch-node1] [index1][0] Downloaded data from remote translog till maxSeqNo = -1
opensearch-master1  | [2024-09-20T08:55:23,231][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[index1][0]]]).

Logs on doc rep clusters

opensearch-master1  | [2024-09-20T09:00:23,777][INFO ][o.o.p.PluginsService     ] [opensearch-master1] PluginService:onIndexModule index:[index1/G9Qow6fDROaCVD65DX-n0w]
opensearch-master1  | [2024-09-20T09:00:23,821][INFO ][o.o.c.m.MetadataCreateIndexService] [opensearch-master1] [index1] creating index, cause [api], templates [], shards [1]/[0]
opensearch-master1  | [2024-09-20T09:00:23,825][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-node1    | [2024-09-20T09:00:23,882][INFO ][o.o.p.PluginsService     ] [opensearch-node1] PluginService:onIndexModule index:[index1/G9Qow6fDROaCVD65DX-n0w]
opensearch-master1  | [2024-09-20T09:00:24,033][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index1][0]]]).
opensearch-master1  | [2024-09-20T09:00:24,096][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T09:00:26,347][INFO ][o.o.p.PluginsService     ] [opensearch-master1] PluginService:onIndexModule index:[index1/G9Qow6fDROaCVD65DX-n0w]
opensearch-master1  | [2024-09-20T09:00:26,378][INFO ][o.o.c.m.MetadataMappingService] [opensearch-master1] [index1/G9Qow6fDROaCVD65DX-n0w] create_mapping
opensearch-master1  | [2024-09-20T09:00:42,889][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] closing indices [index1/G9Qow6fDROaCVD65DX-n0w]
opensearch-master1  | [2024-09-20T09:00:42,949][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] completed closing of indices [index1]
opensearch-master1  | [2024-09-20T09:00:42,949][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-20T09:00:43,008][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-node1    | [2024-09-20T09:00:43,061][INFO ][o.o.p.PluginsService     ] [opensearch-node1] PluginService:onIndexModule index:[index1/G9Qow6fDROaCVD65DX-n0w]
opensearch-master1  | [2024-09-20T09:00:43,097][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[index1][0]]]).

This problem may be aggravated on remote store enabled clusters by the existing behaviour of downloading the translog from the remote store. That behaviour, however, is being fixed now.

Related component

Cluster Manager

To Reproduce

  1. Create an index
  2. Ingest some docs
  3. Close the index
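
The steps above can be sketched as a small script. This is a minimal sketch, not a verified reproduction: it assumes a local test cluster reachable at `http://localhost:9200` without security enabled, and the helper names (`call`, `collapse`, `poll_health`, `reproduce`) are hypothetical. It samples `GET /_cluster/health` from a background thread while the close request is in flight, so that a short red window is less likely to be missed.

```python
import json
import threading
import time
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local test cluster, no security
INDEX = "index1"                    # same index name as in the logs above


def call(method, path, body=None):
    """Send one JSON request to the cluster and return the decoded response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())


def collapse(samples):
    """Drop consecutive duplicates so that only health transitions remain."""
    out = []
    for status in samples:
        if not out or out[-1] != status:
            out.append(status)
    return out


def poll_health(samples, stop, interval=0.01):
    """Append the cluster health status to `samples` until `stop` is set."""
    while not stop.is_set():
        samples.append(call("GET", "/_cluster/health")["status"])
        time.sleep(interval)


def reproduce():
    # Step 1: create an index with one primary and no replicas, as in the logs.
    call("PUT", f"/{INDEX}",
         {"settings": {"number_of_shards": 1, "number_of_replicas": 0}})
    # Step 2: ingest a document.
    call("POST", f"/{INDEX}/_doc?refresh=true", {"message": "hello"})
    # Step 3: close the index while a background thread samples cluster health.
    samples, stop = [], threading.Event()
    poller = threading.Thread(target=poll_health, args=(samples, stop))
    poller.start()
    call("POST", f"/{INDEX}/_close")
    time.sleep(1.0)  # keep sampling briefly after the close request returns
    stop.set()
    poller.join()
    return collapse(samples)
```

Calling `reproduce()` against a test cluster returns the sequence of distinct health statuses observed around the close. If the behaviour described here occurs and the polling interval is short enough to catch it, the sequence would be expected to contain `red` between two `green` entries.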

Expected behavior

I am not sure whether the cluster should really turn red here. It gives a false impression that an underlying issue is causing the red cluster. IMHO the cluster should remain green while the index is being closed.

Additional Details

NA

shwetathareja commented 1 month ago

opensearch-master1 | [2024-09-20T09:00:43,097][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[index1][0]]]).

Curious why the shard was started after the index was closed. Maybe there is some race in shard allocation.

rajiv-kv commented 3 weeks ago

[Triage - attendees 1 2 3] Thanks @ashking94 for filing the issue.

Do we have the logs corresponding to the health status change from GREEN/YELLOW to RED, to help us understand the duration of the RED status? Is it possible that the close and re-open operations were performed by two consecutive reroute calls?
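
In the absence of a log line for the transition into RED, the duration can only be approximated from the logs in the issue description. The sketch below is one way to do that, assuming that the RED state begins no earlier than the "completed closing of indices" line, which the logs themselves do not confirm. The function and constant names are hypothetical; the two sample lines are copied from the remote store logs above.

```python
import re
from datetime import datetime, timedelta

# Timestamp format used in the log lines, e.g. [2024-09-20T08:55:23,006]
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")
HEALTH = re.compile(r"Cluster health status changed from \[(\w+)\] to \[(\w+)\]")

LOG = [
    "opensearch-master1  | [2024-09-20T08:55:23,006][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] completed closing of indices [index1]",
    "opensearch-master1  | [2024-09-20T08:55:23,231][INFO ][o.o.c.r.a.AllocationService] [opensearch-master1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[index1][0]]]).",
]


def timestamp(line):
    """Return the first bracketed timestamp in a log line, or None."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")


def red_window_upper_bound_ms(lines):
    """Milliseconds from 'completed closing of indices' to the next change out of RED."""
    closed_at = None
    for line in lines:
        ts = timestamp(line)
        if ts is None:
            continue
        if "completed closing of indices" in line:
            closed_at = ts
        change = HEALTH.search(line)
        if change and change.group(1) == "RED" and closed_at is not None:
            return (ts - closed_at) // timedelta(milliseconds=1)
    return None


print(red_window_upper_bound_ms(LOG))  # prints 225
```

Under the stated assumption, the remote store run was RED for at most about 225 ms. Applying the same function to the corresponding two lines of the document replication logs (09:00:42,949 and 09:00:43,097) gives 148 ms.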

ashking94 commented 3 weeks ago

Hi @rajiv-kv, I have already shared the logs in the issue description. Let me know what else you need.

dblock commented 2 weeks ago

[Catch All Triage - 1, 2, 3, 4]