opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Unable to enable remote state repository independently of data repositories #13523

Open BhumikaSaini-Amazon opened 2 weeks ago

BhumikaSaini-Amazon commented 2 weeks ago

Describe the bug

Enabling only the remote state repository (support added via PR #11858) starts a remote store migration. The migration does not complete, and the shards stay unassigned.

Related component

Storage

To Reproduce

  1. Launch a docrep cluster with remote state enabled. Do not enable remote segment and remote translog repositories.
  2. Create an index.
  3. Shard creation fails and shards stay unassigned.

Expected behavior

Enabling only the remote state repository should not start a migration to remote store.

Additional Details

Exception stack trace

[2024-05-02T13:17:25,436][INFO ][o.o.i.IndexService       ] [node-1] [idx1] DocRep shard [idx1][3] is migrating to remote
[2024-05-02T13:17:25,436][WARN ][o.o.i.c.IndicesClusterStateService] [node-1] [idx1][3] marking and sending shard failed due to [failed to create shard]
java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because "key" is null
    at java.base/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[?:?]
    at org.opensearch.repositories.RepositoriesService.repository(RepositoriesService.java:568) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.index.store.RemoteSegmentStoreDirectoryFactory.newDirectory(RemoteSegmentStoreDirectoryFactory.java:61) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.index.IndexService.createShard(IndexService.java:512) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.IndicesService.createShard(IndicesService.java:1025) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.IndicesService.createShard(IndicesService.java:213) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:672) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:649) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
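
The top frames of the trace suggest that RepositoriesService.repository was called with a null repository name, presumably because no segment repository is configured on the node, and that the lookup went straight into a ConcurrentHashMap. ConcurrentHashMap rejects null keys, so get(null) throws a NullPointerException. The minimal sketch below reproduces only that JDK behavior; the class name, the map, and the repository name "state-repo" are hypothetical stand-ins, not OpenSearch code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyLookup {
    // Hypothetical stand-in for a registry of repositories keyed by name.
    static final Map<String, String> REPOSITORIES = new ConcurrentHashMap<>();

    // Returns true if looking up the given name throws a NullPointerException.
    static boolean lookupThrowsNpe(String repositoryName) {
        try {
            REPOSITORIES.get(repositoryName);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        REPOSITORIES.put("state-repo", "configured");
        // A non-null name is looked up normally.
        System.out.println(lookupThrowsNpe("state-repo"));
        // A null name (no data repository configured) fails inside get().
        System.out.println(lookupThrowsNpe(null));
    }
}
```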

Proposed solution

The migration flow uses the isRemoteStoreNode() method. This method checks for the presence of any remote_store node attribute on the node: https://github.com/opensearch-project/OpenSearch/blob/ed33488aa426bd618685729fc638adad763f6ff7/server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java#L464-L471

Given that the state repo should be independent now, we should have distinct methods to identify whether the cluster state or data is remote-backed.
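
One way the split could look is sketched below. This is not the actual patch: the class and method names are hypothetical, node attributes are modeled as a plain Map rather than a DiscoveryNode, and the attribute keys are assumed to follow the remote_store.segment.repository, remote_store.translog.repository, and remote_store.state.repository naming. The idea is that the migration flow would consult only the data check, so a node carrying only the state attribute is not treated as remote-backed for data.

```java
import java.util.Map;

public class RemoteNodeAttributes {
    // Assumed attribute keys; verify against the OpenSearch source before relying on them.
    static final String SEGMENT_REPO_KEY = "remote_store.segment.repository";
    static final String TRANSLOG_REPO_KEY = "remote_store.translog.repository";
    static final String STATE_REPO_KEY = "remote_store.state.repository";

    // Hypothetical check: data is remote-backed only if both data repositories are set.
    static boolean isRemoteDataNode(Map<String, String> attributes) {
        return attributes.containsKey(SEGMENT_REPO_KEY)
            && attributes.containsKey(TRANSLOG_REPO_KEY);
    }

    // Hypothetical check: cluster state is remote-backed if the state repository is set.
    static boolean isRemoteStateNode(Map<String, String> attributes) {
        return attributes.containsKey(STATE_REPO_KEY);
    }

    public static void main(String[] args) {
        // Node from the reproduction steps: only the remote state repository is enabled.
        Map<String, String> stateOnly = Map.of(STATE_REPO_KEY, "state-repo");
        System.out.println("remote state: " + isRemoteStateNode(stateOnly));
        System.out.println("remote data:  " + isRemoteDataNode(stateOnly));
    }
}
```

With checks like these, the state-only node from the reproduction steps reports remote state but not remote data, so shard creation would not enter the DocRep-to-remote migration path.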

sulthan309 commented 2 weeks ago

Hi, I would like to contribute a fix for this bug. Can you please assign it to me?

BhumikaSaini-Amazon commented 1 week ago

Thank you @sulthan309 for volunteering!

We are tracking this bugfix for the 2.15 release. We want to get the fix merged to main and backported to 2.x by the code freeze date of 10th June.

Please check how the isRemoteStoreNode() method is used in various places. That will help identify the changes we need. If you need more info at any time, please let us know.

Looking forward to your contribution!

sulthan309 commented 1 week ago

Thank you for assigning this ticket to me.

@BhumikaSaini-Amazon Sure, I will go through the code and reach out if needed.