opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Optimise Logging Cluster State Health Changes #14647

Open Bukhtawar opened 1 month ago

Bukhtawar commented 1 month ago

Is your feature request related to a problem? Please describe

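For a reasonably large cluster with 500k shards, logging cluster health changes after every reroute operation becomes expensive. In the hot-threads output below, the busy clusterManagerService#updateTask thread is captured in RoutingTable.allShards, called from the ClusterStateHealth constructor, which iterates over every shard in the routing table.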

96.7% (9.6s out of 10s) cpu usage by thread 'opensearch[74e0b23bcf51c21918e96f38f93e1491][clusterManagerService#updateTask][T#1]'
     2/10 snapshots sharing following 22 elements
       java.base@17.0.9/java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1053)
       app//org.opensearch.cluster.routing.RoutingTable.allShards(RoutingTable.java:245)
       app//org.opensearch.cluster.routing.RoutingTable.allShards(RoutingTable.java:225)
       app//org.opensearch.cluster.health.ClusterStateHealth.<init>(ClusterStateHealth.java:138)
       app//org.opensearch.cluster.health.ClusterStateHealth.<init>(ClusterStateHealth.java:77)
       app//org.opensearch.cluster.routing.allocation.AllocationService.buildResultAndLogHealthChange(AllocationService.java:186)
       app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:528)
       app//org.opensearch.node.Node$$Lambda$2602/0x0000004000a253b8.apply(Unknown Source)
       app//org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136)
       app//org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
       app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
       app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
       app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
       app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
       java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.9/java.lang.Thread.run(Thread.java:840)

Describe the solution you'd like

All we care about is the RED/YELLOW status, which can be derived from the unassigned shards alone, without iterating over all shards.
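
A minimal sketch of that idea, not the OpenSearch implementation: `UnassignedShard`, `Status` and `statusFrom` are hypothetical names made up for illustration, and the rule used here (any unassigned primary means RED, otherwise any unassigned replica means YELLOW) is a simplifying assumption. The real health computation may treat some cases differently, for example shards that are still initializing, which this sketch ignores.

```java
import java.util.List;

class HealthFromUnassigned {

    /** Health colours; a stand-in for the cluster health status, not the OpenSearch enum. */
    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical stand-in for one unassigned shard entry (not ShardRouting). */
    static final class UnassignedShard {
        final String index;
        final int shardId;
        final boolean primary;

        UnassignedShard(String index, int shardId, boolean primary) {
            this.index = index;
            this.shardId = shardId;
            this.primary = primary;
        }
    }

    /**
     * Derives a status by scanning only the unassigned shards.
     * Assumption: an unassigned primary means RED, an unassigned replica means YELLOW,
     * and no unassigned shards means GREEN.
     */
    static Status statusFrom(List<UnassignedShard> unassigned) {
        Status status = Status.GREEN;
        for (UnassignedShard shard : unassigned) {
            if (shard.primary) {
                return Status.RED; // cannot get worse, so stop early
            }
            status = Status.YELLOW;
        }
        return status;
    }

    public static void main(String[] args) {
        List<UnassignedShard> unassigned = List.of(
            new UnassignedShard("logs", 0, false),
            new UnassignedShard("logs", 1, true));
        System.out.println(statusFrom(unassigned)); // prints RED
    }
}
```

The cost of this scan is proportional to the number of unassigned shards rather than to the total shard count, which is the saving the request is after when most of the 500k shards are assigned.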

Related component

ShardManagement:Performance

Describe alternatives you've considered

No response

Additional context

No response

peternied commented 3 weeks ago

[Triage - attendees 1 2 3 4 5] @Bukhtawar Thanks for creating this issue