Open shwetathareja opened 6 months ago
This listener is executed on every node, so every node issues its own nodes info request, which is itself a broadcast action. Why does this skip logic need to be executed on every node?
This listener checks whether different versions of the ISM plugin are present in the cluster and, if so, stops execution on the node. Each node does this separately.
As a short-term solution, we can try moving the node info call to a different thread; in Kotlin, that would be a coroutine rather than a thread. The long-term solution is probably for core to provide a mechanism for knowing the cluster upgrade status.
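To make the short-term idea concrete, here is a minimal sketch in plain Java. It is not the plugin's code: `OffloadedListenerSketch`, `run`, and `await` are hypothetical names, a single-threaded executor stands in for the applier thread, a second executor stands in for the coroutine, and a latch simulates a node info call that is still in flight. It only illustrates that, once the listener hands the slow call off, the applier thread can go on to apply the next state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class OffloadedListenerSketch {

    /** Waits on a latch with a timeout so a mistake fails fast instead of hanging. */
    static void await(CountDownLatch latch) {
        try {
            if (!latch.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for latch");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Applies two states on one "applier" thread and returns the recorded event order. */
    static List<String> run() {
        // Assumption: one thread models the single-threaded cluster state applier.
        ExecutorService applier = Executors.newSingleThreadExecutor();
        // Assumption: a separate thread models the coroutine doing the slow call.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        List<String> events = new CopyOnWriteArrayList<>();
        CountDownLatch slowCallMayFinish = new CountDownLatch(1);
        CountDownLatch secondApplied = new CountDownLatch(1);
        CountDownLatch sweepDone = new CountDownLatch(1);
        try {
            applier.execute(() -> {
                events.add("state-1 applied");
                // The listener only schedules the expensive work and returns at once.
                worker.execute(() -> {
                    await(slowCallMayFinish); // simulated slow broadcast call
                    events.add("version sweep finished");
                    sweepDone.countDown();
                });
            });
            applier.execute(() -> {
                events.add("state-2 applied");
                secondApplied.countDown();
            });
            // The second state is applied while the slow call is still pending.
            await(secondApplied);
            slowCallMayFinish.countDown();
            await(sweepDone);
            return new ArrayList<>(events);
        } finally {
            applier.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

If the simulated call ran inline in the first applier task instead, the second state could not be applied until the call returned, which is the blocking behavior described in this issue.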
What is the bug?
For any cluster state change, plugins can attach listeners that run the desired code. These listeners are executed on the ClusterApplierService#updateTask thread pool, which is single-threaded, so a slow listener blocks the processing and applying of any new state updates. ISM has also registered a listener, and it makes an expensive node info call, which is a broadcast call. On top of that, this listener is executed on every node, which looks like unnecessary overhead. In one large cluster, it was observed that the applier thread on multiple nodes was busy doing this for minutes.
https://github.com/opensearch-project/index-management/blob/6270b2cbd0a344f0e26978bbb1b35aac9cba6d20/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/PluginVersionSweepCoordinator.kt#L59
How can one reproduce the bug?
Any time the cluster is bootstrapped for the first time, or nodes join or leave the cluster, this listener is executed.
What is the expected behavior?
Do you have any screenshots?