Three generic threads were blocked while processing publication responses (the test ran a 3-node cluster):
Thread[id=5851, name=opensearch[node_t0][generic][T#2], state=BLOCKED, group=TGRP-SearchWeightedRoutingIT]
2> at org.opensearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1381)
2> at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:442)
2> at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:433)
2> at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:72)
2> at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1501)
2> at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:420)
2> at org.opensearch.transport.InboundHandler.lambda$handleResponse$3(InboundHandler.java:414)
2> at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
2> at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2> at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
Describe the bug
Observed this during a gradle-check run for https://github.com/opensearch-project/OpenSearch/pull/12813#issuecomment-2011604246
https://build.ci.opensearch.org/job/gradle-check/35542/console
All of the blocked threads were probably waiting on the mutex in the code linked below:
https://github.com/opensearch-project/OpenSearch/blob/f3d2beee637f63e38c8f26dbcee9f2a82f9c87b6/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java#L1376-L1392
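One way to confirm which thread owns the contended monitor is the JDK's `ThreadMXBean`. It reports the lock owner for each blocked thread and, once contention monitoring is enabled, the accumulated blocked time. The sketch below is not OpenSearch code: the class `BlockedThreadReport`, its methods, and the `demo-holder`/`demo-waiter` threads are hypothetical, and the plain `Object` mutex only stands in for the one in `Coordinator`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadReport {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Contention statistics are off by default; without them getBlockedTime() returns -1. */
    public static void enableContentionMonitoring() {
        if (THREADS.isThreadContentionMonitoringSupported()) {
            THREADS.setThreadContentionMonitoringEnabled(true);
        }
    }

    /** Returns one line per BLOCKED thread: the monitor it waits on and the thread that owns it. */
    public static List<String> describeBlockedThreads() {
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : THREADS.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            report.add(String.format(
                "%s blocked on %s, held by %s (id=%d); blocked %d ms over %d contentions",
                info.getThreadName(), info.getLockName(),
                info.getLockOwnerName(), info.getLockOwnerId(),
                info.getBlockedTime(), info.getBlockedCount()));
        }
        return report;
    }

    /** Hypothetical reproduction: one thread holds a mutex while a second one tries to enter it. */
    public static List<String> demo() {
        enableContentionMonitoring();
        final Object mutex = new Object(); // stand-in for the coordinator mutex (assumption)
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (mutex) {
                held.countDown();
                try {
                    release.await(); // keep the monitor until the report has been taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-holder");
        Thread waiter = new Thread(() -> {
            synchronized (mutex) {
                // nothing to do; entering the monitor is the point
            }
        }, "demo-waiter");
        try {
            holder.start();
            held.await();
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            List<String> report = describeBlockedThreads();
            release.countDown();
            holder.join();
            waiter.join();
            return report;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

A plain `jstack` thread dump shows the same owner information as `- waiting to lock <...>` and `- locked <...>` entries, so a full dump from the failing run, if one was captured, may already identify the holder.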
Related component
Cluster Manager
To Reproduce
Expected behavior
Investigate which code path was holding the mutex and whether the time it holds the lock can be reduced. Right now, it is not clear how long the threads were blocked.
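To measure how long each code path waits for and holds the lock, the guarded section could be wrapped in a small timing helper. This is a minimal sketch under assumptions: `TimedMutex`, `withLock`, the label argument, and the warning threshold are hypothetical names for illustration, not part of OpenSearch, and the real `Coordinator` synchronizes on its own mutex object directly.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class TimedMutex {

    private final Object mutex = new Object();
    private final long warnThresholdNanos;
    // Written only while holding the mutex; volatile so readers outside it see the latest value.
    private volatile long maxHoldNanos;

    public TimedMutex(long warnThresholdMillis) {
        this.warnThresholdNanos = TimeUnit.MILLISECONDS.toNanos(warnThresholdMillis);
    }

    /** Runs body under the mutex and records how long the caller waited for and held the lock. */
    public <T> T withLock(String label, Supplier<T> body) {
        long requested = System.nanoTime();
        synchronized (mutex) {
            long acquired = System.nanoTime();
            try {
                return body.get();
            } finally {
                long held = System.nanoTime() - acquired;
                long waited = acquired - requested;
                maxHoldNanos = Math.max(maxHoldNanos, held);
                if (held > warnThresholdNanos) {
                    System.err.printf("[%s] held mutex for %d ms after waiting %d ms%n",
                        label,
                        TimeUnit.NANOSECONDS.toMillis(held),
                        TimeUnit.NANOSECONDS.toMillis(waited));
                }
            }
        }
    }

    /** Longest observed hold time in nanoseconds. */
    public long maxHoldNanos() {
        return maxHoldNanos;
    }
}
```

Labelling each call site (for example a hypothetical `"publish-response"` versus `"join"`) would show which path dominates the hold time, which is the information this issue is missing.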
Additional Details
No response