opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] [Potential Issue] Cluster state response handling thread blocked #12820

Open shwetathareja opened 7 months ago

shwetathareja commented 7 months ago

Describe the bug

Observed this during a gradle-check run for https://github.com/opensearch-project/OpenSearch/pull/12813#issuecomment-2011604246

https://build.ci.opensearch.org/job/gradle-check/35542/console

Three generic threads were blocked while processing publication responses (the test ran a 3-node cluster):

Thread[id=5851, name=opensearch[node_t0][generic][T#2], state=BLOCKED, group=TGRP-SearchWeightedRoutingIT]
  2>         at org.opensearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1381)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:442)
  2>         at org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext$3.handleResponse(PublicationTransportHandler.java:433)
  2>         at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:72)
  2>         at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1501)
  2>         at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:420)
  2>         at org.opensearch.transport.InboundHandler.lambda$handleResponse$3(InboundHandler.java:414)
  2>         at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)

All of these threads were probably waiting on the mutex below:

https://github.com/opensearch-project/OpenSearch/blob/f3d2beee637f63e38c8f26dbcee9f2a82f9c87b6/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java#L1376-L1392
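To illustrate the suspected pattern, here is a minimal, self-contained sketch of how several handler threads end up in the BLOCKED state when another thread holds a shared monitor. It is not the Coordinator code: `MUTEX`, `onResponse`, and the thread names are hypothetical stand-ins, and the critical section is left empty.

```java
import java.util.ArrayList;
import java.util.List;

public class MutexContentionSketch {
    // Hypothetical stand-in for the shared mutex; not the real Coordinator field.
    private static final Object MUTEX = new Object();

    // Hypothetical stand-in for a publication response handler that needs the mutex.
    static void onResponse() {
        synchronized (MUTEX) {
            // Critical section (empty in this sketch).
        }
    }

    /** Starts the given number of handler threads while the mutex is held and counts how many reach BLOCKED. */
    static int countBlockedHandlers(int handlers) {
        List<Thread> threads = new ArrayList<>();
        int blocked = 0;
        try {
            synchronized (MUTEX) {
                for (int i = 0; i < handlers; i++) {
                    Thread t = new Thread(MutexContentionSketch::onResponse, "generic-T#" + i);
                    t.start();
                    threads.add(t);
                }
                for (Thread t : threads) {
                    // Each handler stalls at monitor entry because this thread still owns MUTEX.
                    while (t.getState() != Thread.State.BLOCKED) {
                        Thread.sleep(1);
                    }
                    blocked++;
                }
            }
            // The mutex is released here, so the handlers can finish.
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("blocked handlers: " + countBlockedHandlers(3));
    }
}
```

A thread dump taken while the outer `synchronized` block is active would show every handler as BLOCKED at `onResponse`, similar in shape to the trace above. The dump alone does not say how long they waited.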

Related component

Cluster Manager

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Investigate which code path was holding the mutex and whether the time it holds the lock can be reduced. Right now, it is not clear how long the threads were blocked.
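One possible way to find the holder is to ask the JVM which thread owns the monitor that each blocked thread is waiting for. The sketch below uses the standard `java.lang.management.ThreadMXBean` API. The class name, the `holder` and `waiter` threads, and the report format are assumptions made for illustration, and nothing here is wired into the OpenSearch test framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LockOwnerReport {
    /** Lists every BLOCKED thread together with the lock it waits for and that lock's current owner. */
    static List<String> describeBlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }

    /** Hypothetical demo: one thread holds a monitor, a second one blocks on it, then the report is taken. */
    static List<String> runDemo() {
        final Object mutex = new Object();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (mutex) {
                held.countDown();
                try {
                    release.await(); // keep the monitor until the report has been taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (mutex) {
                // Nothing to do; the point is the wait at monitor entry.
            }
        }, "waiter");
        try {
            holder.start();
            held.await();
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
            List<String> report = describeBlockedThreads();
            release.countDown();
            holder.join();
            waiter.join();
            return report;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

The report names the owner at the instant of the dump only. Measuring how long the lock was held would need additional timing around the critical section, which this sketch does not attempt.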

Additional Details

No response

peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5 6 7] @shwetathareja Thanks for creating this issue