opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Latency metrics in ClusterManager [Appliers, Listeners, Reroute..] #12332

Open gargharsh3134 opened 9 months ago

gargharsh3134 commented 9 months ago

Is your feature request related to a problem? Please describe

With the introduction of the Request Tracing Framework (RTF), which uses OpenTelemetry (OTel), metrics (histograms and counters) can now be published and used to track high-latency operations. This issue tracks the instrumentation needed to add latency metrics to the ClusterManager, which can help identify scaling bottlenecks.

The following metrics can be added to start with:

  1. Committing any change to the ClusterState involves running Appliers and Listeners, which are supposed to be very lightweight operations. Tracking their latency will help identify bottlenecks that slow down the ClusterManager's processing of the pending tasks queue.

  2. Latency of the reroute operation.

  3. Latency of computing the new cluster state after any change, and the time taken to successfully publish that state to other nodes.

Describe the solution you'd like

OTel Histogram Metrics: Support for histogram-type metrics, added in #12062, can be used to publish the metrics for each use case.
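
As a rough illustration of the idea (not the actual OpenSearch implementation), the timing pattern could look something like the sketch below. `LatencyTimer` and `ApplierLatencyDemo` are hypothetical names, and the `DoubleConsumer` sink is only a stand-in for whatever histogram object the telemetry framework from #12062 exposes; the real API may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

/**
 * Hypothetical helper (not part of OpenSearch): runs a task and reports the
 * elapsed wall-clock time in milliseconds to a sink.
 */
final class LatencyTimer {
    // Assumption: stands in for a histogram's "record a double value" method.
    private final DoubleConsumer sink;

    LatencyTimer(DoubleConsumer sink) {
        this.sink = sink;
    }

    void time(Runnable task) {
        long startNanos = System.nanoTime();
        try {
            task.run();
        } finally {
            // Record even when the task throws, so slow failures stay visible.
            double elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000.0;
            sink.accept(elapsedMillis);
        }
    }
}

final class ApplierLatencyDemo {
    public static void main(String[] args) {
        // In-memory sink for demonstration; a real setup would publish to OTel.
        List<Double> recorded = new ArrayList<>();
        LatencyTimer timer = new LatencyTimer(value -> recorded.add(value));

        // Placeholder for one applier or listener invocation.
        timer.time(() -> { /* apply cluster state */ });

        System.out.println("Recorded samples: " + recorded.size());
    }
}
```

The same wrapper could be placed around each applier, each listener, the reroute call, and the compute and publish steps, with a separate histogram (or a distinguishing attribute) per operation.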

Related component

Cluster Manager

Describe alternatives you've considered

No response

Additional context

No response

peternied commented 8 months ago

[Triage - attendees 1 2 3 4 5 6] @gargharsh3134 Thanks for filing; looking forward to this improvement.