opensearch-project / performance-analyzer

📈 Get detailed performance metrics from your cluster independently of the Java Virtual Machine (JVM)
https://opensearch.org/docs/latest/monitoring-plugins/pa/index/
Apache License 2.0

[RFC] Migrating Metrics From Performance Analyzer to OpenTelemetry Framework #585

Open ansjcy opened 8 months ago

ansjcy commented 8 months ago

Introduction

This RFC proposes migrating metrics from the OpenSearch Performance Analyzer (a plugin designed to gather system- and application-level metrics) to the OpenTelemetry framework, in light of the recent integration of OpenTelemetry as a trace/metrics collector within OpenSearch, and eventually deprecating the Performance Analyzer plugin.

Background

OpenSearch Performance Analyzer has been a valuable plugin within OpenSearch, offering insights into system- and application-level performance. With the advancement of observability frameworks and the community's move towards standardization, OpenSearch has integrated OpenTelemetry as a metrics collector. We now have an opportunity to streamline our metrics collection framework and improve its maintainability and performance.

Motivation

  1. Unified Metrics Collection: The integration of OpenTelemetry provides us with a comprehensive metrics collection framework that can potentially replace the functionality of Performance Analyzer. Consolidating our metrics collection tools will simplify the architecture and reduce the complexity of our system.
  2. Reduce Maintenance Overhead: Maintaining two metrics collection tools is resource-intensive. PA collects metrics with a custom framework that we built ourselves and that is not an industry standard. By focusing our efforts on a single framework (OpenTelemetry), we can ensure that we provide the best possible support and updates.
  3. Community Adoption: OpenTelemetry has gained significant traction in the community, leading to more integrations, tools, and extensions that our users can benefit from.
  4. Performance: OpenTelemetry is a widely-adopted project with optimizations and improvements being made continuously. Leveraging its capabilities can potentially offer better performance and resource utilization compared to maintaining our custom solution (PA/RCA).

Proposal

Appendix

Categories of PA (the plugin) Collectors

| Collector Name | Type | Details: how are metrics collected? | Migrate to ..? | Feasible or not? |
| --- | --- | --- | --- | --- |
| OSMetricsCollector | Host level metrics | Several customized data generators are created to gather CPU, disk, and scheduling related metrics by reading the "/proc//task//*" files in a blocking way for all threads on the node. The metrics are then gathered by the OSMetricsCollector and forwarded to the JSON file in shared memory. | Other agent outside of OpenSearch process / OPTL collector | Feasible |
| DisksCollector | Host level metrics | Customized data generators are created to gather disk related metrics by reading the "/proc/diskstats" file in a blocking way. The metrics are then gathered by the DisksCollector and forwarded to the JSON file in shared memory. | Other agent outside of OpenSearch process / OPTL collector | Feasible |
| NetworkInterfaceCollector | Host level metrics | Customized data generators are created to gather network related metrics by reading the "/proc/net/snmp", "/proc/net/snmp6", and "/proc/net/dev" files in a blocking way. The metrics are then gathered by the NetworkInterfaceCollector and forwarded to the JSON file in shared memory. | Other agent outside of OpenSearch process / OPTL collector | Feasible |
| HeapMetricsCollector | JVM level metrics | Uses the GarbageCollectorMXBean and MemoryMXBean in the java.lang.management library to get JVM related metrics. | Core | Feasible |
| GCInfoCollector | JVM level metrics | Gets GC related info from GarbageCollectorMXBeans. | Core | Feasible |
| CircuitBreakerCollector | Service level metrics | From circuitBreakerService passed from OpenSearch. | Core | Feasible |
| NodeDetailsCollector | Service level metrics | From clusterService passed from OpenSearch. | Core | Feasible |
| ClusterManagerServiceMetrics | Service level metrics | Gets the pending tasks stats from clusterService.clusterManagerService. | Core | Feasible |
| ShardStateCollector | Service level metrics | Gets shard state metrics for each shard in each index using the routingTable data within the clusterService passed from OpenSearch. | Core | Feasible, but need to check the CPU level metrics coming from threads. |
| ElectionTermCollector | Service level metrics | Gets the election term metric from clusterService passed from OpenSearch. | Core | Feasible |
| ThreadPoolMetricsCollector | Service level metrics (with reflection) | Metrics are obtained by calling the stats() function on the threadpool object passed from OpenSearch. Java reflection is used to get the capacity of the threadpool. | Core | Feasible. Migrating to core means we can directly send threadpool level metrics without using reflection. |
| CacheConfigMetricsCollector | Service level metrics (with reflection) | From indicesService passed from OpenSearch; uses Java reflection to ensure backward compatibility. The indicesService is provided by DI and the binding is defined here. | Core | Feasible |
| NodeStatsAllShardsMetricsCollector | Service level metrics (with reflection) | From indicesService passed from OpenSearch; gets the increment of the high level stats for all shards by calculating the diff against the previous shard stats. | Core | Feasible |
| NodeStatsFixedShardsMetricsCollector | Service level metrics (with reflection) | Similar to NodeStatsAllShardsMetricsCollector: from indicesService passed from OpenSearch, gets more detailed metrics for some specified shards passed by the user with shardsPerCollection. | Core | Feasible |
| ClusterManagerServiceEventMetrics | Service level metrics (with reflection) | Gets cluster manager task event data from the clusterManagerService object passed from OpenSearch. | Core | Feasible |
| ClusterManagerThrottlingMetricsCollector | Service level metrics (with reflection) | Gets throttling metrics by reflection on org.opensearch.action.support.clustermanager.ClusterManagerThrottlingRetryListener, from the clusterService passed from OpenSearch. | Core | Feasible |
| ClusterApplierServiceStatsCollector | Service level metrics (with reflection) | "ClusterApplierServiceStats is ES is a tracker for total time taken to apply cluster state and the number of times it has failed". This collector uses the ClusterApplierService from OpenSearch. | Core | Feasible |
| AdmissionControlMetricsCollector | Service level metrics (with reflection) | Uses the admissionController from com.sonian.opensearch.http.jetty.throttling.JettyAdmissionControlService in OpenSearch to get AdmissionControl related metrics. | Core | Feasible |
| ShardIndexingPressureMetricsCollector | Service level metrics (with reflection) | Gets indexing pressure related metrics from the clusterService passed from OpenSearch, using classes like org.opensearch.index.ShardIndexingPressureStore, org.opensearch.index.IndexingPressure, and org.opensearch.index.ShardIndexingPressure. | Core | Feasible |
| FaultDetectionMetricsCollector | PA internal metrics | PA internal queue fault metrics? Gets the FaultDetectionHandlerMetricsQueue from org.opensearch.performanceanalyzer.handler.ClusterFaultDetectionStatsHandler and emits metrics based on each entry. | Deprecate | Feasible |
| StatsCollector | PA internal metrics | PA internal metrics stats collector. | Deprecate | Feasible |
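
For the JVM level rows above (HeapMetricsCollector, GCInfoCollector), the underlying reads are plain java.lang.management calls, which is part of why moving them to core looks feasible. Below is a minimal sketch of those reads. The JvmMetricsSnapshot class and the metric names are hypothetical and chosen only for illustration; they are not taken from PA or OpenSearch, and how the values would be published is left out.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not a class from PA or OpenSearch.
class JvmMetricsSnapshot {

    // Reads current heap usage from the platform MemoryMXBean.
    static Map<String, Long> heapMetrics() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("heap_used_bytes", heap.getUsed());
        metrics.put("heap_committed_bytes", heap.getCommitted());
        metrics.put("heap_max_bytes", heap.getMax()); // -1 if the maximum is undefined
        return metrics;
    }

    // Reads the cumulative collection count and time (ms) for each garbage collector.
    static Map<String, long[]> gcMetrics() {
        Map<String, long[]> metrics = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Either value may be -1 if the collector does not report it.
            metrics.put(gc.getName(), new long[] {gc.getCollectionCount(), gc.getCollectionTime()});
        }
        return metrics;
    }

    public static void main(String[] args) {
        heapMetrics().forEach((name, value) -> System.out.println(name + " = " + value));
        gcMetrics().forEach((name, v) ->
            System.out.println(name + ": count=" + v[0] + ", time_ms=" + v[1]));
    }
}
```

Because the GC values are cumulative, a collector that wants per-interval numbers would still need to keep the previous reading and report the difference.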
Gaganjuneja commented 2 months ago

@ansjcy, thanks for putting this up. Utilizing the OpenSearch telemetry framework for emitting these metrics does seem promising. The PA plugin generators are already well-written, making them easily reusable. Since these metrics are ideally part of a plugin rather than being merged directly into the core, migrating them to the OpenSearch telemetry framework within the PA plugin sounds like a sensible approach.

Thoughts here, @reta @backslasht @msfroh @khushbr @Bukhtawar?

reta commented 2 months ago

Agree with @Gaganjuneja. OpenSearch already collects tons of metrics but exposes them through REST APIs; using the newly developed metric providers, we certainly could unify the approach. Thanks @ansjcy!

backslasht commented 2 months ago

+1, I like the idea of migrating the Performance Analyzer plugin metrics into the OpenTelemetry format.

But I would like to understand a bit more about the deprecation of the "Performance Analyzer" plugin.

  1. Are you suggesting moving the logic into a new plugin that will emit these metrics in OTel format and, once that is done, deprecating the "Performance Analyzer" plugin, OR
  2. are you suggesting moving the metrics collection into core?
Gaganjuneja commented 2 months ago

Thank you, @reta and @backslasht, for your prompt responses. My suggestion is to retain these metrics within the "Performance Analyzer" plugin for the time being, given its extensive collection of operating system metrics. To facilitate this, we can pass the MetricsRegistry from the core to the Performance Analyzer plugin and initiate the migration of metrics to utilize an OpenTelemetry-based metrics registry for publishing purposes. Eventually, we can deliberate on the feasibility of integrating this plugin entirely into the core, taking into consideration the implications of backporting as well.
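
The hand-off described here can be sketched roughly as follows. The MetricsRegistry and Counter interfaces below are simplified stand-ins written for this example so that it compiles on its own; the real types in the OpenSearch telemetry framework may have different signatures. The collector class, the in-memory registry, and the metric name are likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a counter handed out by the core registry (assumption).
interface Counter {
    void add(double value);
}

// Simplified stand-in for the registry that core would pass to the plugin (assumption).
interface MetricsRegistry {
    Counter createCounter(String name, String description, String unit);
}

// In-memory registry used only to make the sketch runnable.
class InMemoryMetricsRegistry implements MetricsRegistry {
    final Map<String, Double> totals = new HashMap<>();

    @Override
    public Counter createCounter(String name, String description, String unit) {
        totals.put(name, 0.0);
        // Each add() accumulates into the running total for this metric name.
        return value -> totals.merge(name, value, Double::sum);
    }
}

// Hypothetical PA-side collector that publishes through the injected registry
// instead of writing to PA's own shared-memory files.
class ThrottledTasksCollector {
    private final Counter throttledTasks;

    ThrottledTasksCollector(MetricsRegistry registry) {
        this.throttledTasks = registry.createCounter(
            "cluster_manager_throttled_tasks", "Cluster manager tasks that were throttled", "1");
    }

    // Called once per collection interval with the newly observed count.
    void collect(int newlyThrottled) {
        throttledTasks.add(newlyThrottled);
    }
}

class MetricsRegistryDemo {
    public static void main(String[] args) {
        InMemoryMetricsRegistry registry = new InMemoryMetricsRegistry();
        ThrottledTasksCollector collector = new ThrottledTasksCollector(registry);
        collector.collect(3);
        collector.collect(2);
        System.out.println(registry.totals);
    }
}
```

The point of the constructor injection is that the collector logic stays inside the plugin while the choice of exporter stays with whichever registry core provides.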

dblock commented 2 weeks ago

Catch All Triage - 1 2 3 4 5