[RFC] Migrating Metrics From Performance Analyzer to OpenTelemetry Framework

ansjcy commented 8 months ago

Introduction

This RFC proposes migrating metrics from the OpenSearch Performance Analyzer (a plugin designed to gather system and application-level metrics) to the OpenTelemetry framework, in light of the recent integration of OpenTelemetry as a trace/metrics collector within OpenSearch, and eventually deprecating the Performance Analyzer plugin.

Background

OpenSearch Performance Analyzer has been a valuable plugin within OpenSearch, offering insights into system and application-level performance. With the advancement of observability frameworks and the community's move towards standardization, OpenSearch has integrated OpenTelemetry as a metrics collector. This gives us an opportunity to streamline our metrics collection workflow and framework, and to improve its maintainability and performance.

Motivation

Unified Metrics Collection: The integration of OpenTelemetry provides us with a comprehensive metrics collection framework that can potentially replace the functionality of Performance Analyzer. Consolidating our metrics collection tools will simplify the architecture and reduce the complexity of our system.
Reduce Maintenance Overhead: Maintaining two metrics collection tools is resource-intensive. PA collects metrics with a framework that we built ourselves, which is not an industry standard. By focusing our efforts on a single framework (OpenTelemetry), we can ensure that we provide the best possible support and updates.
Community Adoption: OpenTelemetry has gained significant traction in the community, leading to more integrations, tools, and extensions that our users can benefit from.
Performance: OpenTelemetry is a widely-adopted project with optimizations and improvements being made continuously. Leveraging its capabilities can potentially offer better performance and resource utilization compared to maintaining our custom solution (PA/RCA).

Proposal

Deprecation Notice: We can begin by adding a deprecation notice on the Performance Analyzer's README and documentation. Inform users about the planned deprecation and the timeline for discontinuing support.
Migration Plan: Come up with a detailed migration plan which covers:
- What the different types of metrics collected by Performance Analyzer are.
- For each category, how to get the exact same metrics previously gathered by Performance Analyzer using OpenTelemetry.
- For the downstream components that consume PA metrics, how to maintain consistency.
- Run the new metrics system in shadow mode for some time (?).
Deprecation: Once we are confident in the new metrics collection workflow, officially deprecate the Performance Analyzer. This includes:
- Stopping active development and support.
- Archiving the repository or clearly marking it as deprecated.
Removal: In a subsequent major release of OpenSearch, completely remove the Performance Analyzer from the codebase and documentation.
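
One way to approach the shadow-mode step in the migration plan is to run both pipelines side by side and compare the values they report for the same metrics. The following is a minimal sketch, assuming each pipeline's output can be reduced to a map from metric name to value. The class name `ShadowModeComparator`, the metric names, and the 5% tolerance are hypothetical and are not part of PA or OpenTelemetry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: compares metric snapshots from the old and new pipelines. */
final class ShadowModeComparator {

    /**
     * Returns the names of metrics that are missing from either snapshot, or whose
     * values differ by more than the given relative tolerance (e.g. 0.05 for 5%).
     */
    static List<String> mismatches(Map<String, Double> paSnapshot,
                                   Map<String, Double> otelSnapshot,
                                   double relativeTolerance) {
        // Union of metric names, sorted so the report is deterministic.
        TreeSet<String> names = new TreeSet<>(paSnapshot.keySet());
        names.addAll(otelSnapshot.keySet());

        List<String> result = new ArrayList<>();
        for (String name : names) {
            Double expected = paSnapshot.get(name);
            Double actual = otelSnapshot.get(name);
            if (expected == null || actual == null) {
                result.add(name); // present in only one pipeline
                continue;
            }
            double scale = Math.max(Math.abs(expected), Math.abs(actual));
            if (scale > 0 && Math.abs(expected - actual) / scale > relativeTolerance) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, Double> pa = Map.of("heap.used", 100.0, "gc.count", 7.0, "disk.util", 0.50);
        Map<String, Double> otel = Map.of("heap.used", 102.0, "gc.count", 7.0, "net.rx", 10.0);
        System.out.println(mismatches(pa, otel, 0.05)); // prints [disk.util, net.rx]
    }
}
```

A relative tolerance is used here because the two pipelines are unlikely to sample at exactly the same instant; whether that is acceptable for a given metric would need to be decided per category.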

Appendix

Categories of PA (the plugin) Collectors

Host level metrics: collected by directly reading the host/node level metrics.
Service level metrics: collected directly from the OpenSearch application. These use the OpenSearchResource object, which is created when the PA plugin is loaded and contains OpenSearch related data such as threadPool, environment, indicesService etc.
- Metrics with reflection: involve using Java reflection to get metrics from a library.
JVM level metrics: collected from JVM directly by using GarbageCollectorMXBean etc.
Service level metrics with API: collected by calling an API.
PA internal metrics: collected from the PA/RCA framework itself, not related to OpenSearch Core.
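
To illustrate the JVM level category, here is a minimal sketch that reads heap and GC figures from the standard `java.lang.management` MXBeans. The class name `JvmMetricsSketch` and the metric key names are hypothetical; this is not the PA collector code, only an example of the same underlying JDK calls.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical example: one-shot snapshot of JVM heap and GC metrics. */
final class JvmMetricsSketch {

    static Map<String, Long> snapshot() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        // Heap usage comes from the MemoryMXBean.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        metrics.put("heap.used.bytes", heap.getUsed());
        metrics.put("heap.committed.bytes", heap.getCommitted());
        metrics.put("heap.max.bytes", heap.getMax()); // -1 if the maximum is undefined

        // One GarbageCollectorMXBean per collector; both values are cumulative
        // since JVM start and may be -1 if the collector does not report them.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            metrics.put("gc." + gc.getName() + ".count", gc.getCollectionCount());
            metrics.put("gc." + gc.getName() + ".time.ms", gc.getCollectionTime());
        }
        return metrics;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Because these beans are available in any JVM, this category does not depend on plugin-specific plumbing, which is consistent with the "Core" migration target listed for the JVM collectors.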

Collector Name	Type	How metrics are collected	Migrate to?	Feasible?
OSMetricsCollector	Host level metrics	Several customized data generators are created to gather CPU, disk, and scheduling related metrics by reading the "/proc//task//*" files in a blocking way for all threads on the node. The metrics are then gathered by the `OSMetricsCollector` and forwarded to the JSON file in shared memory.	Other agent outside of the OpenSearch process / OTel Collector	Feasible
DisksCollector	Host level metrics	Customized data generators are created to gather disk related metrics by reading the "/proc/diskstats" file in a blocking way. The metrics are then gathered by the `DisksCollector` and forwarded to the JSON file in shared memory.	Other agent outside of the OpenSearch process / OTel Collector	Feasible
NetworkInterfaceCollector	Host level metrics	Customized data generators are created to gather network related metrics by reading the "/proc/net/snmp", "/proc/net/snmp6", and "/proc/net/dev" files in a blocking way. The metrics are then gathered by the `NetworkInterfaceCollector` and forwarded to the JSON file in shared memory.	Other agent outside of the OpenSearch process / OTel Collector	Feasible
HeapMetricsCollector	JVM level metrics	Uses the GarbageCollectorMXBean and MemoryMXBean in the java.lang.management library to get JVM related metrics	Core	Feasible
GCInfoCollector	JVM level metrics	Gets GC related info from GarbageCollectorMXBeans	Core	Feasible
CircuitBreakerCollector	Service level metrics	From the circuitBreakerService passed from OpenSearch	Core	Feasible
NodeDetailsCollector	Service level metrics	From the clusterService passed from OpenSearch	Core	Feasible
ClusterManagerServiceMetrics	Service level metrics	Gets the pending tasks stats from clusterService.clusterManagerService	Core	Feasible
ShardStateCollector	Service level metrics	Gets shard state metrics for each shard in each index using the `routingTable` data within the clusterService passed from OpenSearch	Core	Feasible, but need to check the CPU level metrics coming from threads.
ElectionTermCollector	Service level metrics	Gets the election term metric from the clusterService passed from OpenSearch	Core	Feasible
ThreadPoolMetricsCollector	Service level metrics (with reflection)	Metrics are obtained by calling the `stats()` function on the threadpool object passed from OpenSearch. Java reflection is used to get the capacity of the threadpool	Core	Feasible. Migrating to core means we can directly send threadpool level metrics without using reflection.
CacheConfigMetricsCollector	Service level metrics (with reflection)	From the indicesService passed from OpenSearch; uses Java reflection to ensure backward compatibility. The indicesService is provided by DI and the binding is defined here	Core	Feasible
NodeStatsAllShardsMetricsCollector	Service level metrics (with reflection)	From the indicesService passed from OpenSearch; gets the increment of the high level stats for all shards by calculating the diff against the previous shard stats	Core	Feasible
NodeStatsFixedShardsMetricsCollector	Service level metrics (with reflection)	Similar to NodeStatsAllShardsMetricsCollector: from the indicesService passed from OpenSearch, gets more detailed metrics for the shards specified by the user with shardsPerCollection	Core	Feasible
ClusterManagerServiceEventMetrics	Service level metrics (with reflection)	Gets cluster manager task event data from the clusterManagerService object passed from OpenSearch	Core	Feasible
ClusterManagerThrottlingMetricsCollector	Service level metrics (with reflection)	Gets throttling metrics via reflection on org.opensearch.action.support.clustermanager.ClusterManagerThrottlingRetryListener, from the clusterService passed from OpenSearch	Core	Feasible
ClusterApplierServiceStatsCollector	Service level metrics (with reflection)	"ClusterApplierServiceStats in ES is a tracker for total time taken to apply cluster state and the number of times it has failed". This collector uses the ClusterApplierService from OpenSearch.	Core	Feasible
AdmissionControlMetricsCollector	Service level metrics (with reflection)	Uses the admissionController from com.sonian.opensearch.http.jetty.throttling.JettyAdmissionControlService in OpenSearch to get AdmissionControl related metrics.	Core	Feasible
ShardIndexingPressureMetricsCollector	Service level metrics (with reflection)	Gets indexing pressure related metrics from the clusterService passed from OpenSearch, using classes such as org.opensearch.index.ShardIndexingPressureStore, org.opensearch.index.IndexingPressure, and org.opensearch.index.ShardIndexingPressure	Core	Feasible
FaultDetectionMetricsCollector	PA internal metrics	PA internal queue fault metrics? Gets the FaultDetectionHandlerMetricsQueue from org.opensearch.performanceanalyzer.handler.ClusterFaultDetectionStatsHandler and emits metrics based on each entry.	Deprecate	Feasible
StatsCollector	PA internal metrics	PA internal metrics stats collector	Deprecate	Feasible
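
Some collectors in the table (for example NodeStatsAllShardsMetricsCollector) report the increment of a cumulative statistic since the previous collection rather than the raw total. A minimal sketch of that diffing step follows, assuming stats are keyed by a shard identifier string. `ShardStatsDelta`, the key format, and the reset handling are hypothetical choices for illustration and do not mirror the PA classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical example: turns cumulative per-shard totals into per-interval deltas. */
final class ShardStatsDelta {
    private final Map<String, Long> previous = new HashMap<>();

    /** Returns the per-key increase since the last call and remembers the current totals. */
    Map<String, Long> update(Map<String, Long> currentTotals) {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, Long> entry : currentTotals.entrySet()) {
            // Keys seen for the first time are treated as starting from zero.
            long before = previous.getOrDefault(entry.getKey(), 0L);
            long now = entry.getValue();
            // A total lower than the previous one is assumed to mean the counter
            // was reset; in that case the new total is reported as the delta.
            deltas.put(entry.getKey(), now >= before ? now - before : now);
        }
        // Replace the remembered totals so shards that disappeared do not accumulate.
        previous.clear();
        previous.putAll(currentTotals);
        return deltas;
    }

    public static void main(String[] args) {
        ShardStatsDelta tracker = new ShardStatsDelta();
        tracker.update(Map.of("index-a[0]", 10L));
        Map<String, Long> deltas = tracker.update(Map.of("index-a[0]", 25L, "index-a[1]", 4L));
        System.out.println(deltas.get("index-a[0]")); // prints 15
        System.out.println(deltas.get("index-a[1]")); // prints 4
    }
}
```

If such metrics move to an OpenTelemetry-style counter, a delta of this kind maps naturally onto an "add" call, whereas the raw total would be closer to an observable gauge or cumulative counter.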

Gaganjuneja commented 2 months ago

@ansjcy, thanks for putting this up. Utilizing the OpenSearch telemetry framework for emitting these metrics does seem promising. The PA plugin generators are already well-written, making them easily reusable. Since these metrics are ideally part of a plugin rather than being merged directly into the core, migrating them to the OpenSearch telemetry framework within the PA plugin sounds like a sensible approach.

thoughts here @reta @backslasht @msfroh @khushbr @Bukhtawar

reta commented 2 months ago

Agree with @Gaganjuneja. OpenSearch already collects tons of metrics but exposes them through REST APIs; using the newly developed metric providers, we certainly could unify the approach. Thanks @ansjcy!

backslasht commented 2 months ago

+1, I like the idea of migrating the Performance Analyzer plugin metrics into the OpenTelemetry format.

But I would like to understand a bit more about the deprecation of the "Performance Analyzer" plugin.

Are you suggesting moving the logic into a new plugin that emits these metrics in OTel format and, once that is done, deprecating the "Performance Analyzer" plugin, OR
are you suggesting moving the metrics collection into core?

Gaganjuneja commented 2 months ago

Thank you, @reta and @backslasht, for your prompt responses. My suggestion is to retain these metrics within the "Performance Analyzer" plugin for the time being, given its extensive collection of operating system metrics. To facilitate this, we can pass the MetricsRegistry from the core to the Performance Analyzer plugin and initiate the migration of metrics to utilize an OpenTelemetry-based metrics registry for publishing purposes. Eventually, we can deliberate on the feasibility of integrating this plugin entirely into the core, taking into consideration the implications of backporting as well.
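
The hand-off described above (core passes a registry to the plugin, and the plugin's collectors publish through it) could look roughly like the sketch below. The nested `Counter` and `MetricsRegistry` interfaces are simplified, hypothetical stand-ins defined inside the snippet so that it runs on its own; they are not the actual OpenSearch telemetry interfaces, and the metric name is made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

/** Hypothetical sketch of a plugin collector publishing through a registry supplied by core. */
final class RegistryHandOffSketch {

    // Simplified stand-ins; NOT the real OpenSearch telemetry interfaces.
    interface Counter {
        void add(double value);
    }

    interface MetricsRegistry {
        Counter createCounter(String name, String description, String unit);
    }

    /** In-memory registry used only to make the sketch runnable. */
    static final class InMemoryRegistry implements MetricsRegistry {
        private final Map<String, DoubleAdder> totals = new ConcurrentHashMap<>();

        @Override
        public Counter createCounter(String name, String description, String unit) {
            DoubleAdder adder = totals.computeIfAbsent(name, key -> new DoubleAdder());
            return adder::add;
        }

        double total(String name) {
            DoubleAdder adder = totals.get(name);
            return adder == null ? 0.0 : adder.sum();
        }
    }

    /** Plugin-side collector: receives the registry instead of writing its own output files. */
    static final class RejectionCollector {
        private final Counter rejections;

        RejectionCollector(MetricsRegistry registry) {
            this.rejections = registry.createCounter(
                "pa.threadpool.rejected", "Rejected thread pool tasks", "1");
        }

        void onRejected(long count) {
            rejections.add(count);
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry(); // would come from core
        RejectionCollector collector = new RejectionCollector(registry);
        collector.onRejected(3);
        collector.onRejected(2);
        System.out.println(registry.total("pa.threadpool.rejected")); // prints 5.0
    }
}
```

The point of the sketch is the dependency direction: the collector only knows the registry interface, so the same collector logic could later live in the plugin or in core without changing how it publishes.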

dblock commented 2 weeks ago

Catch All Triage - 1 2 3 4 5
