JMX implementation : feature parity for target systems

open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java

https://opentelemetry.io

Apache License 2.0

JMX implementation : feature parity for target systems #12158

Open SylvainJuge opened 2 months ago

SylvainJuge commented 2 months ago

JMX Insights supports some values for otel.jmx.target.system, those are defined in YAML files here.

JMX Gatherer (in contrib) supports more values of otel.jmx.target.system, those are defined in Groovy scripts here.

While the Groovy scripts are convenient, moving to YAML seems a more future-proof solution:

removes the security risk of having executable Groovy scripts
YAML syntax is already widespread and usually does not require Java/Groovy knowledge
YAML syntax could later allow inlining the configuration in a global OpenTelemetry YAML configuration once one is available; for now it has to be stored in a separate file.

Merging both implementations and bringing them to feature parity means that we have to attempt to migrate/align all of the JMX Gatherer supported systems and ensure they can be implemented with YAML. Doing so will highlight any missing features of the YAML implementation, which we can then add.

Once the alignment is complete, we should then be able to start on the next step: building a "JMX Scraper" in contrib based on the YAML implementation in instrumentation.

For each system listed below, we need to ensure the following with JMX Insights

add YAML if system is not supported yet
convert groovy metrics to their YAML equivalent
deal with any found inconsistency for existing metrics by choosing to
- leave them as-is
- fix YAML or Groovy definitions (or both)
add any missing feature to YAML implementation if needed

List of systems to cover:

[ ] activemq activemq.groovy
[ ] cassandra cassandra.groovy
- mapping differences, nothing available in YAML
[ ] hadoop hadoop.groovy
- mapping differences
[ ] hbase hbase.groovy
- mapping differences, there is no YAML definition
[ ] jetty jetty.groovy
- mapping differences
[ ] jvm jvm.groovy
- mapping differences
[ ] kafka kafka.groovy
- mapping seems identical (but to be checked in detail)
[ ] kafka-consumer kafka-consumer.groovy
- no mapping in YAML
[ ] kafka-producer kafka-producer.groovy
- no mapping in YAML
[ ] solr solr.groovy
- no mapping in YAML
[ ] tomcat tomcat.groovy
- mapping differences
[ ] wildfly wildfly.groovy
- mapping differences

Once feature parity is achieved and JMX Scraper can capture both:

current JMX Gatherer metrics
current JMX Insight metrics (maybe as opt-in)

Then we can start the next step to enhance and align the metrics, as in the initial attempt in https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/11621

When doing so, special care should be taken to ensure that we conform to current guidelines for metrics defined here, for example:

units using {noun} instead of 1
metric name with a namespace
metric attributes with a namespace
maybe defining a common strategy to map existing JMX metrics with minimal definition (for example stay close to MBean attribute name by default, but it's just a random thought)

Follow-up tasks

[ ] open issue to enhance jmx metrics (maybe system per system)

SylvainJuge commented 2 months ago

Ping @robsunday I can't yet co-assign you as you are not part of the otel contributors group.

SylvainJuge commented 2 months ago

For Tomcat, the mapping is not the same but almost equivalent, there isn't anything we need to add for 1:1 support beyond aligning the metrics themselves.

Side note: using JMX object names and attributes is a convenient way to identify elements, as it's a common part between the two mappings.

JMX: Catalina:type=Manager,host=localhost,context=* or Tomcat:type=Manager,host=localhost,context=*
- activeSessions : tomcat.sessions (no attribute) <==> http.server.tomcat.sessions.activeSessions with context attribute
JMX: Catalina:type=GlobalRequestProcessor,name=* or Tomcat:type=GlobalRequestProcessor,name=*
- JMX Gatherer: name => proto_handler, JMX Insight: name => name
- errorCount: tomcat.errors with proto_handler attribute <==> http.server.tomcat.errorCount with name attribute
- requestCount: tomcat.request_count with proto_handler attribute <==> http.server.tomcat.requestCount with name attribute
- maxTime: tomcat.max_time with proto_handler attribute <==> http.server.tomcat.maxTime with name attribute
- processingTime: tomcat.processing_time with proto_handler attribute <==> http.server.tomcat.processingTime with name attribute
- bytesReceived: tomcat.traffic with proto_handler and direction = received|sent <==> http.server.tomcat.traffic with name, direction identical
JMX: Catalina:type=ThreadPool,name=* or Tomcat:type=ThreadPool,name=*
- JMX Gatherer: name => proto_handler, JMX Insight: name => name
- currentThreadCount : tomcat.threads with state = idle <==> http.server.tomcat.threads with name , state identical (state=idle reports the total number of threads, which is a bug mentioned here and here)
- currentThreadsBusy: tomcat.threads with state = busy <==> http.server.tomcat.threads with name and state identical

Given the mapping differences, I think we probably need to leave it as-is for now.
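
To illustrate the differences above, here is a minimal sketch of a YAML rule that would reproduce the JMX Gatherer naming for the thread pool MBeans. It assumes the JMX Insight rule syntax (rules, beans, metricAttribute with param() and const(), mapping). The type and unit values are assumptions and are not copied from tomcat.groovy.

```yaml
# Sketch only: JMX Gatherer-style names expressed as JMX Insight YAML.
rules:
  - beans:
      - Catalina:type=ThreadPool,name=*
      - Tomcat:type=ThreadPool,name=*
    metricAttribute:
      # JMX Gatherer exposes the "name" key property as proto_handler
      proto_handler: param(name)
    mapping:
      currentThreadCount:
        metric: tomcat.threads
        type: updowncounter   # assumed type
        unit: "{threads}"     # assumed unit
        metricAttribute:
          # mirrors the current mapping, including the known state=idle issue
          state: const(idle)
      currentThreadsBusy:
        metric: tomcat.threads
        type: updowncounter   # assumed type
        unit: "{threads}"     # assumed unit
        metricAttribute:
          state: const(busy)
```

Swapping proto_handler for name and prefixing the metric with http.server. would give the JMX Insight flavor of the same rule, which is why nothing new seems to be needed in the YAML implementation for Tomcat.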

robsunday commented 2 months ago

I'll look into Jetty

SylvainJuge commented 2 months ago

For Wildfly, the mapping is also not the same but equivalent, there isn't anything we need to add for 1:1 support beyond aligning the metrics themselves.

JMX: jboss.as:deployment=*,subsystem=undertow
- Both map deployment => deployment attribute
- sessionsCreated: wildfly.session.count <==> wildfly.session.sessionsCreated
- activeSessions: wildfly.session.active <==> wildfly.session.activeSessions
- expiredSessions: wildfly.session.expired <==> wildfly.session.expiredSessions
- rejectedSessions: wildfly.session.rejected <==> wildfly.session.rejectedSessions
JMX: jboss.as:subsystem=undertow,server=*,http-listener=*
- Both map server => server attribute and http-listener => value of listener
- requestCount: wildfly.request.count <==> wildfly.request.requestCount
- processingTime: wildfly.request.time <==> wildfly.request.processingTime
- errorCount: wildfly.request.server_error <==> wildfly.request.errorCount
- bytesSent: wildfly.network.io with extra state = out attribute <==> same
- bytesReceived: wildfly.network.io with extra state = in attribute <==> same
JMX: jboss.as:subsystem=datasources,data-source=*,statistics=pool
- Both map data-source => value of data_source
- ActiveCount : wildfly.jdbc.connection.open with state = active <==> wildfly.db.client.connections.usage with state = used
- IdleCount : wildfly.jdbc.connection.open with state = idle <==> wildfly.db.client.connections.usage with state = idle
- WaitCount: wildfly.jdbc.request.wait <==> wildfly.db.client.connections.WaitCount
JMX: jboss.as:subsystem=transactions
- numberOfTransactions: wildfly.jdbc.transaction.count <==> wildfly.db.client.transaction.NumberOfTransactions
- numberOfSystemRollbacks: wildfly.jdbc.rollback.count with cause = system <==> wildfly.db.client.rollback.count with cause = system
- numberOfResourceRollbacks: wildfly.jdbc.rollback.count with cause = resource <==> wildfly.db.client.rollback.count with cause = resource
- numberOfApplicationRollbacks: wildfly.jdbc.rollback.count with cause = application <==> wildfly.db.client.rollback.count with cause = application
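
As an illustration of the listener mapping above, the following sketch shows how the JMX Gatherer names could be expressed in YAML. It assumes the JMX Insight rule syntax; the type and unit values are assumptions, and only a subset of the attributes is shown.

```yaml
# Sketch only: JMX Gatherer-style names for the undertow http-listener MBeans.
rules:
  - bean: jboss.as:subsystem=undertow,server=*,http-listener=*
    metricAttribute:
      server: param(server)
      listener: param(http-listener)
    mapping:
      requestCount:
        metric: wildfly.request.count
        type: counter        # assumed type
        unit: "{requests}"   # assumed unit
      bytesSent:
        metric: wildfly.network.io
        type: counter        # assumed type
        unit: By             # assumed unit
        metricAttribute:
          state: const(out)
      bytesReceived:
        metric: wildfly.network.io
        type: counter        # assumed type
        unit: By             # assumed unit
        metricAttribute:
          state: const(in)
```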

SylvainJuge commented 2 months ago

For JVM metrics, the JMX Insight does not provide a YAML file, the feature is implemented in the runtime-metrics module of instrumentation (link). The current definition is aligned with semantic conventions for JVM metrics.

JMX Gatherer provides the following metrics that are not aligned with semconv, all of those can be easily captured with the YAML configuration:

java.lang:type=ClassLoading:
- LoadedClassCount : jvm.classes.loaded
java.lang:type=GarbageCollector,* :
- CollectionCount: jvm.gc.collections.count with name => name
- CollectionTime: jvm.gc.collections.elapsed with name => name
java.lang:type=Memory
- HeapMemoryUsage: jvm.memory.heap
- NonHeapMemoryUsage: jvm.memory.nonheap
java.lang:type=MemoryPool,*
- Usage: jvm.memory.pool with name => name
java.lang:type=Threading:
- ThreadCount : jvm.threads.count

SylvainJuge commented 2 months ago

As a side note, after reviewing differences for jvm, tomcat and wildfly, it becomes more and more obvious to me that there are too many differences to fix. Also, the groovy definitions haven't been modified in 2 or 3 years for some, which means they are very probably obsolete or not really used in practice.

As a consequence, I think the better option for now is to:

finish reviewing the mapping to ensure we can reproduce it with YAML in JMX Gatherer

The steps that will likely follow are:

build a new module that will use the JMX Insight implementation in contrib next to JMX Gatherer
provide a set of YAML definitions for this new module to capture the metrics as they currently are (just to preserve compatibility)
modify the collector jmxreceiver implementation to use this new way to capture JMX metrics
start deprecating the current JMX Gatherer
start improving the metrics definitions so we have a set of common YAML definitions that can be reused between Instrumentation and Contrib (from the consumer side of those metrics, they should be exactly the same).

robsunday commented 2 months ago

Here are my findings regarding jetty:

JMX: org.eclipse.jetty.server.session:context=*,type=sessionhandler,id=*
- MBean property: sessionsCreated --> YAML: jetty.session.sessionsCreated <==> Groovy: jetty.session.count
- MBean property: sessionTimeTotal --> YAML: jetty.session.sessionTimeTotal <==> Groovy: jetty.session.time.total
- minor difference in type: YAML: counter / Groovy: UpDownCounter
- MBean property: sessionTimeMax --> YAML: jetty.session.sessionTimeMax <==> Groovy: jetty.session.time.max
- MBean property: sessionTimeMean --> YAML: jetty.session.sessionTimeMean, not used in Groovy
JMX: org.eclipse.jetty.util.thread:type=queuedthreadpool,id=*
- MBean property: busyThreads --> YAML: jetty.threads.busyThreads <==> Groovy: jetty.thread.count with extra state=busy attribute
  - minor difference in type: YAML: updowncounter / Groovy: Value
- MBean property: idleThreads --> YAML: jetty.threads.idleThreads <==> Groovy: jetty.thread.count with extra state=idle attribute
  - minor difference in type: YAML: updowncounter / Groovy: Value
- MBean property: maxThreads --> YAML: jetty.threads.maxThreads, not used in Groovy
- MBean property: queueSize --> YAML: jetty.threads.queueSize <==> Groovy: jetty.thread.queue.count
  - minor difference in type: YAML: updowncounter / Groovy: Value
JMX: org.eclipse.jetty.io:context=*,type=managedselector,id=*
- MBean property: selectCount --> YAML: jetty.io.selectCount <==> Groovy: jetty.select.count
  - difference in units: YAML: 1 / Groovy: {operations}
JMX: org.eclipse.jetty.logging:type=jettyloggerfactory,id=* not used in Groovy
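
For illustration, the Groovy side of the findings above could be written in YAML roughly as follows. This is a sketch that assumes the JMX Insight rule syntax; gauge is assumed to be the closest YAML type to the Groovy "Value" instruments, and the counter type and the "{threads}" unit are assumptions.

```yaml
# Sketch only: Groovy-style jetty metric names expressed as YAML rules.
rules:
  - bean: org.eclipse.jetty.io:context=*,type=managedselector,id=*
    mapping:
      selectCount:
        metric: jetty.select.count
        type: counter          # assumed type
        unit: "{operations}"   # unit used by the Groovy definition
  - bean: org.eclipse.jetty.util.thread:type=queuedthreadpool,id=*
    mapping:
      busyThreads:
        metric: jetty.thread.count
        type: gauge            # assumed equivalent of Groovy "Value"
        unit: "{threads}"      # assumed unit
        metricAttribute:
          state: const(busy)
      idleThreads:
        metric: jetty.thread.count
        type: gauge            # assumed equivalent of Groovy "Value"
        unit: "{threads}"      # assumed unit
        metricAttribute:
          state: const(idle)
      queueSize:
        metric: jetty.thread.queue.count
        type: gauge            # assumed equivalent of Groovy "Value"
        unit: "{threads}"      # assumed unit
```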

SylvainJuge commented 2 months ago

For hbase, there isn't anything in JMX Insight for it, the mappings are simple and it should be quite straightforward (but a bit tedious) to produce an equivalent YAML to hbase.groovy.

SylvainJuge commented 2 months ago

For hadoop:

JMX attribute tag.Hostname is always mapped to node_name metric attribute in both implementations.

JMX Hadoop:service=NameNode,name=FSNamesystem:

CapacityUsed : hadoop.name_node.capacity.usage <==> hadoop.capacity.CapacityUsed
CapacityTotal: hadoop.name_node.capacity.limit <==> hadoop.capacity.CapacityTotal
BlocksTotal: hadoop.name_node.block.count <==> hadoop.block.BlocksTotal
MissingBlocks: hadoop.name_node.block.missing <==> hadoop.block.MissingBlocks
CorruptBlocks: hadoop.name_node.block.corrupt <==> hadoop.block.CorruptBlocks
VolumeFailuresTotal: hadoop.name_node.volume.failed <==> hadoop.volume.VolumeFailuresTotal
FilesTotal: hadoop.name_node.file.count <==> hadoop.file.FilesTotal
TotalLoad: hadoop.name_node.file.load <==> hadoop.file.TotalLoad
NumLiveDataNodes: hadoop.name_node.data_node.count with state = live <==> hadoop.datenode.Count, same state value (yes, there is a typo in datanode)
NumDeadDataNodes: hadoop.name_node.data_node.count with state = dead <==> hadoop.hadoop.datenode.Count, same state value

SylvainJuge commented 2 months ago

For cassandra:

There is no mapping in YAML, the mapping is verbose and the lack of support for templates or string interpolation would make it quite tedious to write, but it's more an annoyance than a really blocking issue.

For example, here are a few of the MBeans:

org.apache.cassandra.metrics:type=ClientRequest
org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice
org.apache.cassandra.metrics:type=ClientRequest,scope=Read
org.apache.cassandra.metrics:type=ClientRequest,scope=Write
each of the scope= entries above comes in 3 variants, obtained by adding ,name= with a value of Unavailables, Timeouts or Failures
org.apache.cassandra.metrics:type=Storage,name=Load

There isn't anything that could not be mapped using YAML syntax.
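
To give an idea of the verbosity, here is a sketch of two such rules. It assumes the JMX Insight rule syntax and that the MBeans expose a Count attribute; the metric name, the operation and status attribute names, the type and the unit are all hypothetical and only serve the illustration.

```yaml
# Sketch only: without templates, each scope/name combination needs its own rule.
rules:
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts
    metricAttribute:
      operation: const(Read)      # hypothetical attribute name
      status: const(Timeout)      # hypothetical attribute name
    mapping:
      Count:
        metric: cassandra.client.request.error.count   # hypothetical metric name
        type: counter             # assumed type
        unit: "{errors}"          # assumed unit
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Timeouts
    metricAttribute:
      operation: const(Write)     # hypothetical attribute name
      status: const(Timeout)      # hypothetical attribute name
    mapping:
      Count:
        metric: cassandra.client.request.error.count   # hypothetical metric name
        type: counter             # assumed type
        unit: "{errors}"          # assumed unit
```

Covering 3 scopes and 3 names this way means nine near-identical rules, which is the tedious part; a wildcard such as scope=* with param(scope) might shorten it, but it would also match any other scope the server exposes.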

robsunday commented 2 months ago

For activemq everything except property descriptions seems to be in sync. Metric attributes are consistent.

JMX: org.apache.activemq:type=Broker,brokerName=*,destinationType=Queue,destinationName=* and org.apache.activemq:type=Broker,brokerName=*,destinationType=Topic,destinationName=*
- ProducerCount: activemq.producer.count <==> activemq.ProducerCount
- ConsumerCount: activemq.consumer.count <==> activemq.ConsumerCount
- MemoryPercentUsage: activemq.memory.usage <==> activemq.memory.MemoryPercentUsage
- QueueSize: activemq.message.current <==> activemq.message.QueueSize
- ExpiredCount: activemq.message.expired <==> activemq.message.ExpiredCount
- EnqueueCount: activemq.message.enqueued <==> activemq.message.EnqueueCount
- DequeueCount: activemq.message.dequeued <==> activemq.message.DequeueCount
- AverageEnqueueTime: activemq.message.wait_time.avg <==> activemq.message.AverageEnqueueTime

All desc fields in properties need to be synchronized because the wording is different.

JMX: org.apache.activemq:type=Broker,brokerName=*
- CurrentConnectionsCount: activemq.connection.count <==> activemq.connections.CurrentConnectionsCount
- StorePercentUsage: activemq.disk.store_usage <==> activemq.disc.StorePercentUsage
- TempPercentUsage: activemq.disk.temp_usage <==> activemq.disc.TempPercentUsage

robsunday commented 2 months ago

solr case is very similar to hbase. No YAML at the moment but creating it should not be an issue.

SylvainJuge commented 2 months ago

For kafka, the YAML is kafka-broker.yaml

JMX: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:
- Count: kafka.message.count
JMX: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec:
- Count: kafka.request.count with type = produce
JMX: kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec:
- Count: kafka.request.count with type = fetch
JMX: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec:
- Count: kafka.request.failed with type = produce
JMX: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec:
- Count: kafka.request.failed with type = fetch

I haven't checked in detail all the others, but they look identical between the two implementations.

I discovered that we have a way to use multiple mbeans names with the same metrics definition as seen in kafka-broker.yaml

For kafka-consumer.groovy and kafka-producer.groovy there is no equivalent YAML mapping though.
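
For reference, the kafka.request.count part of the broker mapping listed earlier can be sketched in YAML as follows. This is not a copy of kafka-broker.yaml; it assumes the JMX Insight rule syntax, and the type and unit values are assumptions.

```yaml
# Sketch only: one metric fed by two MBeans, distinguished by a constant attribute.
rules:
  - bean: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
    metricAttribute:
      type: const(produce)
    mapping:
      Count:
        metric: kafka.request.count
        type: counter        # assumed type
        unit: "{requests}"   # assumed unit
  - bean: kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec
    metricAttribute:
      type: const(fetch)
    mapping:
      Count:
        metric: kafka.request.count
        type: counter        # assumed type
        unit: "{requests}"   # assumed unit
```

An equivalent YAML file for kafka-consumer.groovy or kafka-producer.groovy would presumably follow the same pattern with the client-side MBeans.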