Attempt to align and cleanup some jmx metrics

SylvainJuge commented 1 week ago

This is an attempt to fix a few errors and inconsistencies that I've found in the JMX metrics captured with the JMX Metric Insight feature.

I have intentionally limited the scope to tomcat, jetty and wildfly, but similar changes might be applied to other systems as a follow-up.

- clarify the strategy for pre-defined metrics
- simplify the tomcat metric prefix to "tomcat." for consistency with jetty, wildfly, ...
- fix tomcat busy/idle threads, as explained in this discussion.
- rename metrics to use the singular form, following the (experimental) metrics semconv recommendations.
- rename units to use the singular form, following the (stable) units semconv recommendations.
- move tomcat request-related metrics to the tomcat.request.* namespace for consistency with wildfly.request.*
- align tomcat with system.network.io by using tomcat.network.io for transferred bytes
- align wildfly with system.network.io by using wildfly.network.io for transferred bytes (only the direction attribute had to be changed).
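
To illustrate the network alignment in the last two items, here is a minimal Python sketch of how sent and received byte counters can be folded into a single `<prefix>.network.io` metric distinguished by a direction attribute. The MBean attribute names `bytesSent` and `bytesReceived`, the attribute key `network.io.direction` and its values are assumptions made for illustration; they are not taken from the rule files in this PR.

```python
# Hypothetical mapping from MBean attribute to direction value (assumed names).
DIRECTION_BY_ATTRIBUTE = {
    "bytesSent": "transmit",
    "bytesReceived": "receive",
}


def to_data_points(prefix, mbean_values):
    """Turn raw MBean byte counters into points of one '<prefix>.network.io' metric."""
    points = []
    for attribute, direction in DIRECTION_BY_ATTRIBUTE.items():
        if attribute in mbean_values:
            points.append({
                "metric": f"{prefix}.network.io",
                "unit": "By",  # UCUM code for bytes
                # Assumed attribute key, modelled on system.network.io.
                "attributes": {"network.io.direction": direction},
                "value": mbean_values[attribute],
            })
    return points


points = to_data_points("tomcat", {"bytesSent": 10, "bytesReceived": 4})
```

Both counters end up under the same metric name, so a backend can sum or split them by direction instead of dealing with two unrelated metric names.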

For the overall strategy, I agree that covering every metric of every platform is neither possible nor something we aim for. For example, with Wildfly the db pool exposes more than 50 attributes that could be captured as metrics.

I think one of the important things that could make this type of mapping somewhat manageable over time is to use the following strategy for metric names and their attributes:

- use or align with semconv when it fits
- keep the MBean attribute name otherwise: this preserves the semantics of the observed system without having to redefine common metrics or deal with subtle implementation details.
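
The two rules above can be sketched as a small Python function. This is only an illustration of the naming strategy: the lookup table and its entries are hypothetical and do not reflect the actual rule files.

```python
# Hypothetical lookup of MBean attributes that have a semconv-aligned name.
# The entries are illustrative only.
SEMCONV_ALIGNED = {
    "bytesSent": "network.io",
    "bytesReceived": "network.io",
}


def metric_name(prefix: str, mbean_attribute: str) -> str:
    """Use the semconv-aligned suffix when one fits, else keep the MBean attribute name."""
    suffix = SEMCONV_ALIGNED.get(mbean_attribute, mbean_attribute)
    return f"{prefix}.{suffix}"


aligned = metric_name("tomcat", "bytesSent")
kept = metric_name("wildfly.db.client.transaction", "numberOfTransactions")
```

With this table, `aligned` is `tomcat.network.io`, while `kept` retains the MBean attribute name, including its case, as `wildfly.db.client.transaction.numberOfTransactions`.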

Checklist & follow-ups

[ ] wildfly.db.client.connection: check with the implementation whether the active/idle/wait states form a partition (in which case using a single metric + attribute would make sense); the wildfly documentation seems to imply that this is not the case.
[x] wildfly.db.client.connection should use the db.client.connections.state from semconv for the connection state.
- update: plural form was removed in semconv in https://github.com/open-telemetry/semantic-conventions/pull/1125, to be released in 1.27.
[x] fix the case of wildfly.db.client.transaction.NumberOfTransactions: it should probably follow the MBean attribute name, so numberOfTransactions
[x] maybe try to fix the database semconv metric attributes to use the singular form, for example db.client.connections.pool.name should probably be renamed to db.client.connection.pool.name
- update: already covered in https://github.com/open-telemetry/semantic-conventions/pull/1125

PeterF778 commented 1 week ago

Originally, JMX Metric Insight borrowed the metric definitions from JMX Metric Gatherer, and was bug-for-bug compatible. This was caused equally by our laziness as by the desire to allow users to transition smoothly to in-process metric collection. I do not know how popular JMX Metric Insight is, but I know from experience that changes to metric names/attributes can sometimes be painful for the users. Perhaps it will be helpful for the customers if we keep the old metric configuration files around for some time as tomcat_old or tomcat_legacy etc.

SylvainJuge commented 1 week ago

> Originally, JMX Metric Insight borrowed the metric definitions from JMX Metric Gatherer, and was bug-for-bug compatible. This was caused equally by our laziness as by the desire to allow users to transition smoothly to in-process metric collection. I do not know how popular JMX Metric Insight is, but I know from experience that changes to metric names/attributes can sometimes be painful for the users. Perhaps it will be helpful for the customers if we keep the old metric configuration files around for some time as tomcat_old or tomcat_legacy etc.

I completely understand the duplication strategy here, but it's probably time to remove the duplication and simplify things:

- there are a couple of issues about this: https://github.com/open-telemetry/opentelemetry-java-contrib/issues/736 and https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/9765
- JMX Metric Gatherer supports more target systems than JMX Metric Insight
- changing anything requires two PRs in two repositories, for example https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/10115 and https://github.com/open-telemetry/opentelemetry-java-contrib/pull/1269
- it seems to me that the static yaml file definition covers what is currently captured through groovy scripts in JMX Metric Gatherer (I could be completely wrong on this one)

Until this duplication is removed, we will have to backport such changes to the contrib repo. Implementation-wise, a common implementation would likely reside in the contrib repo and be included in the instrumentation agent (there are already similar dependencies for the aws and gcp resource providers).

Regarding compatibility, I really don't know what the best approach is here. All the JMX metrics depend heavily on implementation details, so giving them a formal definition in semconv and a stability status is not possible. Maybe keeping previous iterations of the yaml files could provide this.

SylvainJuge commented 2 days ago

Status following June 20th SIG meeting:

- we first need to align the implementations in contrib/instrumentation while preserving compatibility with the current metrics
- we can provide an updated version of the metrics on the instrumentation side, with an opt-in to use the new definitions (the default stays on the current definitions for compatibility)
- switching to the new metrics could be aligned with the next major version, which should come around the time the database semconv becomes stable.
- this PR will stay in draft until then; parts of it will of course be reused along the way.
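
As a rough illustration of the opt-in described above, the following Python sketch selects between the current and the updated rule definitions. The directory names and the `use_updated_metrics` flag are hypothetical; the actual configuration mechanism has not been decided in this thread.

```python
# Hypothetical resource layout; the directory names and the opt-in flag are
# assumptions for illustration, not the agent's actual configuration.
CURRENT_DIR = "jmx/rules"
UPDATED_DIR = "jmx/rules/updated"


def rule_file(target_system: str, use_updated_metrics: bool = False) -> str:
    """Return the YAML rule file to load for a target system.

    The current definitions stay the default so existing users keep their
    metric names; the updated ones are used only when explicitly requested.
    """
    base = UPDATED_DIR if use_updated_metrics else CURRENT_DIR
    return f"{base}/{target_system}.yaml"


default_file = rule_file("tomcat")
opt_in_file = rule_file("tomcat", use_updated_metrics=True)
```

Flipping the default of the flag at the next major version would then switch users to the new metrics without moving any files.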

open-telemetry / opentelemetry-java-instrumentation

Attempt to align and cleanup some jmx metrics #11621
