Refactor & merge JMX gatherer & insight implementations

SylvainJuge commented 4 months ago

As discussed in June 20th Java SIG meeting, we have two distinct ways to capture JMX metrics:

with instrumentation agent through the JMX insight feature
with contrib through the JMX gatherer (in this repository)

As a consequence, we have:

JMX gatherer is using groovy scripts for metrics definition
JMX insight is using YAML files for metrics definition
both JMX insight and JMX gatherer allow to provide custom user-provided configuration using respectively a groovy script and a yaml file
JMX gatherer provides ready-to-use metrics configuration for some target systems (tomcat, kafka, cassandra, ...)
JMX insight provides ready-to-use metrics configuration for a few of the target systems supported by JMX gatherer.

In addition to the duplication of code, the metrics definition themselves are duplicated in two different incompatible formats (YAML and Groovy scripts), which makes the maintenance even harder and further complicates their evolution. For example trying to change definitions in instrumentation like in opentelemetry-java-instrumentation#11621 becomes more complicated than it should.

JMX gatherer allows to capture JMX metrics from outside of the JVM and is used in the collector-contrib repository. Usages of it can be found by searching for usages of otel.jmx.service.url on GH.

JMX gatherer is used in the opentelemetry-collector-contrib repository by the jmxreceiver

Proposal

reuse YAML-based implementation of JMX insight (instrumentation) for JMX gatherer implementation
replace target system metrics definition with YAML equivalents
drop support for user-provided groovy scripts, they would have to be replaced by YAML equivalent
for now keep the metrics definitions identical to their current state to preserve compatibility
aligning any metrics definition differences (if there is any) would be handled through follow-up updates to the YAML definitions.

Removing the user-provided groovy script support is the only breaking change here, the rationale is:

it is still possible (and easier) to capture custom metrics with YAML, so the feature is preserved
maintenance of YAML files is expected to be simpler and allows to keep insight/gatherer in sync
usage of otel.jmx.groovy.script in opentelemetry-collector-contrib was removed 2 years ago for security reasons (opentelemetry-collector-contrib#6750 and opentelemetry-collector-contrib#9721)
searching for references to otel.jmx.groovy.script on GH returns mostly point to old forks of the opentelemetry-collector-contrib, so we can infer this feature is probably rarely used

All of this assumes that there are no limitations of the YAML metrics definition to represent the current groovy script definitions. In case of unsolvable incompatibilities we will have to decide on a case-by-case basis.

Implementation steps

The current path to implement this brings a few breaking changes so we decided to create a new "JMX Scraper" component that will provide an alternative to "JMX Gatherer" for an easier migration.

[x] bootstrap a new JMX Scraper module: #1445
[x] allow reusing JMX Insights with remote connections https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/12178 (part of instrumentation 2.9.0 release).
[ ] implement & test SSL connection: server certificate + SSL RMI registry
[ ] test SSL connection: client certificate
[x] implement and test for a single target system tomcat with definitions that match tomcat.groovy https://github.com/open-telemetry/opentelemetry-java-contrib/pull/1485
[ ] publish and document JMX Scraper "MVP" for feedback and test
[x] maybe add ability to aggregate some mbean metrics to replicate https://github.com/open-telemetry/opentelemetry-java-contrib/pull/1366 in contrib, however this might also be covered by better metrics definition that use MBean key/values.
[ ] implement and test other target systems
- [x] activemq #1497
- [x] cassandra #1515
- [ ] hadoop
- [ ] hbase
- [x] jetty #1517
- [x] jvm #1485
- [ ] kafka-consumer
- [ ] kafka-producer
- [ ] kafka
- [ ] solr
- [x] tomcat #1485
- [ ] wildfly #1531
[ ] document usage and configuration
[ ] make otel collector jmx receiver able to use it

shivanshuraj1333 commented 4 months ago

Additionally, there are similar agents which does the same thing. 1) https://github.com/prometheus/jmx_exporter 2) https://github.com/jolokia/jolokia

Both of which seems to be more mature than the current OTeL JMX gatherer and insight agents. Also, JMX insight can't be run as an standalone server (no information in the docs), while the other can run as standalone agents. Some potential design considerations while merging the JMX gatherer and insight agents 1) capability to collect MBeans from multiple JMX hosts, useful to use the agent with collector with only one JMX receiver. 2) yaml configuration, maybe similar to configuration from prometheus JMX java agent, it has a nice support for regex based MBean filtering.

SylvainJuge commented 4 months ago

Thanks @shivanshuraj1333 ,

I agree with you for Prometheus JMX exporter and Jolokia, those might be more mature but from what I know about them they fit a "pull-based metrics" model as they expose metrics to HTTP/prometheus, here the goal is to fit the "push-based metrics" model where the metrics are sent to the collector, not pulled by it.

Also, one of the features of Jolokia is to expose the whole (or parts of) JMX interface, which has lots of security implications, so having a dedicated implementation limited to capturing metrics makes sense.

Those two suggestions could really be worth adding as follow-up improvements:

collect from multiple JMX hosts
support regex/wildcards in yaml configuration

For the wildcards part, do you know if (and how) they are documented in Prometheus ? While it's just a plain mapping of the JMX tree, building a reference of the metrics for building reports might be tricky without knowledge on the implementation.

shivanshuraj1333 commented 4 months ago

I see,

here the goal is to fit the "push-based metrics" model where the metrics are sent to the collector, not pulled by it

But I think today it's a mixture in OTeL, if I use OTeL metrics gatherer as a receiver in collector it becomes pull-based as the gatherer collects MBeans from the configured jmx servers. But if it is used as a standalone agent then it follows the push-based model

For the wildcards part, do you know if (and how) they are documented in Prometheus ? While it's just a plain mapping of the JMX tree, building a reference of the metrics for building reports might be tricky without knowledge on the implementation.

Though it's not documented but looking at the code the flow (on a very high level), looks like this, In Prometheus I can define a rule like below

- pattern : kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
  name: kafka_server_$1_$2
  type: GAUGE
  labels:
    clientId: "$3"
    topic: "$4"
    partition: "$5"

Now when the JMXCollector.java initiates, it captures the rules from pattern, parses the rules into regex and puts it into a rule cache.

When the scraper initiates, it queries MBeans names and puts that into jmxMBeanPropertyCache, ref (this basically contains all the MBeans that are collected), now when the scraper processes the recorded bean values, it checks the recorded beans against the rules cache it created above, ref, and puts them in matchedRules cache, which is then used for creating the final metric snapshot.

cc'ing @dhoard who added some recent commits, and may have some ideas to share...

shivanshuraj1333 commented 4 months ago

I have one more suggestion: should we consider making the JMX receiver in the collector compatible with any supported JMX agent? This would involve significant work, as the configuration would need to be parsed for each supported agent (e.g., Prometheus, Jolokia, OTeL metrics gatherer). Additionally, the receiver would need to internally convert collected metrics into OTLP format, such as converting Prometheus metrics into Prometheus format.

The advantage is that OTeL users could utilize any supported JMX collector. However, the downside is the high maintenance required for the JMX receiver.

dhoard commented 4 months ago

Now when the JMXCollector.java initiates, it captures the rules from pattern parses the rules into regex and puts it into a rule cache.

When the scraper initiates, it queries MBeans names and puts that into jmxMBeanPropertyCache, ref (this basically contains all the MBeans that are collected), now when the scraper processes the recorded bean values, it checks the recorded beans against the rules cache it created above, ref, and puts them in matchedRules cache, which is then used for creating the final metric snapshot.

@shivanshuraj1333 that sounds correct.

FYI - The next Prometheus JMX Exporter release will support OTel push (along with the current Prometheus pull-based model.)

https://github.com/prometheus/jmx_exporter/pull/981

shivanshuraj1333 commented 4 months ago

Thanks @dhoard for chiming in. btw do you've any thoughts on: 1) Does it makes sense to have native support for vendor specific JMX metrics collector jar in JMX receiver? (in this case prometheus JMX exporter), the current JMX receiver helps in configuring the JMX metrics gatherer jar, IMO, it can be extended to other agents as well. 2) (off topic but would be require as knowledge for improving otel metrics collectors) Why prometheus/jmx_exporter doesn't support scraping from multiple JMX sources?

dhoard commented 4 months ago

Not sure I follow "native support for vendor-specific JMX metrics collector jar". Can you elaborate?
There have been requests to add scraping multiple endpoints to the standalone exporter, but I feel that the negatives/code complexity outweigh the positives.

High level...

Enterprises typically require high availability/requiring at least 2 instances + a load balancer. Recent related issue/discussion: https://github.com/prometheus/jmx_exporter/issues/974
Some metrics are collected when using the Java agent version that aren't easily/can't be collected using the standalone exporter.
- https://github.com/prometheus/client_java/tree/main/prometheus-metrics-instrumentation-jvm/src/main/java/io/prometheus/metrics/instrumentation/jvm
If all target endpoints are using the same RMI configuration/security, it's a security attack vector.
If target endpoints are using target-specific RMI configuration/security, the code complexity/integration testing and configuration/management becomes unmanageable.

shivanshuraj1333 commented 4 months ago

I see, that makes sense, probably the merger of OTeL insight and metrics gatherer would have to face the similar challenges, if we plan to support multiple JMX endpoint.

Not sure I follow "native support for vendor-specific JMX metrics collector jar". Can you elaborate?

So, if you look at JMX receiver in the collector, it only supports JMX_metrics_gatherer agent from OTeL.

If we tweak the config of the collector to support any generic agent, the JMX receiver can be extended to support any other agent (like prometheus agent). This would give some luxury to the end user if they want to use newly supported vendor specific agents (like JMX exporter from prometheus).

dhoard commented 4 months ago

Thanks for the clarification!

For an application that has a proprietary metrics API that doesn't implement the Prometheus/OpenMetrics exposition format or OTel metrics push, I could see where it could be used, but I feel writing an application-specific collector is the best approach.

SylvainJuge commented 4 months ago

So if I'm following correctly, we have:

OTel JMX Insights which are part of the OTel agent
OTel JMX Gatherer which is a standalone executable jar used by the OTel collector
Prometheus JMX Exporter that will soon be able to export to OpenTelemetry with https://github.com/prometheus/jmx_exporter/pull/981

So once https://github.com/prometheus/jmx_exporter/pull/981 is merged we have a potential overlap between OTel JMX gatherer and Prometheus exporter as both can be used to export to OTel, and adding support for Prometheus JMX exporter in OTel collector seems a good addition.

The goal here is to merge 1) and 2) implementations so they behave the same way and can capture identical metrics, those two address the need to be able to capture JMX metrics from within and from outside of the JVM.

For me, the fact that JMX Gatherer could potentially be replaced by Prometheus JMX exporter is nice beyond the scope of this refactoring.

Also, when collecting JMX metrics from outside of the JVM (with OTel Gatherer or Prometheus exporter), it seems that trying to connect to multiple JVMs provides some extra complexity if we have to deal with the RMI security/authentication. Maybe it's worth discussing this in a separate dedicated issue, but is there a very large impact to having a small JVM process for the gatherer/exporter run by the collector per monitored remote JVM ?

shivanshuraj1333 commented 4 months ago

Yes, you are correct! Some additional info on top of it. 1) OTel JMX Insights, can run as an agent but not as standalone server, it has some preconfigured rules to extract MBeans from JMX server, can be extended by specifying yaml based configurations to extract MBeans.

2) OTel JMX Gatherer can be run as standalone server, as an agent, or inside collector (within JMX receiver), functionality wise it's analogous to prometheus JMX_exporter, but configuration wise it's quite complex, as one needs to write groovy scripts which defines the actual object in form of instrumentation, which is to me is like writing half of the code as part of configuration, eg

      otel.instrument(otel.mbean("Catalina:type=ThreadPool,name=*"), "tomcat.threads", "description", "1",
      ["proto_handler" : { mbean -> mbean.name().getKeyProperty("name") }],
      ["currentThreadCount": ["Thread Type": {mbean -> "current"}],
      "currentThreadsBusy": ["Thread Type": {mbean -> "busy"}]],
      otel.&doubleValueObserver)

would search/fetch the above MBean through the gatherer's connection with JMX server.

3) JMX_exporter from prometheus is like more advanced version of 2, where user can define simplified configs, and it also caches a lot of things to reduce the number of calls to JMX server.

On the other hand, the JMX receiver on the collector helps in facilitating running the gatherer jar in the collector itself, but should allow running other JARs as well (like JMX_exporter from prometheus). On the collector side, we can create a unified config translation for respective JARs. (and that is what actually it is doing today, but somewhat hardcoded for gatherer).

For me, the fact that JMX Gatherer could potentially be replaced by Prometheus JMX exporter is nice beyond the scope of this refactoring.

+1 to this, if we compare today, in all the cases JMX gatherer can be replaced by Prometheus JMX exporter, but not in collector. As a user if I'm running things on scale, I'd prefer using Prometheus JMX exporter, based on current implementation of gatherer.

AFAIU, the Prometheus JMX exporter is single threaded, IMO to support metrics collections from multiple JMX servers, if we make it multithreaded where each thread is responsible for corresponding RMI belonging to that particular JMX server, I think it can work without the security concerned raised above. ^ this is something we can keep in mind while merging insight and gatherer. and +1 to have a separate issue for this.

shivanshuraj1333 commented 4 months ago

AFAIU, the Prometheus JMX exporter is single threaded, IMO to support metrics collections from multiple JMX servers, if we make it multithreaded where each thread is responsible for corresponding RMI belonging to that particular JMX server, I think it can work without the security concerned raised above.

Particular to this point, @dhoard WDYT? This is mostly relevant when running as a standalone server and not as agent.

dhoard commented 4 months ago

AFAIU, the Prometheus JMX exporter is single threaded

The Prometheus JMX Exporter...

is designed to target a single server
the HTTP endpoint is multi-threaded for both modes (agent and standalone.)
- multiple HTTP clients can call scrape in parallel
- each call will perform a metrics collection
the pending OTel integration is single-threaded
- i.e. on an interval, collect metrics and push

As a data point, in the past, Maven Central statistics show that the Prometheus JMX Exporter Java agent is primarily being used (by a huge margin) over the standalone exporter.

open-telemetry / opentelemetry-java-contrib

Refactor & merge JMX gatherer & insight implementations #1362

Proposal

Implementation steps