open-telemetry / opentelemetry-java-contrib

https://opentelemetry.io
Apache License 2.0
145 stars 118 forks source link

Refactor & merge JMX gatherer & insight implementations #1362

Open SylvainJuge opened 1 week ago

SylvainJuge commented 1 week ago

As discussed in June 20th Java SIG meeting, we have two distinct ways to capture JMX metrics:

As a consequence, we have:

In addition to the duplication of code, the metrics definition themselves are duplicated in two different incompatible formats (YAML and Groovy scripts), which makes the maintenance even harder and further complicates their evolution. For example trying to change definitions in instrumentation like in opentelemetry-java-instrumentation#11621 becomes more complicated than it should.

JMX gatherer allows to capture JMX metrics from outside of the JVM and is used in the collector-contrib repository. Usages of it can be found by searching for usages of otel.jmx.service.url on GH.

JMX gatherer is used in the opentelemetry-collector-contrib repository by the jmxreceiver

Proposal

Removing the user-provided groovy script support is the only breaking change here, the rationale is:

All of this assumes that there are no limitations of the YAML metrics definition to represent the current groovy script definitions. In case of unsolvable incompatibilities we will have to decide on a case-by-case basis.

shivanshuraj1333 commented 4 days ago

Additionally, there are similar agents which does the same thing. 1) https://github.com/prometheus/jmx_exporter 2) https://github.com/jolokia/jolokia

Both of which seems to be more mature than the current OTeL JMX gatherer and insight agents. Also, JMX insight can't be run as an standalone server (no information in the docs), while the other can run as standalone agents. Some potential design considerations while merging the JMX gatherer and insight agents 1) capability to collect MBeans from multiple JMX hosts, useful to use the agent with collector with only one JMX receiver. 2) yaml configuration, maybe similar to configuration from prometheus JMX java agent, it has a nice support for regex based MBean filtering.

SylvainJuge commented 3 days ago

Thanks @shivanshuraj1333 ,

I agree with you for Prometheus JMX exporter and Jolokia, those might be more mature but from what I know about them they fit a "pull-based metrics" model as they expose metrics to HTTP/prometheus, here the goal is to fit the "push-based metrics" model where the metrics are sent to the collector, not pulled by it.

Also, one of the features of Jolokia is to expose the whole (or parts of) JMX interface, which has lots of security implications, so having a dedicated implementation limited to capturing metrics makes sense.

Those two suggestions could really be worth adding as follow-up improvements:

For the wildcards part, do you know if (and how) they are documented in Prometheus ? While it's just a plain mapping of the JMX tree, building a reference of the metrics for building reports might be tricky without knowledge on the implementation.

shivanshuraj1333 commented 3 days ago

I see,

here the goal is to fit the "push-based metrics" model where the metrics are sent to the collector, not pulled by it

But I think today it's a mixture in OTeL, if I use OTeL metrics gatherer as a receiver in collector it becomes pull-based as the gatherer collects MBeans from the configured jmx servers. But if it is used as a standalone agent then it follows the push-based model


For the wildcards part, do you know if (and how) they are documented in Prometheus ? While it's just a plain mapping of the JMX tree, building a reference of the metrics for building reports might be tricky without knowledge on the implementation.

Though it's not documented but looking at the code the flow (on a very high level), looks like this, In Prometheus I can define a rule like below

- pattern : kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
  name: kafka_server_$1_$2
  type: GAUGE
  labels:
    clientId: "$3"
    topic: "$4"
    partition: "$5"

Now when the JMXCollector.java initiates, it captures the rules from pattern, parses the rules into regex and puts it into a rule cache.

When the scraper initiates, it queries MBeans names and puts that into jmxMBeanPropertyCache, ref (this basically contains all the MBeans that are collected), now when the scraper processes the recorded bean values, it checks the recorded beans against the rules cache it created above, ref, and puts them in matchedRules cache, which is then used for creating the final metric snapshot.

cc'ing @dhoard who added some recent commits, and may have some ideas to share...

shivanshuraj1333 commented 3 days ago

I have one more suggestion: should we consider making the JMX receiver in the collector compatible with any supported JMX agent? This would involve significant work, as the configuration would need to be parsed for each supported agent (e.g., Prometheus, Jolokia, OTeL metrics gatherer). Additionally, the receiver would need to internally convert collected metrics into OTLP format, such as converting Prometheus metrics into Prometheus format.

The advantage is that OTeL users could utilize any supported JMX collector. However, the downside is the high maintenance required for the JMX receiver.

dhoard commented 3 days ago

Now when the JMXCollector.java initiates, it captures the rules from pattern parses the rules into regex and puts it into a rule cache.

When the scraper initiates, it queries MBeans names and puts that into jmxMBeanPropertyCache, ref (this basically contains all the MBeans that are collected), now when the scraper processes the recorded bean values, it checks the recorded beans against the rules cache it created above, ref, and puts them in matchedRules cache, which is then used for creating the final metric snapshot.

@shivanshuraj1333 that sounds correct.

FYI - The next Prometheus JMX Exporter release will support OTel push (along with the current Prometheus pull-based model.)

https://github.com/prometheus/jmx_exporter/pull/981

shivanshuraj1333 commented 3 days ago

Thanks @dhoard for chiming in. btw do you've any thoughts on: 1) Does it makes sense to have native support for vendor specific JMX metrics collector jar in JMX receiver? (in this case prometheus JMX exporter), the current JMX receiver helps in configuring the JMX metrics gatherer jar, IMO, it can be extended to other agents as well. 2) (off topic but would be require as knowledge for improving otel metrics collectors) Why prometheus/jmx_exporter doesn't support scraping from multiple JMX sources?

dhoard commented 3 days ago
  1. Not sure I follow "native support for vendor-specific JMX metrics collector jar". Can you elaborate?

  2. There have been requests to add scraping multiple endpoints to the standalone exporter, but I feel that the negatives/code complexity outweigh the positives.

High level...

shivanshuraj1333 commented 3 days ago

I see, that makes sense, probably the merger of OTeL insight and metrics gatherer would have to face the similar challenges, if we plan to support multiple JMX endpoint.

Not sure I follow "native support for vendor-specific JMX metrics collector jar". Can you elaborate?

So, if you look at JMX receiver in the collector, it only supports JMX_metrics_gatherer agent from OTeL.

If we tweak the config of the collector to support any generic agent, the JMX receiver can be extended to support any other agent (like prometheus agent). This would give some luxury to the end user if they want to use newly supported vendor specific agents (like JMX exporter from prometheus).

dhoard commented 3 days ago

Thanks for the clarification!

For an application that has a proprietary metrics API that doesn't implement the Prometheus/OpenMetrics exposition format or OTel metrics push, I could see where it could be used, but I feel writing an application-specific collector is the best approach.

SylvainJuge commented 1 day ago

So if I'm following correctly, we have:

  1. OTel JMX Insights which are part of the OTel agent
  2. OTel JMX Gatherer which is a standalone executable jar used by the OTel collector
  3. Prometheus JMX Exporter that will soon be able to export to OpenTelemetry with https://github.com/prometheus/jmx_exporter/pull/981

So once https://github.com/prometheus/jmx_exporter/pull/981 is merged we have a potential overlap between OTel JMX gatherer and Prometheus exporter as both can be used to export to OTel, and adding support for Prometheus JMX exporter in OTel collector seems a good addition.

The goal here is to merge 1) and 2) implementations so they behave the same way and can capture identical metrics, those two address the need to be able to capture JMX metrics from within and from outside of the JVM.

For me, the fact that JMX Gatherer could potentially be replaced by Prometheus JMX exporter is nice beyond the scope of this refactoring.


Also, when collecting JMX metrics from outside of the JVM (with OTel Gatherer or Prometheus exporter), it seems that trying to connect to multiple JVMs provides some extra complexity if we have to deal with the RMI security/authentication. Maybe it's worth discussing this in a separate dedicated issue, but is there a very large impact to having a small JVM process for the gatherer/exporter run by the collector per monitored remote JVM ?

shivanshuraj1333 commented 1 day ago

Yes, you are correct! Some additional info on top of it. 1) OTel JMX Insights, can run as an agent but not as standalone server, it has some preconfigured rules to extract MBeans from JMX server, can be extended by specifying yaml based configurations to extract MBeans.

2) OTel JMX Gatherer can be run as standalone server, as an agent, or inside collector (within JMX receiver), functionality wise it's analogous to prometheus JMX_exporter, but configuration wise it's quite complex, as one needs to write groovy scripts which defines the actual object in form of instrumentation, which is to me is like writing half of the code as part of configuration, eg

      otel.instrument(otel.mbean("Catalina:type=ThreadPool,name=*"), "tomcat.threads", "description", "1",
      ["proto_handler" : { mbean -> mbean.name().getKeyProperty("name") }],
      ["currentThreadCount": ["Thread Type": {mbean -> "current"}],
      "currentThreadsBusy": ["Thread Type": {mbean -> "busy"}]],
      otel.&doubleValueObserver)
would search/fetch the above MBean through the gatherer's connection with JMX server.

3) JMX_exporter from prometheus is like more advanced version of 2, where user can define simplified configs, and it also caches a lot of things to reduce the number of calls to JMX server.


On the other hand, the JMX receiver on the collector helps in facilitating running the gatherer jar in the collector itself, but should allow running other JARs as well (like JMX_exporter from prometheus). On the collector side, we can create a unified config translation for respective JARs. (and that is what actually it is doing today, but somewhat hardcoded for gatherer).


For me, the fact that JMX Gatherer could potentially be replaced by Prometheus JMX exporter is nice beyond the scope of this refactoring.

+1 to this, if we compare today, in all the cases JMX gatherer can be replaced by Prometheus JMX exporter, but not in collector. As a user if I'm running things on scale, I'd prefer using Prometheus JMX exporter, based on current implementation of gatherer.


AFAIU, the Prometheus JMX exporter is single threaded, IMO to support metrics collections from multiple JMX servers, if we make it multithreaded where each thread is responsible for corresponding RMI belonging to that particular JMX server, I think it can work without the security concerned raised above. ^ this is something we can keep in mind while merging insight and gatherer. and +1 to have a separate issue for this.

shivanshuraj1333 commented 1 day ago

AFAIU, the Prometheus JMX exporter is single threaded, IMO to support metrics collections from multiple JMX servers, if we make it multithreaded where each thread is responsible for corresponding RMI belonging to that particular JMX server, I think it can work without the security concerned raised above.

Particular to this point, @dhoard WDYT? This is mostly relevant when running as a standalone server and not as agent.

dhoard commented 1 day ago

AFAIU, the Prometheus JMX exporter is single threaded

The Prometheus JMX Exporter...

As a data point, in the past, Maven Central statistics show that the Prometheus JMX Exporter Java agent is primarily being used (by a huge margin) over the standalone exporter.