Closed lfrancke closed 1 week ago
@lfrancke what is the cutoff date for this decision? IMO, it is ok to wait until after the on-site, but it would be good to have a rough date in mind, and also a list of things that should be tested (so we can continue testing everything else and trust that if this comes in, the necessary things will be re-tested).
From memory we were waiting on https://github.com/prometheus/jmx_exporter/pull/995 (edit: ah yeah I see you linked that).
The maintainer is actively working on this. I hope we'll have a release "soon".
Can we do it the other way around? Tell me what the latest date is you'd accept this change and I'll make sure that we have a solution ready by then.
Sure thing, how about a week before release (CoB on Friday, 8th November)? I think by then it wouldn't be too much effort to add an extra suite of tests no matter which way this goes. We can also extend that somewhat if we feel like we have capacity.
I just tested the latest/current main branch of JMX Exporter and can confirm that it fixes the performance issue.
So, if we don't get a release in time we can build one ourselves and use a patched version.
Let's wait a few days longer.
In SDP 24.7 we upgraded JMX Exporter from 0.20 to 1.0.1. This is the tool that allows us to expose JMX as Prometheus metrics, and it is in use for Hadoop, HBase, Hive, Kafka, Spark, Trino and ZooKeeper. Unfortunately, version 1.0.1 has a severe performance degradation which has been fixed upstream but is not released yet. SDP release 24.11 contains a fixed version, bringing performance back to normal levels.
In SDP 24.7 we upgraded to JMX Exporter 1.0.1. Unfortunately, this caused the performance of the metrics endpoints to degrade severely. We tracked the issue down to a piece of code in the Prometheus Java Client which has since been fixed:
For our next release we need to fix this and we see these options:
Option 1: New upstream JMX Exporter
This requires a new client_java release first, which we hope to see in the week of October 14 according to a thread on Slack. When that is done we should put up a PR with jmx_exporter upgrading the client java version and then nicely ask if anyone is up for a new jmx_exporter release. In a thread on Slack one of the maintainers said that they'd like to get OpenTelemetry support in for the next release. If that is the case a release might be a bit off and I hope that we can ask nicely for a bugfix release 1.0.2 instead.
Option 2: Revert to JMX Exporter 0.20
This option is available and would be relatively easy to do (though it requires some changes, as the metrics path changed), but we'd like to avoid downgrading a dependency in case vulnerabilities are discovered in the older version.
Option 3: Build a patched JMX Exporter ourselves
We tried building a JMX Exporter with the current main branch of client_java and that fixes the performance issues, so we know that the fixes are good. We could build the exporter from source if needed.
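Whichever option we pick, it would help to re-check the scrape latency of the metrics endpoints the same way each time. Below is a minimal sketch of how that could be done; it is not the procedure used for the tests mentioned in this issue. The function names and the example URL and port are made up for illustration. The real host, port and metrics path depend on the product and exporter version.

```python
import statistics
import time
import urllib.request


def fetch_metrics(url, timeout=30.0):
    """Fetch a metrics endpoint once and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_scrapes(fetch, runs=5):
    """Call fetch() `runs` times and return each call's duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few numbers that are easy to compare."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Example usage (hypothetical URL, adjust host, port and path as needed):
#   url = "http://localhost:9505/metrics"
#   print(summarize(time_scrapes(lambda: fetch_metrics(url), runs=10)))
```

Running this against the same workload once with 1.0.1 and once with the candidate build should give directly comparable median and worst-case scrape times.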