stackabletech / issues

This repository is only for issues that concern multiple repositories or don't fit into any specific repository

JMX Exporter Fix #649

Closed lfrancke closed 1 week ago

lfrancke commented 1 month ago

In SDP 24.7 we upgraded to JMX Exporter 1.0.1. Unfortunately, this caused the performance of the metrics endpoints to degrade severely. We tracked the issue down to a piece of code in the Prometheus Java Client which has since been fixed:

For our next release we need to fix this and we see these options:

Option 1: New upstream JMX Exporter

This requires a new client_java release first, which we hope to see in the week of October 14 according to a thread on Slack. Once that is out, we should open a PR against jmx_exporter that upgrades the client_java version and then ask nicely whether anyone is up for a new jmx_exporter release. In a thread on Slack, one of the maintainers said that they'd like to get OpenTelemetry support in for the next release. If that is the case, a release might be a bit off, and I hope that we can ask nicely for a bugfix release 1.0.2 instead.

Option 2: Revert to JMX Exporter 0.20

This would be relatively easy to do (although it requires some changes because the metrics path changed), but we'd like to avoid downgrading a dependency in case vulnerabilities are discovered in the old version.

Option 3: Build a patched JMX Exporter ourselves

We tried building a JMX Exporter with the current main branch of client_java and that fixes the performance issues so we know that the fixes are good. We could build the exporter from source if needed.
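
A before/after comparison like this could be scripted roughly as follows. This is only a minimal sketch, not the procedure used in this issue: the helper names (`fetch_metrics`, `time_scrapes`, `summarize`) and the endpoint URL in the comment are hypothetical, and the actual host, port and path depend on the product and the exporter version.

```python
import statistics
import time
import urllib.request
from typing import Callable, Dict, List


def fetch_metrics(url: str, timeout: float = 30.0) -> bytes:
    """Fetch one scrape from a Prometheus-style metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_scrapes(scrape: Callable[[], bytes], runs: int = 10) -> List[float]:
    """Call `scrape` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        scrape()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of scrape durations to a few comparable numbers."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Example usage (hypothetical URL, adjust to the deployment under test):
#   url = "http://localhost:8080/metrics"
#   print(summarize(time_scrapes(lambda: fetch_metrics(url))))
```

Running the same script against a pod with the released exporter and one with the patched build would give two summaries to compare. The median is reported rather than the mean so that a single slow scrape does not dominate the result.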

NickLarsenNZ commented 3 weeks ago

@lfrancke what is the cutoff date for this decision? IMO, it is ok to wait until after the on-site, but it would be good to have a rough date in mind, and also a list of things that should be tested (so we can continue testing everything else and trust that if this comes in, the necessary things will be re-tested).

From memory we were waiting on https://github.com/prometheus/jmx_exporter/pull/995 (edit: ah yeah I see you linked that).

lfrancke commented 3 weeks ago

The maintainer is actively working on this. I hope we'll have a release "soon".

Can we do it the other way around? Tell me what the latest date is you'd accept this change and I'll make sure that we have a solution ready by then.

NickLarsenNZ commented 3 weeks ago

> Can we do it the other way around? Tell me what the latest date is you'd accept this change and I'll make sure that we have a solution ready by then.

Sure thing, how about a week before release (CoB on Friday, 8th November)? I think by then it wouldn't be too much effort to add an extra suite of tests no matter which way this goes. We can also extend that somewhat if we feel like we have capacity.

lfrancke commented 2 weeks ago

I just tested the latest/current main branch of JMX Exporter and can confirm that it fixes the performance issue.

So, if we don't get a release in time we can build one ourselves and use a patched version.

Let's wait a few days longer.

lfrancke commented 1 week ago

Fixed in:

lfrancke commented 4 days ago

Release Notes

In SDP 24.7 we upgraded JMX Exporter from 0.20 to 1.0.1. This is the tool that allows us to expose JMX as Prometheus metrics, and it is in use for Hadoop, HBase, Hive, Kafka, Spark, Trino and ZooKeeper. Unfortunately, version 1.0.1 suffers from a severe performance degradation, which has been fixed upstream but is not released yet. SDP release 24.11 contains a fixed version, bringing performance back to normal levels.