Background
As part of the work in client_java 1.x.x to integrate with OpenMetrics/OpenTelemetry, metric naming rules are enforced.
Kafka appears to create metrics with suffixes that violate the OpenMetrics/OpenTelemetry specification.
Stacktrace
An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: 'kafka_server_socketservermetrics_failed_handshake_total': Illegal metric name. The metric name must not include the '_total' suffix. Call PrometheusNaming.sanitizeMetricName(name) to avoid this error.
at io.prometheus.metrics.model.snapshots.MetricMetadata.validate(MetricMetadata.java:106)
at io.prometheus.metrics.model.snapshots.MetricMetadata.<init>(MetricMetadata.java:63)
at io.prometheus.metrics.model.snapshots.MetricSnapshot$Builder.buildMetadata(MetricSnapshot.java:80)
at io.prometheus.metrics.model.snapshots.GaugeSnapshot$Builder.build(GaugeSnapshot.java:129)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:832)
at io.prometheus.metrics.model.registry.MultiCollector.collect(MultiCollector.java:26)
at io.prometheus.metrics.model.registry.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.scrape(PrometheusScrapeHandler.java:112)
at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.handleRequest(PrometheusScrapeHandler.java:53)
at io.prometheus.metrics.exporter.httpserver.MetricsHandler.handle(MetricsHandler.java:43)
at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:98)
at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:851)
at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:818)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Sanitizing the name could result in duplicate metrics, violating metric exposition constraints:
sanitized(foo_total) = foo
sanitized(foo) = foo
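A minimal sketch of that collision, using the PrometheusNaming helper named in the exception message (assumes client_java 1.x on the classpath; the class name SuffixCollisionExample is illustrative):

import io.prometheus.metrics.model.snapshots.PrometheusNaming;

public class SuffixCollisionExample {
    public static void main(String[] args) {
        // Per the example above, stripping the reserved "_total" suffix makes
        // two distinct source names indistinguishable after sanitization.
        System.out.println(PrometheusNaming.sanitizeMetricName("foo_total")); // foo
        System.out.println(PrometheusNaming.sanitizeMetricName("foo"));       // foo
    }
}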
Debugging shows two similar MBeans registered with different names.
MBean 1 has a dash (-) between topics and partitions...
MBean 2 has a period (.) between topics and partitions.
Both will get sanitized to the same name, with the dashes (-) and periods (.) replaced with underscores (_), resulting in duplicate metrics.
254201 [FINE] io.prometheus.jmx.JmxScraper kafka.rest{type=jersey-metrics}v3.topics-partitions-reassignment.list.response-rate scrape: 0.0
254202 [FINE] io.prometheus.jmx.JmxCollector add metric sample: kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate [] [] 0.0
254203 [FINE] io.prometheus.jmx.JmxCollector matchedRule.name [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate] sanitizedName [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate]
254573 [FINE] io.prometheus.jmx.JmxScraper kafka.rest{type=jersey-metrics}v3.topics.partitions-reassignment.list.response-rate scrape: 0.0
254574 [FINE] io.prometheus.jmx.JmxCollector add metric sample: kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate [] [] 0.0
254575 [FINE] io.prometheus.jmx.JmxCollector matchedRule.name [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate] sanitizedName [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate]
Debugging shows that ... converts . and - (unsafe characters per the Prometheus metrics specification) to _, which results in duplicate metric names for two different MBeans.
In these scenarios, the first MBean's metric would be used. Subsequent MBean metrics (with the same name) would be ignored.
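A minimal sketch of the character replacement that produces the collision (the regex and method name here are illustrative, not the exact jmx_exporter implementation):

public class CharReplacementCollision {
    // Replaces characters outside the Prometheus metric name alphabet with '_'.
    static String replaceUnsafeChars(String name) {
        return name.replaceAll("[^a-zA-Z0-9_:]", "_");
    }

    public static void main(String[] args) {
        // Both print v3_topics_partitions_reassignment_list_response_rate
        System.out.println(replaceUnsafeChars("v3.topics-partitions-reassignment.list.response-rate"));
        System.out.println(replaceUnsafeChars("v3.topics.partitions-reassignment.list.response-rate"));
    }
}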
Thanks a lot @dhoard for analyzing this!!!
I guess adding a label like mbean_name in case there is a name conflict is the best way to solve this. Something like this:
v3_topics_partitions{mbean_name="v3.topics-partitions"} ...
v3_topics_partitions{mbean_name="v3.topics.partitions"} ...
If we add that label only in case of name collisions it should have minimal impact.
However, this will break if the metrics have different types, so we should set the type UNKNOWN for all metrics with name collisions.
I'm wondering what the previous versions of jmx_exporter did. My guess is metrics were just overwriting each other, so it was basically random which value you got.
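A minimal sketch of how the collision-only labeling proposed above could be tracked, using plain Java collections (class and method names are illustrative, not an actual jmx_exporter or client_java API):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CollisionTracker {
    // For each sanitized metric name, remember the label sets already emitted.
    private final Map<String, Set<Map<String, String>>> seen = new HashMap<>();

    // Returns true if (sanitizedName, labels) was already emitted, i.e. the
    // caller should disambiguate, e.g. by adding an mbean_name label.
    boolean isCollision(String sanitizedName, Map<String, String> labels) {
        return !seen.computeIfAbsent(sanitizedName, key -> new HashSet<>()).add(labels);
    }
}

With such a tracker, the mbean_name label would only be added for the second and later MBeans that map to an already-used (name, labels) pair.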
While the example uses UNKNOWN metric types (because of Kafka), the issue can occur for UNTYPED (treated as UNKNOWN), COUNTER, and GAUGE metrics.
I'm wondering what the previous versions of jmx_exporter did. My guess is metrics were just overwriting each other, so it was basically random which value you got.
@fstab Correct. The previous jmx_exporter versions would use the first MBean and drop the metrics for subsequent MBeans whose sanitized names matched, resulting in missed metrics for some MBeans.
~When an MBean has tabular data with no keys, tracking the MBean name alone isn't enough to guarantee uniqueness.~ - JUnit test issue
@fstab I have code working on my branch (https://github.com/dhoard/jmx_exporter/tree/issue-958) by adding an _x_id_ label to every non-JVM metric.
I have gone down this path because...
# HELP kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate The average number of HTTP responses per second. kafka.rest:name=null,type=jersey-metrics,attribute=v3.topics-partitions-reassignment.list.response-rate
# TYPE kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate untyped
kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate{} 0.0
kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate{_x_id_="kafka.rest:type=jersey-metrics,attribute=v3.topics.partitions-reassignment.list.response-rate"} 0.0
Adding the label only when there is a collision would require keeping a Map of previously encountered metric names and performing a lookup when processing every MBean attribute value to check for a duplicate. (The ...Snapshot.Builder classes have no computeIfAbsent method to check for a duplicate datapoint.)
My branch has been updated to include a label _x_id_ on all non-JVM metrics that contains a Murmur3 hash of ObjectName + ",attribute=" + attribute name.
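A minimal sketch of how such a hash-based label value could be computed; this uses Guava's Murmur3 implementation purely for illustration, and the branch's actual implementation (and hash width) may differ:

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class XIdLabelValue {
    // Hashes "<ObjectName>,attribute=<attributeName>" to a stable, compact value.
    static String xId(String objectName, String attributeName) {
        String input = objectName + ",attribute=" + attributeName;
        return Hashing.murmur3_128().hashString(input, StandardCharsets.UTF_8).toString();
    }

    public static void main(String[] args) {
        System.out.println(
                xId("kafka.rest:type=jersey-metrics",
                        "v3.topics-partitions-reassignment.list.response-rate"));
    }
}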
Analysis
With the configuration:
rules:
- pattern: ".*"
... the scrape output size increases by ~140KB (an 11.5% increase).
The transfer time for 140KB on a 1Gbps network is ~1.1 milliseconds.
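As a quick sanity check of that figure (assuming 1 KB = 1024 bytes and ignoring protocol overhead):

$$140 \times 1024 \times 8 \approx 1.15 \times 10^{6}\ \text{bits}, \qquad \frac{1.15 \times 10^{6}\ \text{bits}}{10^{9}\ \text{bits/s}} \approx 1.15\ \text{ms}$$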
Calculating a Murmur3 hash is performant and the probability of a hash collision is practically 0.
P ≈ 1 − e^(−1/2147483648)
e^(−1/2147483648) ≈ 1
P ≈ 1 − 1
P ≈ 0
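For context, this appears to be an instance of the standard birthday-collision approximation for n values hashed into a space of size N (N = 2^32 for a 32-bit Murmur3 hash); the specific n behind the numbers above is not stated:

$$P(\text{collision}) \approx 1 - e^{-\frac{n(n-1)}{2N}}$$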
Please let me know when your branch is ready to review!
Can we add the extra label only to the metrics that have a name collision, rather than to all metrics? I think name collisions are rare, so most metrics won't need that label.
Possibly... a couple of options...
A
Change the client_java ...Snapshot.Builder classes to throw a DuplicateLabelsException when adding a ...DataPointSnapshot value with colliding labels. This breaks the builder pattern (a hypothetical sketch of such eager validation follows after option B).
B
The JMX exporter code would need to keep maps of the ...DataPointSnapshot.Builder objects independently of the ...Snapshot.Builder objects.
When building the MetricSnapshots, all of the logic to detect duplicates and add an additional label for uniqueness would have to be implemented in the JMX Exporter code. (Collect everything in isolation... merge in the end.)
This is further complicated because none of the builder classes implement hashCode() / equals(Object other), so they can't be used as Map keys.
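For option A above, a hypothetical sketch (not the actual client_java API) of what eager duplicate-label validation in a builder could look like:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EagerValidatingCounterBuilder {
    // Hypothetical sketch only: tracks the label sets already added so that a
    // colliding data point fails at dataPoint() time instead of at build() time.
    private final Set<Map<String, String>> seenLabels = new HashSet<>();

    public EagerValidatingCounterBuilder dataPoint(Map<String, String> labels, double value) {
        if (!seenLabels.add(labels)) {
            // client_java would throw DuplicateLabelsException here; a standard
            // exception is used so the sketch compiles on its own.
            throw new IllegalArgumentException("Duplicate labels: " + labels);
        }
        // ... record the data point ...
        return this;
    }
}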
Thoughts?
I am confused. Is the problem conflicting metric names, or conflicting labels?
Let me try to clarify...
Test MBean
package io.prometheus.jmx;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
public interface CollidingMBean {
int getXid();
}
class Colliding implements CollidingMBean {
private final int value;
public Colliding(int value) {
this.value = value;
}
@Override
public int getXid() {
return value;
}
static void registerBean(MBeanServer mbs) throws JMException {
ObjectName objectName =
new ObjectName("io.prometheus.jmx.test:type=Colliding-Bean");
Colliding mBean = new Colliding(123);
mbs.registerMBean(mBean, objectName);
objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding.Bean");
mBean = new Colliding(321);
mbs.registerMBean(mBean, objectName);
objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding_Bean");
mBean = new Colliding(246);
mbs.registerMBean(mBean, objectName);
objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding|Bean");
mBean = new Colliding(246);
mbs.registerMBean(mBean, objectName);
}
}
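For completeness, a minimal sketch of registering these test MBeans on the platform MBeanServer (the main class is illustrative and must live in the same package, since registerBean is package-private):

package io.prometheus.jmx;

import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;

public class CollidingExample {
    public static void main(String[] args) throws JMException, InterruptedException {
        MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
        // Registers the four MBeans whose ObjectNames sanitize to the same metric name.
        Colliding.registerBean(mBeanServer);
        // Keep the JVM alive so an attached exporter agent can scrape the MBeans.
        Thread.currentThread().join();
    }
}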
Test Scenario
Register multiple independent instances of the MBean with different ObjectNames...
io.prometheus.jmx.test:type=Colliding-Bean
io.prometheus.jmx.test:type=Colliding.Bean
io.prometheus.jmx.test:type=Colliding_Bean
io.prometheus.jmx.test:type=Colliding|Bean
Exporter code
During scraping, the exporter "sanitizes" the name to match the Prometheus/OpenMetrics metric name specification and appends the attribute name.
sanitized(io.prometheus.jmx.test:type=Colliding-Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io.prometheus.jmx.test:type=Colliding.Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io.prometheus.jmx.test:type=Colliding_Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io.prometheus.jmx.test:type=Colliding|Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
Because the "sanitized" name is the same, these independent MBean are merged into a single metric with multiple datapoint snapshots, each having the same labels.
switch (matchedRule.type) {
case "COUNTER":
{
CounterSnapshot.Builder counterBuilder =
countersMap.computeIfAbsent(
sanitizedName,
key ->
CounterSnapshot.builder()
.name(sanitizedName)
.help(finalMatchedRule.help));
counterBuilder.dataPoint(
CounterSnapshot.CounterDataPointSnapshot.builder()
.labels(labels)
.value(value.doubleValue())
.build());
break;
}
case "GAUGE":
{
GaugeSnapshot.Builder gaugeBuilder =
gaugeMap.computeIfAbsent(
sanitizedName,
key ->
GaugeSnapshot.builder()
.name(sanitizedName)
.help(finalMatchedRule.help));
gaugeBuilder.dataPoint(
GaugeSnapshot.GaugeDataPointSnapshot.builder()
.labels(labels)
.value(value.doubleValue())
.build());
break;
}
case "UNKNOWN":
case "UNTYPED":
default:
{
UnknownSnapshot.Builder unknownBuilder =
unknownMap.computeIfAbsent(
sanitizedName,
key ->
UnknownSnapshot.builder()
.name(sanitizedName)
.help(finalMatchedRule.help));
unknownBuilder.dataPoint(
UnknownSnapshot.UnknownDataPointSnapshot.builder()
.labels(labels)
.value(value.doubleValue())
.build());
break;
}
}
When the code to build the MetricSnapshots is executed...
for (CounterSnapshot.Builder counter : receiver.countersMap.values()) {
result.metricSnapshot(counter.build());
}
for (GaugeSnapshot.Builder gauge : receiver.gaugeMap.values()) {
result.metricSnapshot(gauge.build());
}
for (UnknownSnapshot.Builder unknown : receiver.unknownMap.values()) {
result.metricSnapshot(unknown.build());
}
... we experience the DuplicateLabelsException
Initial Approach
My initial approach was to keep a map of "sanitized" name to ...Snapshot.Builder and a separate map of "sanitized" name to a list of ...DataPointSnapshot.Builders.
This doesn't work because the ...DataPointSnapshot.Builders have no methods to modify their contents.
Map<String, CounterSnapshot.Builder> snapshotBuilderMap = new HashMap<>();
Map<String, List<CounterSnapshot.CounterDataPointSnapshot.Builder>> snapshotDataPointBuilderMap = new HashMap<>();
for (Map.Entry<String, CounterSnapshot.Builder> entry : snapshotBuilderMap.entrySet()) {
String key = entry.getKey();
CounterSnapshot.Builder builder = entry.getValue();
List<CounterSnapshot.CounterDataPointSnapshot.Builder> list = snapshotDataPointBuilderMap.get(key);
for (CounterSnapshot.CounterDataPointSnapshot.Builder dataPointBuilder : list) {
try {
builder.dataPoint(dataPointBuilder.build());
} catch (DuplicateLabelsException e) {
// CounterSnapshot.CounterDataPointSnapshot.Builder doesn't
// have a method to add a label to the existing labels
// and doesn't have a method to get the current labels
}
}
}
I have been trying to find a solution without requiring a client_java change/release.
Thanks a lot for the example, that was helpful.
I implemented a proposal for how to solve this in #960. What do you think?
Resolved via https://github.com/prometheus/jmx_exporter/pull/960 (post 1.0.0 release).
Collection fails for Kafka using release 1.0.0. Investigation is underway.