prometheus / client_java

Prometheus instrumentation library for JVM applications
http://prometheus.github.io/client_java/
Apache License 2.0
2.18k stars 797 forks source link

Same metric name, different set of label names #696

Open shakuzen opened 3 years ago

shakuzen commented 3 years ago

We have a lot of Micrometer users requesting that we support registering metrics with the same name but a different set of labels (https://github.com/micrometer-metrics/micrometer/issues/877) for Prometheus. This repo, the official prometheus java client, does not allow this using the built-in types, as can be demonstrated by the following unit test:

@Test
void promClientSameMetricNameDifferentLabels() {
    CollectorRegistry registry = new CollectorRegistry();
    Counter counter = Counter.build().name("api_response").labelNames("outcome").help(" ")
            .register(registry);
    Counter counter2 = Counter.build().name("api_response").labelNames("outcome", "error").help(" ")
            .register(registry); // fails with "Collector already registered that provides name: api_response_total"
}

It is interesting to note, however, that the Prometheus server will scrape the following without any seeming issue.

# HELP responses
# TYPE app_responses_total counter
app_responses_total{uri="/", outcome="SUCCESS"} 10
app_responses_total{uri="/", outcome="FAILURE", error="INVALID_REQUEST"} 2
app_responses_total{uri="/", outcome="FAILURE", error="DATA_NOT_FOUND"} 3

We have a custom Collector implementation for converting from Micrometer metrics to Prometheus metrics. We have received a pull request in Micrometer with changes that would allow Micrometer to produce a scape endpoint like above, circumventing the exception the Prometheus java client would otherwise throw when registering a metric with a different set of labels. See https://github.com/micrometer-metrics/micrometer/pull/2653.

I am opening an issue here because I would rather we did not do something in Micrometer that is intentionally guarded against in the official java client for Prometheus. I am not sure of all the implications supporting something like this might have on consumers. Is preventing different labels still an intentional choice here and are there plans for this library to support that? Would you recommend we avoid allowing it as well in Micrometer? Why does the Prometheus server scrape such output without error if it is considered an error in the client? I appreciate any advice we can get on this from the Prometheus maintainers.

brian-brazil commented 3 years ago

https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels is relevant here.

fstab commented 3 years ago

My understanding is that the Micrometer API already allows users to have different label names for the same metric, but currently this leads to internal exceptions when you use Micrometer in combination with the Prometheus registry. So the question is: Would Prometheus support break if you merge the PR to fully support the existing Micrometer API.

I think nothing would break if you merged the PR.

In Prometheus, missing labels are treated in the same way as empty labels, so the Prometheus server will treat

app_responses_total{uri="/", outcome="SUCCESS"} 10
app_responses_total{uri="/", outcome="FAILURE", error="INVALID_REQUEST"} 2
app_responses_total{uri="/", outcome="FAILURE", error="DATA_NOT_FOUND"} 3

the same way as

app_responses_total{uri="/", outcome="SUCCESS", error=""} 10
app_responses_total{uri="/", outcome="FAILURE", error="INVALID_REQUEST"} 2
app_responses_total{uri="/", outcome="FAILURE", error="DATA_NOT_FOUND"} 3

In the Prometheus Java client library there's a separation between defining label names (you do this only once, and it will create a "metric family") and observing a concrete event (this will create or update a "sample" within the "metric family" with the observed label values). Creating label names on the fly would be hard to implement with our current architecture, and for the reasons Brian linked above it might not be a good idea anyway. That's why the Prometheus Java client library does not support this for simple collectors.

However, given that Micrometer already has an API supporting this and the question is whether or not you can merge this PR to avoid internal exceptions with the Micrometer Prometheus registry, I'd say this PR should not break Micrometer's Prometheus support.

@brian-brazil please correct me if this is wrong and if I'm overlooking any implications.

brian-brazil commented 3 years ago

I don't think anything will technically break, however this is strongly discouraged. Metrics with varying label names are very difficult to work with for users and not in line with the guidelines for how to use labels. If label names vary, then you should be using different metric names, as they are fundamentally different metrics.

Think of it this way, if you suddenly pull in a library that has a clashing metric name rather than having an obvious failure at startup indicating that the name clash needs fixing instead your graphs/alerts will silently break.

Keep in mind that a given metric name should almost always exist only inside one file/class - independent of whether they have different label names. You should also know all label names a-priori for metrics instrumentation.

For the original issue of https://github.com/micrometer-metrics/micrometer/issues/877, I'd recommend that both these metrics should both be renamed to include client/partition as appropriate. If the PR was merged, then rather than correctly detecting the misuse you'd end up double counting, and making things harder for users to work with.

nhartner commented 3 years ago

This label restriction is some Prometheus dogma that needs to be reevaluated. Consider how many people are asking for a feature that they would find valuable and which technically is already possible. The main reason clients like the Java client don't support it appears to be documentation dogma.

stolsvik commented 3 years ago

Just want to chime in that based on the Prometheus querying basics page, the situation of "missing tags" (which is the result of having different subsets of tags on metrics with the same name) seems to be explicitly handled: "Label matchers that match empty label values also select all time series that do not have the specific label set at all.", as also commented by @fstab above.

To give context, I have a situation where I do have a metric (effectively "time to send message") that are used in two different settings, but whose semantic meaning and measured values are identical. The two settings are either in "initiation", or "inside a stage". The set of tags/labels for those two settings will be different, but I would like to use the same name, as what is measured is identical. Aggregating across these two settings makes perfect sense - but it also makes sense to be able to select either the "initiations" or "inside a stage". I thus feel that using the same name, but with different tags, is the correct solution here.

stolsvik commented 3 years ago

@brian-brazil Think of it this way, if you suddenly pull in a library that has a clashing metric name rather than having an obvious failure at startup indicating that the name clash needs fixing instead your graphs/alerts will silently break.

Well, the current alternative then is to not use the other library, isn't it?

Currently, the "solution" for this problem in micrometer is to just ignore the second registration, which is unfortunate along multiple axes - but it is pretty much forced to do that to not break exactly that usage scenario.

In addition, if those two libraries do use the same name (which I actually would find strange, I myself always add some kind of "library-identifier" element to such metrics), it should still be possible to distinguish them by their differing use of labels. This must be a fact, as otherwise - if they used the same labels too - the java client would actually allow the registration.

nberkley-gp commented 3 months ago

The idea that Prometheus discourages this seems a little absurd, to be honest. Their own internally generated metrics don't even honor this convention. For instance each series in up will have the labels of its corresponding scrape target.

It's all well and good to suggest that people design metrics with consistent labeling in mind, but enforcing it by arbitrary deletion of inconsistent metrics from reporting seems misguided.