Prometheus Exporter should add InstrumentationScope labels

gouthamve commented 1 year ago

Component(s)

exporter/prometheusremotewrite

What happened?

Description

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/6e47b92777a005e5df421ac25ca4f7becbbe3a3d/pkg/translator/prometheusremotewrite/metrics_to_prw.go#L52

The above TODO made me think what we should be doing about InstrumentationScope, and @Aneurysm9 mentioned that we should copy the name and attributes from InstrumentationScope into the metrics. Otherwise we're likely going to end up with duplicate metrics.

cc @kovrus @dashpole

Collector version

main

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9

See Adding Labels via Comments if you do not have permissions to add labels yourself.

kovrus commented 1 year ago

@gouthamve @Aneurysm9 The specification for the instrumentation scope look ok as for me. Shall we probably implement it, do you have any objections or comments?

gouthamve commented 1 year ago

Ah, I missed that part of the Spec.

If we implement it that way, it means the way to group by scope attributes is to join on otel_scope_name, otel_scope_version, job, instance.

Now this brings me to wonder, in the same process can there be two scopes with the same otel_scope_name, otel_scope_version but with different attributes? If yes, then the joins won't work anymore and the data is quite inaccessible.

gouthamve commented 1 year ago

https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope

If the unit of code has a version then the instrumentation scope is defined by the (name,version) pair otherwise the version is omitted and only the name is used. The name or (name,version) pair uniquely identify the logical unit of the code that emits the telemetry.

This kinda says that it should be unique per process, however I wonder how it will be for libraries. I might be misunderstanding, but here is an example:

If we have a memcache library with scope.name=io.cache.memcached. But if we instantiate the library twice (once for read cache, and once for auth cache), does the scope change? Wouldn't we add a scope attribute called: cache={read_cache,auth_cache} when instantiating the library to differentiate?

Aneurysm9 commented 1 year ago

If we have a memcache library with scope.name=io.cache.memcached. But if we instantiate the library twice (once for read cache, and once for auth cache), does the scope change? Wouldn't we add a scope attribute called: cache={read_cache,auth_cache} when instantiating the library to differentiate?

I would expect those attributes to be added to instruments or measurements taken by the instrumentation, not to the instrumentation scope itself. My understanding is that instrumentation scope attributes should relate to the instrumentation itself and generally be known/knowable at compile time:

The instrumentation scope may have zero or more additional attributes that provide additional information about the scope. For example for a scope that specifies an instrumentation library an additional attribute may be recorded to denote the URL of the repository URL the library’s source code is stored. Since the scope is a build-time concept the attributes of the scope cannot change at runtime.

I believe it is also presently undefined whether instrumentation scope attributes are identifying for metrics or not, so we may need to treat them as not identifying (i.e., not as labels on metrics but only on the otel_scope_info metric) for now.

@gouthamve @Aneurysm9 The specification for the instrumentation scope look ok as for me. Shall we probably implement it, do you have any objections or comments?

Thanks for this, I thought that we had already done something like this but I was conflating it with target_info and didn't have access to the spec when I was discussing it with @gouthamve. I think proceeding with implementing this spec for otel_scope_info is the logical next step here.

gouthamve commented 1 year ago

That helps a lot @Aneurysm9. With this context, I think it makes sense to go ahead with otel_scope_info.

nahk-ivanov commented 1 year ago

We actually have this problem of not being able to distinguish one instrument from another. What's your thinking in regards to otel_scope_name, where is it going to show up?

The way Open Telemetry operates is essentially a hierarchy of Meter -> Instrument (counter, histogram, etc.). As the example given on that page: you will have Meter named io.opentelemetry.contrib.mongodb.client and histogram named client.duration under that meter. From this perspective, would it make sense to simply consider Meter name to be an equivalent of the Prometheus namespace and prepend to the metric name, for the final result of io_opentelemetry_contrib_mongodb_client_client_duration? Essentially, if I have Mongo and S3 accessed in the same application, I will have two separate metrics, rather than filtering on the label for a single metric. Is this how it's supposed to be natively in Prometheus, or am I not getting the purpose of Prometheus namespace? I guess we will be violating "single word" convention, though. 🤷

The reason I'm asking is because linked TODO suggests exporting them as labels/dimensions on the metric, rather than being part of the name. 🤔 I think the problem is that meaning of this "scope" is overloaded in Open Telemetry - it's the name of the instrumentation library (which sounds like an attribute/label), name of the meter (which sounds like a namespace), and just the name of some abstract "scope" at the same time.

marcalff commented 1 year ago

Related issue:

https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21471

blakeroberts-wk commented 1 year ago

The scope information should not be prepended to the metric name. Some instrumentation library names include version information that would make queries used in dashboards/alerts brittle. Examples of instrumentation library names including version information are: io.opentelemetry.tomcat-10.0, and io.opentelemetry.servlet-5.0. And to be clear, these versions are not the library versions. Both of these libraries, for example, may share the version 1.26.0-alpha.

I believe it is also presently undefined whether instrumentation scope attributes are identifying for metrics or not, so we may need to treat them as not identifying (i.e., not as labels on metrics but only on the otel_scope_info metric) for now.

It may not be explicitly defined, but it seems implicitly defined. The instrumentation scope attributes are required to identify the appropriate time series. If you create two instruments with the same name but different instrumentation names, then you will produce two values per report. Without the instrumentation library name and version, cumulative metric backends will fail to uniquely identify the time series to which the values pertain. For example, Mimir will produce the error, the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp).

It seems that the only viable option is to add the instrumentation library and version as labels on the produced metric. Unless I am missing something?

nahk-ivanov commented 1 year ago

Some instrumentation library names include version information that would make queries used in dashboards/alerts brittle. Examples of instrumentation library names including version information are: io.opentelemetry.tomcat-10.0, and io.opentelemetry.servlet-5.0. And to be clear, these versions are not the library versions.

I'd argue that it is done on purpose to actually be able to have them separate on dashboards. For instance, servlet-5.0 meter can have different instruments under it, compared to servlet-6.0 meter, so "one chart to rule them all" is not possible, unless you manually merge two different instruments from two meters together. I would not be so aggressive discarding the idea of associating meter name with Prometheus namespace, I think they do align.

blakeroberts-wk commented 1 year ago

I'd argue that it is done on purpose to actually be able to have them separate on dashboards.

This functionality is still possible if the information exists as a label.

so "one chart to rule them all" is not possible, unless you manually merge two different instruments from two meters together.

Exactly. If I was a service owner, I wouldn't want to have to update my dashboards (or recording rules and alerts) to merge metrics any time I updated a version of my auto instrumentation. This would be especially troublesome for alerts. Now any version update of my telemetry requires a migration story to ensure zero downtime during a rollout of what could be a completely innocuous change.

nahk-ivanov commented 1 year ago

Hm... I don't follow. You said that version in the name of the meter is not the version of the meter, but rather a version of the component being instrumented (i.e. Tomcat server v10 - io.opentelemetry.tomcat-10.0). This would make sense, given that instruments for different versions of the software simply would be different (in a generalized case), hence will be reported under different meters. But as long as you continue using Tomcat v10 you would not need to update your "dashboards (or recording rules and alerts)", because meter name will stay the same. Only the meter version will change, and meter version will be in the labels for exactly this purpose - to not require you to update the "dashboards (or recording rules and alerts)" while you are still using Tomcat v10, no matter what version of instrumentation library you are using, because meter name will stay the same - tomcat-v10.0.

Basically, io.opentelemetry.tomcat-10.0 and io.opentelemetry.tomcat-11.0 are just different meters by design, like io.opentelemetry.tomcat-10.0 and io.opentelemetry.iis-7.0 are, and I think from this perspective it's reasonable to treat meter name as Prometheus namespace.

From Prometheus documentation:

For metrics specific to an application, the prefix is usually the application name itself.

So tomcat10 might as well be the Prometheus namespace, if Tomcat version 10 is what's being instrumented. OpenTelemetry does, however, come with semantic conventions, so for application servers that serve HTTP traffic you are likely to use http prefix, but if it is an application that doesn't have semantic association (or a metric that is specific to a given application server), then you would use the application name as a prefix, as stated in Prometheus docs. Given that different versions of application emit different metrics (because implementation is different) - you may include application version in the prefix. For example, tomcat10_middleware_X_duration may not exist under tomcat11 meter, because version 11 replaces the middleware stack and there is now middleware_Y used instead.

To add insult to injury, I think some (if not all) standard OpenTelemetry instrumentation libraries for application servers currently prepend all instrument names with http 🤦, so the argument can be made that Prometheus prefix (application name) is expected to be part of OpenTelemetry instrument name, although that's not the case for client instrumentations, where their example shows client.duration instrument name and mongodb being specified in the meter name (io.opentelemetry.contrib.mongodb.client). 🤷

Finally, as to the

This functionality is still possible if the information exists as a label.

Yes, this is mostly logical structure that we need to decide on. Why does prefix exist in Prometheus, if it could just be a label and it "is still possible if the information exists as a label"? I don't think there is anything utterly wrong with meter names being in labels, I'm just not sure, if it will be too pleasant to see one single client.duration metric that I have to filter with scope_name~="MongoDb", scope_name~="MySQL", etc.). I think we can balance it out between metric labels and names. For instance, if Prometheus suggests that client_duration metric should have mongodb prefix, and in OpenTelemetry MongoDb is specified in the meter name (instead of an instrument name), then why wouldn't meter name be considered an equivalent of Prometheus namespace?

blakeroberts-wk commented 1 year ago

The full quote from the document you have referenced is as follows:

A metric name ... ... should have a (single-word) application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries. Examples:

prometheus_notifications_total (specific to the Prometheus server)

process_cpu_seconds_total (exported by many client libraries)

http_request_duration_seconds (for all HTTP requests)

https://prometheus.io/docs/practices/naming/

In the case of Tomcat or any other HTTP server, the instrumentation library that produces the instruments used to record things like latency are not "specific to an application". This is a generic, standardized metric exported by many different client libraries used by many different applications. Neither OpenTelemetry nor Prometheus recommend prepending scope information to the metric name.

nahk-ivanov commented 1 year ago

Okay, let me break it down this way, maybe it will help to understand the point I'm trying to convey.

In your mind, how would a metric that measures the duration of the requests to mongodb storage from the client side called in Prometheus? client_duration, or mongodb_client_duration, or something else?

blakeroberts-wk commented 1 year ago

I'm not sure the answer to this question is relevant to the discussion. In general, I would suggest following OpenTelemetry semantic conventions as closely as possible when defining a new metric name. In this case, there is not a convention for database client duration. This would need to be discussed with a larger audience, but I imagine the name would be db.client.duration in OpenTelemetry and db_client_duration in Prometheus. I come to this suggestion based on the following documentation:

A point could be made that a database request duration is slightly more ambiguous than an HTTP request duration. And based on OpenTelemetry semantic conventions, this may require that the metric name need be db.mongodb.client.duration:

Semantic ambiguity SHOULD be avoided. Use prefixed metric names in cases where similar metrics have significantly different implementations across the breadth of all existing metrics. For example, every garbage collected runtime has slightly different strategies and measures. Using a single set of metric names for GC, not divided by the runtime, could create dissimilar comparisons and confusion for end users. (For example, prefer process.runtime.java.gc over process.runtime.gc..) Measures of many operating system metrics are similarly ambiguous.

But regardless of either name, mongodb would not be the instrumentation library name or version. Either of these may include version information that should not appear in the metric name.

nahk-ivanov commented 1 year ago

But regardless of either name, mongodb would not be the instrumentation library name

And this is where it breaks - that's where it's suggested to be per OpenTelemetry metrics specs:

+-- Meter(name='io.opentelemetry.contrib.mongodb.client', version='2.3.0')
...
    |
    +-- Instrument<Histogram, double>(name='client.duration', attributes=['server.address', 'server.port'], unit='ms')
    |

What you call "instrumentation library name" (and "scope") is a meter name, per spec.

I do think it's a problem of rather vague/incomplete conventions and confusing terminology, but if we were to agree that we want to see mongodb in the metric name, and in OpenTelemetry mongodb would appear in the meter name (per example in the metrics specs), rather than in the instrument name, then we would want to merge that into a final metric name, me thinks. Yes, there is additional collateral in the meter name, but there are pros and cons to everything. 🤷

I guess what I'm hearing is that we want to standardize on Meter name being used solely for the "instrumentation library name" and instrument names to be prefixed with instrumented components (where implementation of the said metric (subjectively) "differs" between different components of the same purpose). This would work, we just need to make sure all instrumentation libraries follow this "convention" and we don't give people "bad" examples in the specs. :(

dashpole commented 1 year ago

I do think it's a problem of rather vague/incomplete conventions and confusing terminology, but if we were to agree that we want to see mongodb in the metric name, and in OpenTelemetry mongodb would appear in the meter name (per example in the metrics specs), rather than in the instrument name, then we would want to merge that into a final metric name, me thinks. Yes, there is additional collateral in the meter name, but there are pros and cons to everything.

We explored adding a prefix to metrics as an option in https://github.com/open-telemetry/opentelemetry-specification/pull/2703, and decided to use labels to identify the library producing the metrics instead.

brichins commented 1 year ago

To add more context, it seems that the instrumentation scope name adds more context/details to the service. For example for the OTel Java Agent, a simple Tomcat 9 server hosting the Spring petclinic application has the following scopes that are added to all metrics, helping the operator/developer know which instrumentation library was responsible for the metric:

io.opentelemetry.hikaricp-3.0
io.opentelemetry.jmx
io.opentelemetry.runtime-telemetry-java8
io.opentelemetry.tomcat-7.0

Adding the scope name would also have the prometheus exporter align more closely with the prometheus metrics exposed by the OTel Java agent. For example For that Tomcat/Petclinic application, here's one metric for the number of active sessions.

OTel Java Agent

# TYPE http_server_tomcat_sessions_activeSessions gauge
# HELP http_server_tomcat_sessions_activeSessions The number of active sessions
http_server_tomcat_sessions_activeSessions_ratio{otel_scope_name="io.opentelemetry.jmx",context="/petclinic"} 0.0 1692139219782

OTel Collector with Prometheus exporter

# HELP http_server_tomcat_sessions_activeSessions The number of active sessions
# TYPE http_server_tomcat_sessions_activeSessions gauge
http_server_tomcat_sessions_activeSessions{context="/petclinic",host_name="bddb31813b2f",job="petclinic",service_name="petclinic"} 0

The could be cases where certain scopes would not be desirable on certain dashboards. In my opinion having the option to optionally add the scopes would be useful.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 11 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

jmichalek132 commented 11 months ago

still needed, afaik is also being worked on in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/24248

open-telemetry / opentelemetry-collector-contrib