open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.74k stars 887 forks source link

Allow to unregister/stop/destroy instruments #2232

Open mateuszrzeszutek opened 2 years ago

mateuszrzeszutek commented 2 years ago

What are you trying to achieve?

I'm currently working on the micrometer->OTel bridge instrumentation in the javaagent. Micrometer offers the possibility to remove a meter from the MeterRegistry and stop emitting whatever metrics it used to collect. For example, suppose you use a database connection pool that's instrumented with metrics - when you close/destroy the whole pool you probably want to stop collecting any metrics associated to it (because it doesn't exist anymore). This is useful for both asynchronous instruments (since once they're registered there's no way to stop them) and synchronous instruments (you can just stop using them, but the metrics SDK will still send in the last recorded value).

Additional context.

Micrometer bridge PR: open-telemetry/opentelemetry-java-instrumentation#4919

CC @jsuereth

tigrannajaryan commented 2 years ago

(Following https://github.com/open-telemetry/opentelemetry-specification/blob/main/CONTRIBUTING.md#issue-triaging process).

@jsuereth I am reassigning this to you, I believe you know the context of this better.

bogdandrutu commented 2 years ago

@mateuszrzeszutek micrometer is an API + Some implementation(s). I am not sure why that should translate into calls to our API instead of offering an implementation of micrometer that implements the MetricProducer, see https://github.com/open-telemetry/opentelemetry-java/blob/44dccd6c33c7c4cb58e1cc63f0db02b3bfec1804/opencensus-shim/src/main/java/io/opentelemetry/opencensusshim/metrics/OpenCensusMetricProducer.java#L24

/cc @jsuereth

anuraaga commented 2 years ago

@bogdandrutu Micrometer was just one example but do you think apps should not have a way to stop reporting metrics? Dropwizard metrics in Java also has registry.remove and it seems idiomatic for a library that allows registering a stateful object to always allow unregistering it. While the whole MeterProvider could be closed, metric definitions are generally completely separate from the MeterProvider lifecycle and that probably doesn't solve that case.

mateuszrzeszutek commented 2 years ago

There are advantages to calling OTel API in micrometer bridge instrumentation:

Also, micrometer aside, the unregister/remove functionality is still very much needed. The database connection pool example from my first post here is still valid.

anuraaga commented 2 years ago

Just another note about Micrometer, I expect it to be used for a very long time by Spring and likely Spring users too as a result. For example, for tracing they will use Micrometer Tracing, not the OpenTelemetry tracing API. I think the reality of the Java ecosystem means we should consider Micrometer as a first class API for Java metrics, and this seems OK to me in practice. As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

bogdandrutu commented 2 years ago

@mateuszrzeszutek

Micrometer instruments like Timer and DistributionSummary can be translated to native OTel histograms; micrometer does not have a "histogram" meter by itself, it basically exports each bucket as a separate gauge.

It is just a representation difference, it is still a Histogram, I don't see the point.

Due to the API/SDK split in the javaagent using a MetricProducer inside an instrumentation would be very painful; the SDK classes are not directly accessible in instrumentation code, they're hidden away inside the agent classloader -- we would have to bridge MetricProducer and all related classes (and do lots of copying) to make that work at all.

The micrometer instrumentation is not really an instrumentation, is actually an SDK/Producer whatever you call it, since micrometer is an API + Impl to produce metrics. Not sure if any metrics library should be seen as "instrumentation", but rather as producers of telemetry.

Just another note about Micrometer, I expect it to be used for a very long time by Spring and likely Spring users too as a result. For example, for tracing they will use Micrometer Tracing, not the OpenTelemetry tracing API. I think the reality of the Java ecosystem means we should consider Micrometer as a first class API for Java metrics, and this seems OK to me in practice.

I am not saying to not have an integration (not instrumentation) with micrometer.

As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

It is very hard to do that since their API is not well designed for that (or was not designed with that in mind).

FYI: We don't have enough tracing APIs let's create another one :)))

mateuszrzeszutek commented 2 years ago

Micrometer instruments like Timer and DistributionSummary can be translated to native OTel histograms; micrometer does not have a "histogram" meter by itself, it basically exports each bucket as a separate gauge.

It is just a representation difference, it is still a Histogram, I don't see the point.

Well that's true, conceptually it's the same. It results in different MetricData being exported though.

As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

It is very hard to do that since their API is not well designed for that (or was not designed with that in mind).

Doesn't that sort of work by default? If you're using a Counter and call counter.add(1) it'll capture Context.current() even if you don't pass the context; so if you use a bridged micrometer synchronous instrument it should get the exemplar from the current context even though micrometer API does not explicitly support this.

jmacd commented 2 years ago

I would like to re-phrase this feature request: Can we allow deleting or unregistering Callbacks? For discussion in the Tuesday Jan 25 8AM PT Spec SIG.

jmacd commented 2 years ago

There is a connected topic discussed in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example-attribute-removal-in-a-view

The question is how, if at all, should an SDK know to stop reporting a metric?

anuraaga commented 2 years ago

Just wanted to add an example that it's fairly easy to introduce memory leaks into apps without the ability to unregister callbacks. For example if there is a connection pool that is being decomissioned, possibly due to a database resharding, then it should be able to be garbage collected. But with any natural registration (no complication of weak references) of an async instrumentation, without the ability to remove the callback from OpenTelemetry this pool can never be collected.

jsuereth commented 2 years ago

To back up @anuraaga's comment, while it's not as popular, old-school Java EE servers (like tomcat) or anything that uses hot / dynamic loading of plugins need the ability to clean up their memory and usage so Java can evict their bytecode and RAM.

For @bogdandrutu I want to directly address the notion that Micrometer should use MetricProducer as is done for OpenCensus.

The only reason we use MetricProducer bridge for OpenCensus is because the APi is so divergent from OpenTelemetry that we are unable to do a direct API bridge. If we could have done direct API, we'd have preferred that one. However because OpenCensus metric model is close enough to OTel's, we're not loosing too much with the MetricProducer bridge.

I think @mateuszrzeszutek raises some great points around Metric bridges, and I want to call out a few of them:

  1. We should prefer adapting existing metric instrumentation rather than trying to replace it. Micrometer is a decent API. We're not in competition, and we can be complementary. We have no reason to force users to re-instrument if they're happy.
  2. When we interface with existing Metric APIs we should enhance. There needs to be a reason users want OTEL, and in my opinion we (will) bring a few things to the table:
    • Resource-based telemetry correlation
    • Out of the box exemplars
    • Exponential Histograms (or just histograms for those still doing gauges of percentiles / summary)

An API <-> API bridge provides the easiest path forward for instrumentation authors to accomplish that goal.

bogdandrutu commented 2 years ago

@jsuereth Nice to hear all these great things, but I want to also ensure that we are not adding unnecessary complexity to our APIs to support this, hence the "very divergent" part that you mention. The proposal says if this API has a bad thing we should support it because we prefer this type of integration.

We should prefer adapting existing metric instrumentation rather than trying to replace it. Micrometer is a decent API. We're not in competition, and we can be complementary. We have no reason to force users to re-instrument if they're happy.

A "MetricProducer bridge" will give that to the micrometer users.

When we interface with existing Metric APIs we should enhance. There needs to be a reason users want OTEL, and in my opinion we (will) bring a few things to the table:

Resource-based telemetry correlation

A "MetricProducer bridge" will give that to the micrometer users. Resource can still be attached to the bridge instance.

Out of the box exemplars Exponential Histograms (or just histograms for those still doing gauges of percentiles / summary)

Not 100% convinced that we cannot offer this, but I think we can with a bit of work.

@jsuereth it is very important to distinguish between a "instrumentation user friendly API" which is design to be used in the application instrumentation vs an "sink" API that you are doing when wrapping this API. I don't want us to become a "sink" API, if that is the goal then let's have a sink API design and not try to change the current API designed for users to instrument application to fit that goal.

Another argument is that micrometer is not an API :) sorry but it has way too much in that artifact to be considered an API.

bogdandrutu commented 2 years ago

@jsuereth also I do agree that they are good reasons to support unregister/close/etc for an instrument, but I don't think the reason should be that another API has it.

bogdandrutu commented 2 years ago

@jsuereth @mateuszrzeszutek is https://github.com/open-telemetry/opentelemetry-specification/pull/2317 resolving this? I don't have that feeling...

jmacd commented 2 years ago

To clarify my intention, I believe #2317 narrowly solves this problem stating the SDK SHOULD give applications a way to create callbacks that support being unregistered, which allows a user to stop certain asynchronous measurements from reporting without shutting down an entire MeterProvider.

The Prometheus ecosystem uses a Delete() verb to signal an intention to erase a metric from an exporter. None of the verbs Stop(), Close() or Destroy() signal the same intention to me, and in the presence of duplicate instrument registration support (i.e., #2317) I would expect stop/close/destroy to effect the single instance they were called on. I would not expect one instance's stop/close/destroy to effect other instances, and I would not expect any of these to delete the streams reported from a specific instrument instance. I would not want instrument "instances" to matter to the SDK semantically speaking.

As the description states, users can simply stop using synchronous instruments if they want to stop reporting measurements; this doesn't seem like a problem we need to help the user with, but if we did, I would suggest that stop/close/destroy simply invalidates the instrument instance so that it can no longer be used to capture measurements.

I suggest we file a separate issue or issues to discuss the following points:

Need for Delete() verb?

The Delete() verb for instruments, and whether it has a meaningful definition in an API that supports duplicate instrument registration. In Prometheus, Delete() is used with specific instruments and labels, i.e. it doesn't delete an entire Metric, it deletes a single stream. I think of this "deletion" as an exporter preference, in any case. Prometheus expects that exporters remember every current/cumulative value they've ever exported, except those for which Delete() was used.

Need for exporter memory preferences?

In a push-based exporter, whether using cumulative or delta temporality, there's a question of whether to push a value that has not changed. Since we expect cumulative data to be pushed into a Prometheus ecosystem, it seems wise to default to pushing all values, even unchanged ones.

In an exporter with a preference for deltas, it's typical to avoid reporting any delta where the delta (sum) or count (histogram) are zero. For synchronous instruments, this is something the SDK can facilitate by detecting stale streams and, after a while, forgetting them. SHOULD the SDK be required to support forgetting streams? For asynchronous instruments, if the user simply stops reporting values they will stop reporting, unless we require the SDK to detect staleness. Should the Prometheus exporter be responsible for this on its own, or should the SDK facilitate an exporter preference for "synchronous instrument memory"? This topic is connected with #2132 because if the SDK is required to detect staleness, then no additional memory is required to perform cumulative-to-delta translation.

@mateuszrzeszutek please signal whether simply the option to unregister callbacks is enough, coupled with memory options for exporters? Or are you looking for something like Prometheus' Delete()?

mateuszrzeszutek commented 2 years ago

the option to unregister callbacks is enough, coupled with memory options for exporters? Or are you looking for something like Prometheus' Delete()?

I believe that should be enough. Removing all callbacks of an async instrument effectively functions like Delete(), and if the sync instruments' metric streams were to be forgotten after the SDK had determined them to be stale it'd fulfill my requirements for this story. (And, as you pointed out, it's pretty unclear how Delete() should work for synchronous instruments anyway).

Shall I close this issue? We can open another one just for the exporter memory feature.

marcalff commented 2 years ago

Greetings,

When code can be loaded and unloaded in a dynamic plugin (for example, dlopen() in C/C++), the application needs a way to unregister asynchronous instruments on unload, to avoid having the SDK call stale callbacks.

Without this, asynchronous metrics seem impossible to use with dynamic load / unload.

Regards.

jack-berg commented 2 years ago

Each asynchronous instrument already requires support for an unregister() method. See asynchronous counter, for example.

marcalff commented 2 years ago

Thanks @jack-berg for the clarifications.

The spec wording is (emphasis mine):

For callback functions registered after an asynchronous instrument is created, the API is required to support a mechanism for unregistration.

In my understanding:

Currently the opentelemetry-cpp API does the later.

Regards

marcalff commented 1 year ago

@open-telemetry/specs-approvers

Some update on this issue, about metrics.

opentelemetry-cpp now provide a way to remove a callback for an async instrument.

While this resolves the immediate concern, crashes that happened previously when a measurement is done invoking a defunct callback function (located in a shared library that was unloaded), there are remaining issues.

First, the Meter still has definition for instruments, sync and async, which remain while not measured. Second, the meter provider still has references to meters that remain while not producing any measurements at all.

The problem here is that exporters still export data for these defunct meters / metrics, as tested with the OTLP HTTP exporter.

(Edit 2023-06-12: To analyse in detail)

It seems the opentelemetry specification lacks the following:

A spec is needed for languages (like opentelemetry-cpp) to implement it.

This is a major concern, as there is no way in practice to use Metrics for OpenTelemetry in an environment where libraries are loaded and unloaded dynamically, when these libraries provide their own instrumentation.

alxbl commented 1 year ago

I'd like to add to this issue in favor of having a cleanup mechanism in the spec:

We perform monitoring of IoT devices which can be added/removed from our systems at any time by operators and would also benefit from being able to stop reporting metrics when those IoT devices are being decommissioned.

Currently, the .NET implementation of asynchronous counter/gauges supports removing metric points from a specific metric, but this causes the last value emitted for that metric point to be re-exported ad-infinitum (or just become a stale entry in the metric point list when using delta export mode) by the OTEL SDK implementation.

Another thing that would be interesting to specify is stale markers when a time series is known to be "finished".

e.g. Prometheus Remote Write also defines stale markers that can tell Prometheus that a time series is known to be stale and immediately marks it as stale so that it no longer shows up. I think it would be nice to have a specified way to do something similar via OTLP.

I would be willing to spend some time working on this if there is interest.

Thanks for the hard work on OTEL, by the way!

EDIT: I had the wrong link to stale markers in Prometheus

mtwo commented 3 months ago

We're elevating this and we have closed #3985 in favour of this issue