open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 889 forks source link

Support for synchronous Gauge and an UpDownCounter that can be Set() #2318

Closed jmacd closed 1 year ago

jmacd commented 2 years ago

What are you trying to achieve?

There are long-standing requests to support a settable Gauge, as this is an API both Prometheus and Statsd support.

In the OpenTelemetry data model, the same request happens for an UpDownCounter that can be set. Whereas an asynchronous UpDownCounter can be set from a callback, users are asking for an API to set a Gauge or cumulative UpDownCounter from synchronous code (and, for that matter, a cumulative Counter)

The OpenTelemetry data model says exactly what kind of point should be produced for these events (Gauges and cumulative Sums), so we know what the API should be and what the data should look like, but still there's been an objection. Let's focus on Gauge, since it has a longer history.

You can argue that by making the OTel Gauge instrument only support asynchronous use from a callback, we are improving the efficiency of a monitoring solution by preventing the user from making extraneous observations.

The counter-argument is that you're making the user's life more difficult, particularly in situations where they are holding a value and do not want to defer it until a callback. See, e.g., #2054.

One of the objections to having a synchronous Gauge is that depending on the collection interval, there may or may not be gaps in the series. However, the data model discusses gap handling, and the resulting series with gaps is well-defined. The result, some will argue, is working as intended.

The big distinction in terms of implementation between the synchronous and asynchronous instruments inside the SDK is whether an aggregator will see many updates per interval or is expected to see one update per interval. Our specification is not very strong on what should happen when multiple observations are made for a single instrument and attributes during a collection interval. The result is currently unspecified.

Proposal

This proposal would be easy to pull off in OTel-Go because we are leaning toward using similar APIs for synchronous and asynchronous instruments, i.e., in OTel-Go an asynchronous instrument is just like a synchronous instrument that you only call from a registered callback.

To add the requested APIs:

  1. Allow registering "background asynchronous" Counter, UpDownCounter, and Gauge instruments. These have the same API as the asynchronous instruments, i.e., Observe(value) for setting cumulative and current values but would be used outside of registered callbacks
  2. Enforce that at most 1 value is observed per collection per aggregator. We have to prevent the natural merge function from happening if more than one observation is received between collections, otherwise this does not provide the requested functionality. This would resolve the unspecified behavior of the current specification for duplicate observations in favor of a strict last-value-wins policy.

The result would be as expected. If the user does not update a synchronous gauge during a collection interval, missing data (i.e., a gap). If the user updates once or more between collection intervals, the last value will be observed.

Overall, this seems like a small extension to the current specification, does not create any new instruments or APIs or data model, and gives users what they've been asking for.

emilgelman commented 2 years ago

Hey @jmacd,

I appreciate your effort on this matter. I was wondering if there was any progress made regarding this issue?

I'll try to provide some additional context. We are using the OpenTelemetry Go SDK to instrument our services. In one of our services, we would like to measure the duration of a specific operation. We have the value during the execution of a specific synchronous function - so we can't use an async instrument, and we do not wish to use traces to achieve the same behavior as our backend has different billing and functionality provided for traces.

Our current workaround is to use a custom in memory map (to achieve last value aggregation) and an asynchronous gauge to report those map values, and then flush it every collect interval. Having to implement a custom data structure just to support a simple metric type (which exists in practically any other metric stack) is both cumbersome and not intuitive at all.

I think that the argument of not implementing this as it improves the efficiency of monitoring is somewhat inconsistent with the whole approach of OpenTelemetry. Is there anything else that can be done to make this happen?

Aaronontheweb commented 2 years ago

Related since I asked about this in the dotnet implementation repo https://github.com/open-telemetry/opentelemetry-dotnet/issues/2362#issuecomment-1077756891

Our use case is that we have millions of actors whose mailbox depth we would like to periodically report on - but only when the actor is actually processing messages. Having an observable gauge for each of those would be non-performant for our use cases - right now we're trying to use histograms and it's a bit of a hack.

jmacd commented 2 years ago

From CNCF Slack we have a use-case described:

if the CRD is present then it is denoted by 1 as the metric value and if it's not (if it's deleted) then it is denoted by 0 ...so everytime I want to set the metric value to 0 (and observe it) whenever I delete the CRD.

In this case, the application as built does not keep persistent state, it just observes changes to the current value of a cumulative sum or gauge quantity.

bogdandrutu commented 2 years ago

@Aaronontheweb why the current "UpDownCounter" is not good? Curios what exactly is the metric about the actors that you want to report.

bogdandrutu commented 2 years ago

if the CRD is present then it is denoted by 1 as the metric value and if it's not (if it's deleted) then it is denoted by 0 ...so everytime I want to set the metric value to 0 (and observe it) whenever I delete the CRD.

@jmacd now that I see better this example, can that be an UpDownCounter where +1 on start and -1 on end? I know it is a bit of an abuse of the API, but want to understand that if this is good as a short term until we get more proof that we need sync Gauge.

Aaronontheweb commented 2 years ago

@Aaronontheweb why the current "UpDownCounter" is not good? Curios what exactly is the metric about the actors that you want to report.

Because I'm measuring the point-in-time size of the queue depth at any given time, rather than its rate of change - I can measure the rate at which actors process messages with a SumCounter and I can measure per-message processing latency with a histogram, but I really need a gauge to measure the backlog of the queue - and rather than create an ObservableGauge for hundreds of thousands / millions of small queues it'd be better to share a single Gauge as a bounded metric that is differentiated by its attributes (which are low cardinality) - same as we do for the counter and histogram types.

bogdandrutu commented 2 years ago

@Aaronontheweb "UpDownCounter" is not for "rate". The "UpDownCounter" is as a name says a counter (sum) that supports positive and negative values, so your backlog will do "+1" when element is added and "-1" when is removed. And you get the right value.

Aaronontheweb commented 2 years ago

I can't track when a value is added to it - I can only check it when the actor is running.

tomasfreund commented 2 years ago

I have a similar issue - when processing messages from e.g. Azure EventHub, I have the offset and enqueued time of all messages in the batch and I get the last offset and last enqueued time from EventHub every time I fetch a new batch. Ideally, I'd use this to calculate consumer lag (both in messages and seconds) and record the value in a gauge. I can't use an UpDownCounter because I don't want to add to the value, I need to set the value, it comes from an external system. I also don't want to use an asynchronous gauge because that would require me to write additional code just to handle fetching the value on demand and I would also need to do extra I/O every time metrics are collected.

There are many other cases when you might need a synchronous gauge. An example might be an IoT device where some module periodically reports some values to the main control unit. This main control unit would like to record these values somewhere the moment it gets them. At this point there is no way to record gauge metrics this way. Sure, you could store them in local variables and then use an async gauge to collect the values on metrics collection but it seems very convoluted. Of course an async gauge is useful for many things, it's just that I don't understand why it would be a good idea to only have an async version.

Aaronontheweb commented 2 years ago

Any updates on this as a consideration?

rlinehan commented 2 years ago

Would really, really like to see this supported.

Our use case: we run a set of tests against our service via an AWS Lambda every minute. We want to emit a gauge indicating whether the tests succeeded or not (basically a binary "1" or "0"). The code is much more complicated with an async gauge than a synchronous gauge since we just want to issue it once and right at the end of the test run, and because we're in a Lambda and need to make sure it emits while the Lambda is running.

jmacd commented 2 years ago

This is a top priority (IMO) and I will begin working on the (very straightforward) proposal in the coming few weeks.

jmacd commented 2 years ago

^^^ Here is the prototype I am working with,.

ty-elastic commented 2 years ago

Hi, I'd love to see support for synchronous Gauges as well! The lack of sync gauges can make it difficult to interface with existing libraries which push metrics to your app at specific intervals which they define. In this example (https://github.com/ty-elastic/videojs_observability), I am pushed metrics which I then copy into a global map. From there, I have to coax otel into turning around and reading/reporting those metrics (some as gauges) asynchronously. At best, the timing is wrong (the data is timestamped at the time of async reading rather than then time when they are pushed to me). A better approach would allow one to write gauge values synchronously with a proper timestamp; the last value is then reported when the exporter gathers metrics.

jack-berg commented 2 years ago

A better approach would allow one to write gauge values synchronously with a proper timestamp

I've been assuming that the timestamps reported for synchronous gauges would be the time of collection, not the time the instrument was most recently "set". A deviation from this would be inconsistent with other instruments / aggregations.

ty-elastic commented 2 years ago

I've been assuming that the timestamps reported for synchronous gauges would be the time of collection, not the time the instrument was most recently "set". A deviation from this would be inconsistent with other instruments / aggregations.

interesting.. that is a reasonable assumption for sync counters, but for sync gauges, it seems like ideally you would assert the timestamp and value together to ensure that, from an observability perspective, you can say "oh - this gauge was at this water mark at this specific time" rather than "this gauge was at this water mark sometime in between this collection interval". for most apps, it probably matters not. for "real-time" apps (e.g., a video compression engine), that difference could be confusing.

zbindenren commented 1 year ago

Any news on this? I think a lot of people need this.

gaeljw commented 1 year ago

Just a quick comment to describe a use-case of mines: we have an application that runs a scheduled task to count inconsistencies in our data. This computation might be long enough to not want to do it when the instrument is observed but only do it on a schedule that the application itself defines and is responsible of.

We used to use Prometheus gauges.

With OpenTelemetry, this is not possible with gauges. Technically it would be possible to use a counter but from a "functional" point of view, we are not computing delta values, we are computing a "total value" at a scheduled rate.

Edit: I guess that the thing that makes more sense in my case is to store the value on my end somehow (in memory) and read it async via the async gauge.

nickb937 commented 1 year ago

I have a few applications that do data processing and I have a desire to expose some of the internal variables as metrics around the internal health of the system. The metrics aren't time-series as such, they are samples of a value of a variable at a point-in-time that I desire to emit to the monitoring system for observation. For example, several batches of CSV files come in each day and I may want to observe the number of lines read from those files.

The Gauge datatype looks like the correct type for publishing internal metrics because each sample is distinct from the previous sample.

However, there's no way to publish a Gauge value directly (synchronously), like this:

file_lines = open("/tmp/datafile.csv", "r").readlines()
metric_lines = meter.create_gauge("line-count")
metric_lines.set_value(len(file_lines))

It's unclear how I would go about emitting the value of a variable through an ObservableGauge without first publishing the variable into a global variable for sampling with the callback, which isn't a realistic solution:

LINE_COUNT = 0

def observable_gauge_func(options):
    global LINE_COUNT
    yield Observation(LINE_COUNT, {})

file_lines = open("/tmp/datafile.csv", "r").readlines()
meter.create_observable_gauge("line-count", [observable_gauge_func])
LINE_COUNT = len(file_lines)
Woodrow1026 commented 1 year ago

We also need such functions very much,

In our scenario, there are a large number of test cases to be executed, but due to performance reasons, they are only run once a day (configurable). We use the histogram to get whether each case is successful (0, 1 binary)

But now, there are some use cases that need to be returned. We want to care about their return values. At this time, histogram is no longer appropriate, because some values in the bucket will be lost. Asynchronous dashboards are also not suitable for us because we execute use cases through scheduling. The operation of UpDownCounter is strange.

benpoliquin commented 1 year ago

+1 on the need for this synchronous gauge type.

ascrookes commented 1 year ago

I could also benefit from a synchronous gauge type. Like others in this thread, I want to use it to measure queue depth and unique users in the queue across time. I'm using the Python library.

Do let me know if there is another metrics instrument that could be used for this but I didn't see one that would be a better fit.

In the meantime, I have a simple Python class that wraps the asynchronous gauge so I can set point in time values without using the callback logic in my application code. It only reports out on the interval you have set on the metrics exported. That means if I set a value, and then set another value before the initial value has been exported, it'll get dropped. There is probably a way to fix that but for my use case that trade-off is fine.

from typing import Iterable

from opentelemetry import metrics

# Alias to make it easier to access the Observations class.
# This takes in a single value (float or int) and an optional dictionary
# of attributes.
# https://opentelemetry-python.readthedocs.io/en/latest/api/metrics.html#opentelemetry.metrics.Observation
MetricsGaugeObservation = metrics.Observation

class MetricsGauge:
    def __init__(self, name: str, meter: metrics.Meter):
        self._name = name
        self._meter = meter
        self._internal_gauge = None
        self._observations = []

    def set_observations(self, observations: Iterable[MetricsGaugeObservation]):
        self._observations = observations
        # The async gauge starts reporting as soon as it is created. Create
        # the async gauge once a real value has been set to avoid creating noise
        # prior to values getting set.
        if self._internal_gauge is None:
            self._internal_gauge = self._meter.create_observable_gauge(
                name=self._name, callbacks=[self._gauge_callback]
            )

    def _gauge_callback(self, _options):
        return self._observations

# In application code

meter = get_meter(__name__)
gauge = MetricsGauge("gauge.name", meter)
# When the gauge value changes
new_value, new_attributes = func_to_get_gauge_value_and_attributes()

# This value takes an interable so that you can set multiple values and attributes
# for the same timestamp.
gauge.set_observations([MetricsGaugeObservation(new_value, new_attributes)])
GlassonGlassoff commented 1 year ago

@jmacd Did you ever get any traction on the proposal you mentioned in July?

The lack of sync gauges is causing a massive headache for us in a service where state isn't saved, so there isn't a great way to handle callbacks.

lsytj0413 commented 1 year ago

any update for this proposal?

triblondon commented 1 year ago

Here's my polyfill for JavaScript:

class OTelGauge {
  private currentValue: number;
  private gauge: Observable;
  constructor(meter: Meter, name: string, initValue?: number) {
    this.currentValue = initValue;
    this.gauge = meter.createObservableGauge(name);
    this.gauge.addCallback((result) => result.observe(this.currentValue));
  }
  observeValue(newValue: number) {
    this.currentValue = newValue;
  }
}
tommaybe commented 1 year ago

For queue tracking purposes this one is also really essential for us.

MatisseHack commented 1 year ago

Having a Set method on UpDownCounters would be really useful for my team! We maintain an event driven service and would like to track various metrics related to each event. At the moment we've resorted to exposing a LastEvent property on some of our classes that can be accessed by ObservableUpDownCounters, but this approach isn't nearly as elegant as using a Set method directly when handling the event.

jaelrod commented 1 year ago

Synchronous version of gauge would be helpful for our use case as well, as we have a need for emitting metrics with negative values by absolute value and not incrementation, and currently the only support for negative values is asynchronous gauge/up-down-counter or synchronous up-down-counter.

agunglotto commented 1 year ago

I've ported an application from Prometheus Client to OpenCensus. Both have synchroneous gauge implementation. So the switch wasn't that hard. Now, OpenCensus is no longer supported starting from July and recommends using OpenTelementry. And here is only an asynchroneus gauge. It's really hard to understand why the developers decided "randomly" for sync/async model in OpenTelemetry. And then to see this request open for one year...

jack-berg commented 1 year ago

FYI, I've built a prototype java implementation of this here and am planning on opening a spec PR with a change proposal.

bogdandrutu commented 1 year ago

@jmacd very nice written argument for how to implement Gauge, but I feel that there are not enough examples where the async/observer pattern does not work. The link issue regarding async up down is very weak, unable to understand the argument why that needs to expose a state...

agunglotto commented 10 months ago

There are commits for Python, C++ and .Net - do you plan to provide it for other languages / SDKs like Golang etc.?

freemansoft commented 8 months ago

My bad. Saw that this was picked up in https://github.com/open-telemetry/opentelemetry-specification/pull/3540

amedveshchek commented 8 months ago

I found that the synchronous Gauge metric is already implemented for Golang, but in low-level code, which is still available for the usage -- they simply haven't implemented the convenience methods around that.

Here's the official code of telemetrygen tool, which submits the Gauge metrics by default: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/telemetrygen/internal/metrics/worker.go#L52-L64

So I used this code and worked it around for now. Will replace it with the original synchronous Gauge once it's implemented in the official lib for golang.