open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 889 forks source link

Clarification needed on how values for attributes filtered by a view are used #2905

Closed MrAlias closed 9 months ago

MrAlias commented 2 years ago

Currently the specification states:

The SDK MUST provide the means to register Views with a MeterProvider. Here are the inputs: ...

  • The configuration for the resulting metrics stream: ...
    • A list of attribute keys (optional). If provided, the attributes that are not in the list will be ignored. If not provided, all the attribute keys will be used by default (TODO: once the Hint API is available, the default behavior should respect the Hint if it is available).

For asynchronous counter instruments (counter and up-down-counter), how should their sums be reported when this "filter" is defined?

For example, if there is an asynchronous counter that observes 1 for the attribute set version=1 and then, in the same callback, 2 for the attribute set version=2 and a list of attribute keys equaling {"foo"} is used in a view, both attribute sets (version=1 and version=2) should be ignored according to the specification. Therefore, they both become observations of the empty attribute set. How should their sums be combined? Should the SDK report 2, the last recorded value for the empty attribute set? Or, should it report 3 the combination?

I had assumed the latter would be expected, but when looking at the python implementation they use the former. This is based on the limit_num_of_attrs.py example:

import random
import time
from typing import Iterable

from opentelemetry.metrics import (
    CallbackOptions,
    Observation,
    get_meter_provider,
    set_meter_provider,
)
from opentelemetry.sdk.metrics import MeterProvider, ObservableGauge
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

view_with_attributes_limit = View(
    instrument_type=ObservableGauge,
    instrument_name="observable_gauge",
    attribute_keys={"foo"},
)

exporter = ConsoleMetricExporter()

reader = PeriodicExportingMetricReader(exporter, export_interval_millis=1_000)
provider = MeterProvider(
    metric_readers=[
        reader,
    ],
    views=[
        view_with_attributes_limit,
    ],
)
set_meter_provider(provider)

meter = get_meter_provider().get_meter("reduce-cardinality-with-view", "0.1.2")

def observable_gauge_func(options: CallbackOptions) -> Iterable[Observation]:
    yield Observation(1, {"version": 1})
    yield Observation(2, {"version": 2})

# Async gauge
observable_gauge = meter.create_observable_gauge(
    "observable_gauge",
    [observable_gauge_func],
)

while 1:
    time.sleep(1)

(caveat, my python understanding is limited so I could have missed something here)

When running this:

$ python limit_num_of_attrs.py
{"resource_metrics": [{"resource": {"attributes": {"telemetry.sdk.language": "python", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.12.0", "service.name": "unknown_service"}, "schema_url": ""}, "scope_metrics": [{"scope": {"name": "reduce-cardinality-with-view", "version": "0.1.2", "schema_url": ""}, "metrics": [{"name": "observable_gauge", "description": "", "unit": "", "data": {"data_points": [{"attributes": {}, "start_time_unix_nano": 0, "time_unix_nano": 1666898388887952217, "value": 2}]}}], "schema_url": ""}], "schema_url": ""}]}
{"resource_metrics": [{"resource": {"attributes": {"telemetry.sdk.language": "python", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.12.0", "service.name": "unknown_service"}, "schema_url": ""}, "scope_metrics": [{"scope": {"name": "reduce-cardinality-with-view", "version": "0.1.2", "schema_url": ""}, "metrics": [{"name": "observable_gauge", "description": "", "unit": "", "data": {"data_points": [{"attributes": {}, "start_time_unix_nano": 0, "time_unix_nano": 1666898389956287449, "value": 2}]}}], "schema_url": ""}], "schema_url": ""}]}
#...

The value reported is 2.

cc @jmacd @bogdandrutu @reyang @open-telemetry/python-approvers

MrAlias commented 2 years ago

I would be interested to hear how Java, .NET, and other stable implementation behave here. Unfortunately, my ability to write examples there are non-existent.

reyang commented 2 years ago

The Python implementation sounds buggy to me.

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

MrAlias commented 2 years ago

@reyang can you comment on how .NET handles this? Does that project add them together?

cijothomas commented 2 years ago

https://github.com/open-telemetry/opentelemetry-specification/issues/1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

cijothomas commented 2 years ago

1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

https://github.com/open-telemetry/opentelemetry-dotnet/blob/73f8d3cb0160f633661854f41866dcdb70b81069/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L800-L803 UnitTest showing the same bug in .NET. (I looked at it at that time and it was not trivial to fix, so just added test to come back to this later.)

MrAlias commented 2 years ago

1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

Gotcha, thanks for confirming :pray:

MrAlias commented 2 years ago

@jack-berg can you confirm the behavior of async counters in Java in the presence of an "attribute reducing" view? I'm wondering if there is any implementation that sums these values.

jmacd commented 2 years ago

This is a good question. The Lightstep metrics SDK and the v0.31 and earlier OTel-Go metrics SDK did sum these instruments. As a case where this matters, potentially, is the Golang runtime/metrics package, which outputs cumulative Counter and UpDownCounter values in a form that is natural for use with asynchronous instruments.

For example, runtime/metrics outputs three metrics, one is a total of the other two:

/gc/cycles/automatic:gc-cycles
    Count of completed GC cycles generated by the Go runtime.

/gc/cycles/forced:gc-cycles
    Count of completed GC cycles forced by the application.

/gc/cycles/total:gc-cycles
    Count of all completed GC cycles.

In the instrumentation package I wrote (forked from the contrib repo), https://github.com/lightstep/otel-launcher-go/tree/main/lightstep/instrumentation/runtime, the pattern is used to discard the total, since the SDK or a downstream consumer can easily recompute the total using attribute removal and the natural merge function for the data point. So in the example above, a cumulative GC count for class=forced and class=automatic will be generated. If you remove class, you get the total back. There are three groups of metrics in the Go-1.20 runtime/metrics package that follow this pattern. For Go-1.20, we get hierarchical CPU timing measurements, so you could filter in more than one meaningful way.

reyang commented 2 years ago

A good way to think about this problem: whether the dimension is removed from the SDK (e.g. by configuring View), or it is removed by the Collector (by reaggregation), or it is removed at the backend/comsumption (via PromQL), the results should be the same.

MrAlias commented 2 years ago

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

lalitb commented 2 years ago

otel-cpp implementation does the sum for the sync counters matching the "attribute reducing" view, but selects the last value for the async counter. Need to be fixed, will create a issue to track this.

reyang commented 2 years ago

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Which list?

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

I think no. Additive property applies to Sums, and is covered by the SDK specification https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation. It is perfectly fine for the SDK user to configure an UpDownCounter as something different (e.g. gauge) and not apply additive property.

MrAlias commented 2 years ago

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Which list?

The list included in the sentence at the end of the linked section:

In OpenTelemetry, each Instrument implies whether it is additive or not. For example, Counter and UpDownCounter are additive while Asynchronous Gauge is non-additive.

MrAlias commented 2 years ago

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

I think no. Additive property applies to Sums, and is covered by the SDK specification https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation.

Where is it covered? I don't see anywhere in that linked section defining sums as additive.

It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined. i.e.

each Instrument implies whether it is additive or not

reyang commented 2 years ago

The list included in the sentence at the end of the linked section:

In OpenTelemetry, each Instrument implies whether it is additive or not. For example, Counter and UpDownCounter are additive while Asynchronous Gauge is non-additive.

I think yes it can, not necessarily as it is only providing examples (without promising that it will be a complete list).

reyang commented 2 years ago

Where is it covered? I don't see anywhere in that linked section defining sums as additive.

It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined.

Yeah, I can see the challenge here.

The data model specification covered "additive" in the temporality section, and temporality is referenced by the sums section. I wrote the supplementary guideline after discussing with @jmacd, hoping it will provide some help for folks who struggle to understand the data model and API specification. Maybe we should improve the data model spec to bring more clarity by making it more streamlined.

MrAlias commented 2 years ago

Just to get a sanity check: if we don't currently have normative requirements for this in the specification and stable releases of implementations don't do this, does OTel really require this behavior? Is it just considered undefined currently?

Can we add it in a backward-compatible way? Will having implementations change their behavior be backwards-compatible?

MrAlias commented 2 years ago

Where is it covered? I don't see anywhere in that linked section defining sums as additive. It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined.

Yeah, I can see the challenge here.

The data model specification covered "additive" in the temporality section, and temporality is referenced by the sums section. I wrote the supplementary guideline after discussing with @jmacd, hoping it will provide some help for folks who struggle to understand the data model and API specification. Maybe we should improve the data model spec to bring more clarity by making it more streamlined.

Gotcha. Any help clarifying this in the specification would be appreciated :pray:

reyang commented 2 years ago

Just to get a sanity check: if we don't currently have normative requirements for this in the specification and stable releases of implementations don't do this, does OTel really require this behavior? Is it just considered undefined currently?

Can we add it in a backward-compatible way? Will having implementations change their behavior be backwards-compatible?

I think the spec does require it, although it is not expressed in a normative way. Here is my understanding of the complex mental network:

  1. One has to first read the API specification to understand Asynchronous Counter.
  2. Then go to the SDK specification to understand that the default aggregation is Sum.
  3. Then go to the Data Model specification to understand temporality and realize that Sum means additive.
  4. Or by chance discovered the supplementary guideline and realized how additive property should be handled after reading the examples.

Given all these steps, 90% chance folks will run into a trap unless they asked someone who has been deeply engaged in all the three specs...

jack-berg commented 2 years ago

Java doesn't do this correctly either. Opened open-telemetry/opentelemetry-java#4901 to track.

MadVikingGod commented 2 years ago

I want to try and clarify the workings of the async instruments and their default aggregation.

In the case that was raised here, there is a filter and two observes will record to the same filtered counter, we expect the result is a sum of both observations.

What about other similar cases:

  1. If there is no filter, and two observers happen in separate callbacks?
  2. If there is no filter and two observers happen in the same callback?
  3. If there is a filter, and the same attribute sets are used? E.g. case 1, but there is a filter.
  4. What happens if there is a filter and the view says this should be a gauge?

Would it be a good idea to put other a set of edge cases, and the expected output?

reyang commented 2 years ago

the workings of the async instruments and their default aggregation.

In the case that was raised here, there is a filter and two observes will record to the same filtered counter, we expect the result is a sum of both observations.

What about other similar cases:

  1. If there is no filter, and two observers happen in separate callbacks?
  2. If there is no filter and two observers happen in the same callback?
  3. If there is a filter, and the same attribute sets are used? E.g. case 1, but there is a filter.
  4. What happens if there is a filter and the view says this should be a gauge?

@MadVikingGod could you give some concrete examples? I found that the cases you've described can be interpreted differently. Ideally something like this would help to provide clarity https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example.

If folks feel this area needs more examples to facilitate the understanding, I can add a section under https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#additive-property to explain the spatial aggregation.

MadVikingGod commented 2 years ago

That is the gist of what I was asking about, just more tightly focused on edge cases we have historically found to be non-obvious. Like the behavior in this issue.

If we were to ask "What should be the result in these 4 (+ the original) cases?" I think we would see different interpretations. While I think the supplementary guidelines are useful, they are verbose enough that they can only cover a small set of behaviors. I was thinking more along the lines of a document that easily be translated into tests to help with the validation of these interactions of features. This issue might look like this: https://gist.github.com/MadVikingGod/e3e65d6217d321d5de8ded159f4abc7e

reyang commented 2 years ago

That is the gist of what I was asking about, just more tightly focused on edge cases we have historically found to be non-obvious. Like the behavior in this issue.

If we were to ask "What should be the result in these 4 (+ the original) cases?" I think we would see different interpretations. While I think the supplementary guidelines are useful, they are verbose enough that they can only cover a small set of behaviors. I was thinking more along the lines of a document that easily be translated into tests to help with the validation of these interactions of features. This issue might look like this: https://gist.github.com/MadVikingGod/e3e65d6217d321d5de8ded159f4abc7e

The gist reminded of me https://github.com/w3c/trace-context/tree/main/test, which is a big commitment to introduce and maintain. I feel it is probably a separate topic on its own.

tsloughter commented 2 years ago

Yes, I've wanted something like the tracecontext interop tests for tracing and metrics so bad. I even started on it at one point... Probably need a SIG for it :)