open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Metrics exporting without a trace regression in 1.28.0 #4250

Open xrmx opened 1 day ago

xrmx commented 1 day ago

Describe your environment

OS: (e.g., Ubuntu)
Python version: Python 3.10
Package version: 1.28.0

What happened?

It looks like there is a regression in metrics exporting.

Steps to Reproduce

Please note that this happens when sending data to a collector; serializing to the console works fine:

touch empty.py
opentelemetry-instrument python empty.py

Expected Result

Metrics exported

Actual Result

  File "/metricsregresions/lib/python3.10/site-packages/opentelemetry/exporter/otlp/proto/common/_internal/metrics_encoder/__init__.py", line 189, in encode_metrics
    _encode_resource_metrics(resource_metrics, resource_metrics_dict)
  File "/metricsregresions/lib/python3.10/site-packages/opentelemetry/exporter/otlp/proto/common/_internal/metrics_encoder/__init__.py", line 240, in _encode_resource_metrics
    raise EncodingException(ex, metric) from None
opentelemetry.exporter.otlp.proto.common._internal.metrics_encoder.EncodingException: Metric(name='process.runtime.cpython.memory', description='Runtime cpython memory', unit='By', data=Sum(data_points=[NumberDataPoint(attributes={'type': 'rss'}, start_time_unix_nano=1730892722110841083, time_unix_nano=1730892722111199546, value=44474368, exemplars=[Exemplar(filtered_attributes={}, value=44474368, time_unix_nano=1730892722110782542, span_id=None, trace_id=None)]), NumberDataPoint(attributes={'type': 'vms'}, start_time_unix_nano=1730892722110867031, time_unix_nano=1730892722111199546, value=1724026880, exemplars=[Exemplar(filtered_attributes={}, value=1724026880, time_unix_nano=1730892722110862362, span_id=None, trace_id=None)])], aggregation_temporality=<AggregationTemporality.CUMULATIVE: 2>, is_monotonic=False))
'NoneType' object has no attribute 'to_bytes'
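The `AttributeError` comes from the encoder calling `int.to_bytes()` on the exemplar's `span_id`/`trace_id`, which are `None` when no span was recorded. A minimal sketch of the failure mode in plain Python (outside the SDK):

```python
# The OTLP encoder converts the exemplar's span_id/trace_id integers to
# bytes; when no span was active they are None and the call fails.
span_id = None
try:
    span_id.to_bytes(8, "big")
except AttributeError as exc:
    print(exc)  # → 'NoneType' object has no attribute 'to_bytes'
```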

Additional context

No response

Would you like to implement a fix?

Yes but I need to dig

xrmx commented 1 day ago

Looking at the console export the only None I have are from the exemplars, e.g.:

                                        "exemplars": [
                                            {
                                                "filtered_attributes": {},
                                                "value": 24,
                                                "time_unix_nano": 1730893479734229622,
                                                "span_id": null,
                                                "trace_id": null
                                            }
                                        ]
xrmx commented 1 day ago

I guess this would be the fastest workaround; the proper fix would be to export the Exemplar without the span_id and trace_id fields when they are not available:

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
index 1dcbfe47..0e3e4016 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
@@ -104,8 +104,10 @@ class ExemplarBucket:
             span_context = span.get_span_context()
             self.__span_id = span_context.span_id
             self.__trace_id = span_context.trace_id
-
-        self.__offered = True
+            self.__offered = True
+        else:
+            # we cannot serialize invalid spans so stop offering them
+            self.__offered = False

     def collect(self, point_attributes: Attributes) -> Optional[Exemplar]:
         """May return an Exemplar and resets the bucket for the next sampling period."""
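The "proper" fix mentioned above (omitting unset ids, as done in PR #4178) could be sketched roughly like this; `encode_exemplar_trace_context` is a hypothetical helper for illustration, not the actual encoder API:

```python
def encode_exemplar_trace_context(span_id, trace_id):
    # Hypothetical helper: include the trace context fields in the
    # encoded exemplar only when they are actually set, instead of
    # unconditionally calling .to_bytes() on a possible None.
    fields = {}
    if span_id is not None:
        fields["span_id"] = span_id.to_bytes(8, "big")
    if trace_id is not None:
        fields["trace_id"] = trace_id.to_bytes(16, "big")
    return fields

print(encode_exemplar_trace_context(None, None))  # → {}
```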
xrmx commented 1 day ago

@fcollonval What do you think?

aabmass commented 1 day ago

I guess this would be the fastest workaround, the proper one would be to export Exemplar without these span_id and trace_id if they are not available instead:

Is this happening because the tracing SDK hasn't been set up? I would expect to drop the exemplar in this case, but I need to dig into the spec.

fcollonval commented 1 day ago

@xrmx yes the trace_id and span_id should not be exported if unset. This is actually fixed in https://github.com/open-telemetry/opentelemetry-python/pull/4178/files#diff-69723411a82d22ec56b486c8c744df22b6a4f245d68f057e17263d04d17b3dbe. Unfortunately that PR was not reviewed in time for the latest release.

xrmx commented 1 day ago

@fcollonval thanks. Next time you are aware of bugs, please share them with us; it's not so evident that they are fixed when the PRs start with "Add" rather than "Fix" :sweat_smile:

fcollonval commented 1 day ago

Sure - I'll do that.

aabmass commented 1 day ago

@fcollonval shouldn't the SDK be dropping the exemplar before it even gets to the exporter? https://opentelemetry.io/docs/specs/otel/metrics/sdk/#tracebased

fcollonval commented 1 day ago

@fcollonval shouldn't the SDK be dropping the exemplar before it even gets to the exporter? https://opentelemetry.io/docs/specs/otel/metrics/sdk/#tracebased

Indeed @aabmass, you are right: the default exemplar filter should not offer a measurement as an exemplar if the current span is not sampled:

https://github.com/open-telemetry/opentelemetry-python/blob/06809f4e6a7084f7c79a1fd9e974fad9641a6b86/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_filter.py#L131-L134
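The linked `TraceBasedExemplarFilter` check boils down to "offer only when the current span is sampled". A stdlib-only sketch of that logic, using illustrative stand-in objects rather than the SDK's classes:

```python
from types import SimpleNamespace

def make_span(sampled: bool):
    # Illustrative stand-in for a span whose context carries trace flags.
    ctx = SimpleNamespace(trace_flags=SimpleNamespace(sampled=sampled))
    return SimpleNamespace(get_span_context=lambda: ctx)

def should_sample(span) -> bool:
    # Core of the trace-based filter: offer a measurement as an
    # exemplar only when the current span is sampled.
    return span.get_span_context().trace_flags.sampled

print(should_sample(make_span(False)))  # → False
print(should_sample(make_span(True)))   # → True
```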

So indeed there seems to be another bug here. I'm looking at it.

fcollonval commented 1 day ago

Ok I can reproduce the issue with the example in docs/examples/metrics/instruments/example.py

When I print each measurement together with its should-sample flag, I get:

False Measurement(value=1, time_unix_nano=1730904901413115349, instrument=<opentelemetry.sdk.metrics._internal.instrument._Counter object at 0x7f74a8705f70>, context={}, attributes=None)
False Measurement(value=1, time_unix_nano=1730904901413315139, instrument=<opentelemetry.sdk.metrics._internal.instrument._UpDownCounter object at 0x7f74a8848590>, context={}, attributes=None)
False Measurement(value=-5, time_unix_nano=1730904901413404019, instrument=<opentelemetry.sdk.metrics._internal.instrument._UpDownCounter object at 0x7f74a8848590>, context={}, attributes=None)
False Measurement(value=99.9, time_unix_nano=1730904901413468105, instrument=<opentelemetry.sdk.metrics._internal.instrument._Histogram object at 0x7f74a8883e60>, context={}, attributes=None)
True Measurement(value=1, time_unix_nano=1730904901413655541, instrument=<opentelemetry.sdk.metrics._internal.instrument._ObservableCounter object at 0x7f74a8c77440>, context={}, attributes={})
True Measurement(value=-10, time_unix_nano=1730904901413794618, instrument=<opentelemetry.sdk.metrics._internal.instrument._ObservableUpDownCounter object at 0x7f74a8d086e0>, context={}, attributes={})
True Measurement(value=9, time_unix_nano=1730904901413888022, instrument=<opentelemetry.sdk.metrics._internal.instrument._ObservableGauge object at 0x7f74a88485c0>, context={}, attributes={})

So there is definitely another bug for observable instruments.

xrmx commented 1 day ago

@fcollonval If you haven't already added them, you can cherry-pick the OTLP metrics tests updated with exemplars from https://github.com/xrmx/opentelemetry-python/commit/7a36010c14712dbf283d4f41e20e6d725a7ff96a