open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0
1.8k stars 626 forks source link

Metrics SDK produces ValueError sometimes #4093

Closed pmcollins closed 2 months ago

pmcollins commented 3 months ago

Describe your environment

OS: Intel Mac Python version: 3.8 SDK version: latest main API version: latest main

What happened?

When I use opentelemetry-instrument on a script that sends a single span, the metrics SDK throws a ValueError on around half the runs. The script just gets a tracer and does a single start_as_current_span, but I believe opentelemetry-system-metrics is what runs in the background collecting and sending metrics.

Steps to Reproduce

Use opentelemetry-instrument to run a script containing something like the following

tracer = trace.get_tracer("my-tracer")
  with tracer.start_as_current_span("my-span"):
    print("hello")      

Expected Result

No ValueError exceptions.

Actual Result

(.venv) otel@C02CT31XMD6R opentelemetry-python-contrib % opentelemetry-instrument python trace_loop_grpc.py
Exception while exporting metrics Value out of range: 13228496505721087519
Traceback (most recent call last):
  File "/Users/otel/github/open-telemetry/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 541, in _receive_metrics
    self._exporter.export(
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py", line 160, in export
    return self._export(data=metrics_data)
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py", line 263, in _export
    request=self._translate_data(data),
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py", line 150, in _translate_data
    return encode_metrics(data)
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-common/src/opentelemetry/exporter/otlp/proto/common/_internal/metrics_encoder/__init__.py", line 260, in encode_metrics
    pt.as_int = data_point.value
ValueError: Value out of range: 13228496505721087519
(.venv) otel@C02CT31XMD6R opentelemetry-python-contrib % 

Additional context

On one run, I was able to print the metric before the ValueError and got:

Metric(name='process.runtime.cpython.context_switches', description='Runtime context switches', unit='switches', data=Sum(data_points=[NumberDataPoint(attributes={'type': 'involuntary'}, start_time_unix_nano=1722370176730749000, time_unix_nano=1722370176730845000, value=0), NumberDataPoint(attributes={'type': 'voluntary'}, start_time_unix_nano=1722370176730749000, time_unix_nano=1722370176730845000, value=10484362086929426074)], aggregation_temporality=<AggregationTemporality.CUMULATIVE: 2>, is_monotonic=True))

The value that caused the error in this case was 10484362086929426074, which is about 10^19.

Would you like to implement a fix?

None

aabmass commented 3 months ago

Since the protobuf field is sfixed64, possibly the exporter should handle this case eagerly or somehow convert it to be in the correct range.

However, the metric process.runtime.cpython.context_switches is coming from here and that seems like an impossibly large number of context switches. Maybe this is a bug in psutil for MacOS?

aabmass commented 3 months ago

@pmcollins any way you can check the output of psutil.Process(os.getpid()).num_ctx_switches() when this happens and see if it an SDK issue or psutil issue?

pmcollins commented 3 months ago

Just running it at the python console from my Intel Mac I get

>>> psutil.Process(os.getpid()).num_ctx_switches()
pctxsw(voluntary=3168756596733903550, involuntary=0)

So maybe a psutil issue (?), but also maybe an issue in the SDK with how we handle too big of numbers.

aabmass commented 3 months ago

Ya that seems unexpected on the psutil side. For reference, on my ARM Mac:

$ python3 -c 'import psutil; import os; print(psutil.Process(os.getpid()).num_ctx_switches())'
pctxsw(voluntary=33, involuntary=0)
emdneto commented 3 months ago

Same here ARM Mac

python3 -c 'import psutil; import os; print(psutil.Process(os.getpid()).num_ctx_switches())'
pctxsw(voluntary=26, involuntary=0)
pmcollins commented 2 months ago

I'd like to work on improving the logging for this scenario. I've removed the bug label because I'm not sure it is one.

lzchen commented 2 months ago

@pmcollins

Would you like to be assigned to the issue?