open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0
1.76k stars 614 forks source link

Don't exit ticker thread on collection failure #3788

Closed rbtz-openai closed 6 months ago

rbtz-openai commented 6 months ago

Description

Currently, if collect raises an uncaught error, _ticker stops and service's metrics disappear. This PR makes it more robust for the one-off-failures. Persistent failures would also be more visible in logs.

Type of change

Does This PR Require a Contrib Repo Change?

Checklist:

srikanthccv commented 6 months ago

Can you help me understand when/where the exception is raised? During the metrics collection, one potential place unexpected Exception could be raised in async instruments which we already addressed here https://github.com/open-telemetry/opentelemetry-python/blob/694445f6a39ab55826b955320c74650e1d2395ae/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/instrument.py#L139-L142.

rbtz-openai commented 6 months ago

Can you help me understand when/where the exception is raised?

This is mostly a catchall for things that are not caught by individual callbacks.

We observed collection ticker exiting on exceptions in Exponential Histograms, e.g. the one in:

srikanthccv commented 6 months ago

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

rbtz-openai commented 6 months ago

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

This is mostly a behavioral question: in steady state this should never happen. But in cases when something unexpected happens (new feature or regression) how do we want SDK's PeriodicExportingMetricReader to behave? Should it stop on error or should it try its best to continue? Current PR, assumes that it is better to be loud and retry.