open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK

https://opentelemetry.io

Apache License 2.0

1.76k stars 614 forks source link

Don't exit ticker thread on collection failure #3788

Closed rbtz-openai closed 6 months ago

rbtz-openai commented 6 months ago

Description

Currently, if collect raises an uncaught error, _ticker stops and service's metrics disappear. This PR makes it more robust for the one-off-failures. Persistent failures would also be more visible in logs.

Type of change

[x] Bug fix (non-breaking change which fixes an issue)

Does This PR Require a Contrib Repo Change?

[ ] Yes. - Link to PR:
[x] No.

Checklist:

[ ] Followed the style guidelines of this project
[x] Changelogs have been updated
[ ] Unit tests have been added
[ ] Documentation has been updated

srikanthccv commented 6 months ago

Can you help me understand when/where the exception is raised? During the metrics collection, one potential place unexpected Exception could be raised in async instruments which we already addressed here https://github.com/open-telemetry/opentelemetry-python/blob/694445f6a39ab55826b955320c74650e1d2395ae/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/instrument.py#L139-L142.

rbtz-openai commented 6 months ago

Can you help me understand when/where the exception is raised?

This is mostly a catchall for things that are not caught by individual callbacks.

We observed collection ticker exiting on exceptions in Exponential Histograms, e.g. the one in:

https://github.com/open-telemetry/opentelemetry-python/issues/3600

srikanthccv commented 6 months ago

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

rbtz-openai commented 6 months ago

That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.

This is mostly a behavioral question: in steady state this should never happen. But in cases when something unexpected happens (new feature or regression) how do we want SDK's PeriodicExportingMetricReader to behave? Should it stop on error or should it try its best to continue? Current PR, assumes that it is better to be loud and retry.