Closed rbtz-openai closed 6 months ago
Can you help me understand when/where the exception is raised? During the metrics collection, one potential place unexpected Exception could be raised in async instruments which we already addressed here https://github.com/open-telemetry/opentelemetry-python/blob/694445f6a39ab55826b955320c74650e1d2395ae/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/instrument.py#L139-L142.
Can you help me understand when/where the exception is raised?
This is mostly a catchall for things that are not caught by individual callbacks.
We observed collection ticker exiting on exceptions in Exponential Histograms, e.g. the one in:
That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.
That's a bug that should be fixed. I am not sure how I feel about the catch-all here. We know sync instruments collection shouldn't cause any unexpected issues (I don't remember why mapping raises an Exception) and async instruments collection is already handled.
This is mostly a behavioral question: in steady state this should never happen. But in cases when something unexpected happens (new feature or regression) how do we want SDK's PeriodicExportingMetricReader
to behave? Should it stop on error or should it try its best to continue? Current PR, assumes that it is better to be loud and retry.
Description
Currently, if collect raises an uncaught error, _ticker stops and service's metrics disappear. This PR makes it more robust for the one-off-failures. Persistent failures would also be more visible in logs.
Type of change
Does This PR Require a Contrib Repo Change?
Checklist: