open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0

Random connection reset errors affecting Celery #2390

Open danw-mpl opened 7 months ago

danw-mpl commented 7 months ago

Describe your environment

opentelemetry-api==1.24.0
opentelemetry-distro==0.45b0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.45b0
opentelemetry-instrumentation-botocore==0.45b0
opentelemetry-instrumentation-celery==0.45b0
opentelemetry-instrumentation-dbapi==0.45b0
opentelemetry-instrumentation-django==0.45b0
opentelemetry-instrumentation-logging==0.45b0
opentelemetry-instrumentation-psycopg2==0.45b0
opentelemetry-instrumentation-redis==0.45b0
opentelemetry-instrumentation-requests==0.45b0
opentelemetry-instrumentation-wsgi==0.45b0
opentelemetry-propagator-aws-xray==1.0.1
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-sdk-extension-aws==2.0.1
opentelemetry-semantic-conventions==0.45b0
opentelemetry-util-http==0.45b0

Steps to reproduce

Run a task on a Celery worker launched with opentelemetry-instrument.

What is the expected behavior?

No errors reported.

What is the actual behavior?

Any task a Celery worker executes results in an HTTP connection reset error (or the gRPC equivalent), but the traces are still sent successfully.

Additional context

I'm not getting these errors on non-Celery processes such as Gunicorn.

It's incredibly challenging to diagnose this issue, so I'm not certain whether it's an issue with my stack or how Celery is handling auto instrumentation.

Anyone else seen this issue?

danw-mpl commented 3 months ago

Sadly, this is still ongoing. The Python client logs look like this:

Transient error StatusCode.UNAVAILABLE encountered while exporting traces to ...

Any ideas would be greatly appreciated!
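
One thing that might help narrow this down is tagging each exporter log line with the process and thread that emitted it, so it becomes visible whether the "Transient error" messages come from the main Celery worker process or from its pool children. The following is only a minimal sketch using the standard logging module. It assumes the exporter logs under a logger whose name starts with opentelemetry; attach_process_aware_handler and LOG_FORMAT are hypothetical names chosen for illustration, not part of any OpenTelemetry or Celery API.

```python
import logging
import sys

# Hypothetical format: adds process ID, process name and thread name to every record.
LOG_FORMAT = (
    "%(asctime)s pid=%(process)d proc=%(processName)s "
    "thread=%(threadName)s %(name)s %(levelname)s %(message)s"
)


def attach_process_aware_handler(logger_name="opentelemetry", stream=None):
    """Attach a handler that prints process and thread details for each record.

    Assumption: the exporter's loggers are children of `logger_name`, so their
    records propagate up to the handler added here.
    """
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    # Lower the threshold so debug output from child loggers is also shown.
    logger.setLevel(logging.DEBUG)
    return handler
```

Calling attach_process_aware_handler() early in the worker's startup code, before any tasks run, should be enough to try it. If the errors only ever carry the pid of a pool child, that would at least point toward how the worker processes are set up rather than toward the collector.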