open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Exception while exporting Span batch - Celery #2678

Closed. Shackelford-Arden closed this issue 2 years ago.

Shackelford-Arden commented 2 years ago

Describe your environment

Python Version: 3.8.12 ('Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-glibc2.2.5')

OpenTelemetry dependencies:

opentelemetry-api                      1.7.1
opentelemetry-exporter-otlp            1.8.0
opentelemetry-exporter-otlp-proto-grpc 1.8.0
opentelemetry-exporter-otlp-proto-http 1.8.0
opentelemetry-proto                    1.8.0
opentelemetry-sdk                      1.7.1
opentelemetry-semantic-conventions     0.26b1

This project uses Celery, and the message we're seeing only occurs in the Celery worker.

We have the following environment variables set within the container (Docker):

GRPC_POLL_STRATEGY="epoll1"
GRPC_ENABLE_FORK_SUPPORT=1
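
For reference, the tracing pipeline is set up inside each worker process roughly as in the sketch below. The signal hook, module layout, and collector endpoint here are illustrative rather than the exact project code:

from celery.signals import worker_process_init

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


@worker_process_init.connect(weak=False)
def init_tracing(*args, **kwargs):
    # Build the provider inside the forked worker process so the
    # BatchSpanProcessor export thread and gRPC channel are created post-fork.
    provider = TracerProvider()
    exporter = OTLPSpanExporter(endpoint="collector:4317", insecure=True)  # endpoint is a placeholder
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)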

Steps to reproduce


We've been able to reproduce the error every time we initiate work within a Celery task. The project is a bit large, so I'll come back with a smaller-scale example; a rough sketch of the task pattern follows below.

By the time we get this first log message, we've generated about 50 spans.
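
The tasks that trigger it follow roughly this pattern; the task name, span names, and broker URL below are placeholders, not the real ones:

from celery import Celery

from opentelemetry import trace

app = Celery("worker", broker="redis://localhost:6379/0")  # broker URL is a placeholder
tracer = trace.get_tracer(__name__)


@app.task
def process_item(item_id):
    # Each run opens several nested spans; roughly 50 spans have been
    # generated by the time the first exporter exception is logged.
    with tracer.start_as_current_span("process_item"):
        with tracer.start_as_current_span("load-data"):
            pass
        with tracer.start_as_current_span("transform-data"):
            pass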

What is the expected behavior?

I expect spans to be exported without any error log messages from the exporter appearing.

What is the actual behavior?

We receive the following exception at what appear to be random times:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py\", line 358, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File \"/usr/local/lib/python3.8/site-packages/opentelemetry/exporter/otlp/proto/grpc/trace_exporter/__init__.py\", line 321, in export
    return self._export(spans)
  File \"/usr/local/lib/python3.8/site-packages/opentelemetry/exporter/otlp/proto/grpc/exporter.py\", line 274, in _export
    self._client.Export(
  File \"/usr/local/lib/python3.8/site-packages/grpc/_channel.py\", line 944, in __call__
    state, call, = self._blocking(request, timeout, metadata, credentials,
  File \"/usr/local/lib/python3.8/site-packages/grpc/_channel.py\", line 926, in _blocking
    call = self._channel.segregated_call(
  File \"src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi\", line 496, in grpc._cython.cygrpc.Channel.segregated_call
  File \"src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi\", line 353, in grpc._cython.cygrpc._segregated_call
  File \"src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi\", line 357, in grpc._cython.cygrpc._segregated_call

I don't know if this is helpful or distracting, but while experimenting I set GRPC_VERBOSITY to debug. Only on the first occurrence of the above exception did gRPC log anything at the same time:

I0510 22:18:56.385133713      26 socket_utils_common_posix.cc:429] Disabling AF_INET6 sockets because ::1 is not available.
D0510 22:18:56.385866369      26 tcp_posix.cc:1852]          cannot set inq fd=26 errno=92

Note: This GRPC logging does not occur again

Additional context

These log messages are only observed when upgrading the opentelemetry-sdk package to 1.7.1.

1.6.2 works without log messages appearing.

This particular project cannot go above Python 3.8.x. Notably, another project with the exact same architecture, but running on Python 3.9.x, does not produce these log messages, even on the latest versions of the API/SDK packages.

Based on when we're seeing these log messages, my guess is that this might be related to some of the work completed in #2242.

srikanthccv commented 2 years ago

This looks more like an issue with grpc under specific conditions. Can you share a simple reproducible example?

Shackelford-Arden commented 2 years ago

Apologies for the delay here; it took me a little longer to put something together that matched how we were running things. Funnily enough, when I tried to replicate it outside of our application, I couldn't. Then, for sanity, I tried replicating it again in our application and I couldn't. So I'm just crazy, apparently.

If I see it again and can replicate, I'll come back. For now, I'll close this out.