open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0

Trace exporter errors but metric exporter succeeds #2075

Open ekquasar opened 9 months ago

ekquasar commented 9 months ago

Summary

The OTLPMetricExporter reaches the Collector but the OTLPSpanExporter fails with this error (formatted for readability):

E1130 15:37:07.081350000 6174420992 ssl_transport_security.cc:1511]
    Handshake failed with fatal error SSL_ERROR_SSL:
        error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

Unfortunately, the error is opaque and does not include remediation instructions.

Environment

Python 3.11.6 on macOS Sonoma 14.1.1

Steps to reproduce

Run a demo app exporting metrics and traces to a collector.

Instrumentation setup

import os

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

OTLP_ENDPOINT = os.getenv("OTLP_ENDPOINT")
# Resource, i.e. whoami
resource = Resource(attributes={
    SERVICE_NAME: "demo"
})
# Traces
tprovider = TracerProvider(resource=resource)
otlp_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=OTLP_ENDPOINT))
tprovider.add_span_processor(otlp_processor)
trace.set_tracer_provider(tprovider)
tracer = trace.get_tracer("demo-tracer")
# Metrics
otlp_metric_reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=OTLP_ENDPOINT))
mprovider = MeterProvider(resource=resource, metric_readers=[otlp_metric_reader])
metrics.set_meter_provider(mprovider)
meter = metrics.get_meter("demo-meter")

Instrumentation definition

latency_histogram = meter.create_histogram(
    "response_time",
    unit="ms",
    description="Time spent in a request",
)
call_count = meter.create_counter(
    "call_count",
    unit="1",
    description="Number of requests",
)

import functools
import time

import flask
from opentelemetry.semconv.trace import SpanAttributes

def instrument(func):
    """Instrument a Flask route: create a span for the request that tracks
    the request's method, scheme, host, path, etc.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        req = flask.request
        _t0 = time.time_ns()
        with tracer.start_as_current_span(func.__name__) as span:
            span.set_attribute(SpanAttributes.HTTP_METHOD, req.method)
            span.set_attribute(SpanAttributes.HTTP_SCHEME, req.scheme)
            span.set_attribute(SpanAttributes.HTTP_HOST, req.host)
            span.set_attribute(SpanAttributes.HTTP_TARGET, req.path)
            span.set_attribute(SpanAttributes.HTTP_USER_AGENT, req.user_agent.string)
            span.set_attribute(SpanAttributes.HTTP_CLIENT_IP, req.remote_addr)
            span.set_attribute(SpanAttributes.HTTP_FLAVOR, req.environ.get('SERVER_PROTOCOL'))
            span.set_attribute(SpanAttributes.NET_HOST_PORT, req.environ.get('SERVER_PORT'))
            span.set_attribute(SpanAttributes.NET_HOST_NAME, req.environ.get('SERVER_NAME'))
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                span.set_status(trace.Status(trace.StatusCode.ERROR))
                span.record_exception(exc)
                raise
            else:
                span.set_status(trace.Status(trace.StatusCode.OK))
        _t1 = time.time_ns()
        _req_dur = (_t1 - _t0) / 1e6  # wall-clock duration in ms
        # span.start_time and span.end_time are in nanoseconds
        request_duration = (span.end_time - span.start_time) / 1e6
        assert request_duration / _req_dur > 0.1, "span duration disagrees with wall-clock duration"
        latency_histogram.record(request_duration)
        call_count.add(1)
        return result
    return wrapper
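Stripped of the OpenTelemetry specifics, the decorator above is a plain timing-wrapper pattern, which can be sketched with the standard library alone (the `durations_ms` list is a hypothetical stand-in for the latency histogram):

```python
import functools
import time

durations_ms = []  # stands in for latency_histogram


def timed(func):
    """Wrap a function and record its wall-clock duration in milliseconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.time_ns()
        try:
            return func(*args, **kwargs)
        finally:
            # time_ns() gives nanoseconds; divide by 1e6 for milliseconds
            durations_ms.append((time.time_ns() - t0) / 1e6)
    return wrapper


@timed
def handler():
    time.sleep(0.01)
    return "OK"
```

The real decorator differs in one respect: it records the span's own duration (`span.end_time - span.start_time`) rather than the wall-clock measurement, and uses the wall-clock value only as a sanity check.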

Make a Flask app

import random
import time

from flask import Flask

app = Flask(__name__)

@app.route('/')
@instrument
def index():
    sec = 0.225 + random.random() * 0.03
    time.sleep(sec)
    print(f"slept {sec * 1000.} ms")
    return "OK"

Run the app

In one terminal:

OTLP_ENDPOINT=127.0.0.1:4317 flask run --port 8080

Send requests

In another terminal (fish syntax):

while true
    curl localhost:8080
    sleep 1
end

Run the collector

In a third terminal, from the docs:

docker pull otel/opentelemetry-collector:0.90.0
docker run -p 127.0.0.1:4317:4317 -p 127.0.0.1:55679:55679 otel/opentelemetry-collector:0.90.0

Note the 4317 endpoint exposed.

Expected behavior

  1. Spans arrive in the collector
  2. Metrics arrive in the collector

Actual behavior

The metrics appear to arrive successfully, but the spans fail with an OpenSSL error.

Span exporter

I observe this error in the Demo App console logs:

E1130 15:37:07.081350000 6174420992 ssl_transport_security.cc:1511]    Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
WARNING:opentelemetry.exporter.otlp.proto.grpc.exporter:Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 127.0.0.1:4317, retrying in 1s.

Metric exporter

While it doesn't include the description, units, or the other metadata from the metric definition, the latency histogram does appear in the Collector's console logs. This indicates that OTLP successfully exports metrics from the Demo App to the Collector. For example, here is a snippet of the Collector console logs:

Metric #9
Descriptor:
     -> Name: otelcol_processor_batch_batch_send_size
     -> Description: Number of units in the batch
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> processor: Str(batch)
     -> service_instance_id: Str(2d791575-cdb2-48b1-8916-8b01696690d0)
     -> service_name: Str(otelcol)
     -> service_version: Str(0.90.0)
StartTimestamp: 2023-11-30 20:21:51.783 +0000 UTC
Timestamp: 2023-11-30 20:48:31.787 +0000 UTC
Count: 161
Sum: 2571.000000
ExplicitBounds #0: 10.000000
ExplicitBounds #1: 25.000000
ExplicitBounds #2: 50.000000
...
Buckets #0, Count: 0
Buckets #1, Count: 161
...

Additional context

Note also that although the metrics arrive, they appear to indicate that each request takes between 25 and 50 ms. However, I log the request time independently and see that requests to this Demo App all take 240 ms +/- 15 ms.

LeoQuote commented 4 months ago

I also experienced this. It means the exporter is trying to connect to the endpoint over TLS.

Try setting

    --exporter_otlp_insecure true

if you're seeing this error.
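For the programmatic setup shown in this issue, the equivalent fix is a sketch along these lines: the gRPC OTLP exporters in opentelemetry-python accept an `insecure` keyword argument (and the SDK also reads the `OTEL_EXPORTER_OTLP_INSECURE` environment variable), which tells them to use a plaintext channel instead of TLS; the endpoint value here matches the demo's collector:

```python
# Sketch: configure the gRPC OTLP exporters for a plaintext (non-TLS) collector.
# Assumes the opentelemetry-exporter-otlp-proto-grpc package, whose exporter
# constructors accept an `insecure` keyword argument.
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

span_exporter = OTLPSpanExporter(endpoint="127.0.0.1:4317", insecure=True)
metric_exporter = OTLPMetricExporter(endpoint="127.0.0.1:4317", insecure=True)
```

This would explain the symptom: a TLS handshake sent to the collector's plaintext 4317 port fails with `WRONG_VERSION_NUMBER`, while an exporter already configured for plaintext succeeds.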