open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0
698 stars 579 forks

Same trace ID for multiple requests with large bodies in Starlette #2852

Open jonathanberthias opened 1 week ago

jonathanberthias commented 1 week ago

Describe your environment

OS: Ubuntu
Python version: 3.11
Package version: 0.48b0 / 1.27.0

$ pip install starlette uvicorn opentelemetry-sdk opentelemetry-instrumentation-asgi httpx
$ pip freeze
anyio==4.4.0
asgiref==3.8.1
certifi==2024.8.30
click==8.1.7
Deprecated==1.2.14
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
idna==3.8
importlib_metadata==8.4.0
opentelemetry-api==1.27.0
opentelemetry-instrumentation==0.48b0
opentelemetry-instrumentation-asgi==0.48b0
opentelemetry-sdk==1.27.0
opentelemetry-semantic-conventions==0.48b0
opentelemetry-util-http==0.48b0
setuptools==74.1.1
sniffio==1.3.1
starlette==0.38.4
typing_extensions==4.12.2
uvicorn==0.30.6
wrapt==1.16.0
zipp==3.20.1

What happened?

I have been observing that spans from multiple requests are assigned the same trace ID, so when viewing traces the spans are mixed together even though they belong to separate contexts. After some digging, I found that this happens when the request body is large (>= 65536 bytes).

Steps to Reproduce

A simple instrumented Starlette app:

from starlette.applications import Starlette
from starlette.routing import Route
from starlette.responses import JSONResponse
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.instrumentation.asgi import OpenTelemetryMiddleware
import uvicorn

provider = TracerProvider()
trace.set_tracer_provider(provider)

def trace_id(_):
    trace_id = trace.get_current_span().get_span_context().trace_id
    return JSONResponse({"trace_id": f"{trace_id:x}"})

app = Starlette(routes=[Route("/trace_id", trace_id, methods=["POST"])])
app.add_middleware(OpenTelemetryMiddleware)

if __name__ == "__main__":
    uvicorn.run(app, loop="asyncio")

With the app running, then launch:

from random import random
from collections import Counter
import httpx
import asyncio
import json

async def send(client, body):
    await asyncio.sleep(0.5 * random())
    response = await client.post("http://127.0.0.1:8000/trace_id", json=body)
    return response.json()["trace_id"]

async def run(ntasks, body_size):
    body = "a" * body_size
    async with httpx.AsyncClient() as client:
        tasks = [send(client, body) for _ in range(ntasks)]
        return await asyncio.gather(*tasks)

async def main():
    body_size_ok = 65534
    body_size_error = 65535
    nrequests = 20
    ids = await run(nrequests, body_size_ok)
    assert len(set(ids)) == nrequests

    ids = await run(nrequests, body_size_error)
    assert len(set(ids)) < nrequests
    print(Counter(ids))
    # example output:
    # Counter({'113ea5825879c5ce05a6b555f6b2ad63': 17, '8ca53924f5fdbae18093abaf96c4b22d': 2, 'b5a343b45edf9ee81b8aa65045b7b7af': 1})

if __name__ == "__main__":
    asyncio.run(main())

Expected Result

When the size of the body is not too large (e.g. body_size_ok in the MRE), each request is assigned its own trace ID. The same should hold for larger bodies.

Actual Result

The same trace IDs get reused for multiple (concurrent) requests.

Additional context

When running the app with Uvicorn, I noticed that the issue only happens with the loop="asyncio" option, not with loop="uvloop". I didn't dig any further; maybe it's just the threshold size that changes with uvloop, or perhaps that is where the issue comes from? :shrug:

Would you like to implement a fix?

None

xrmx commented 1 week ago

AFAIU you are getting multiple spans for the same request, so why would you want multiple trace ids for the same request?

jonathanberthias commented 1 week ago

AFAIU you are getting multiple spans for the same request, so why would you want multiple trace ids for the same request?

I guess the description wasn't clear enough 😅 The issue is about multiple requests sharing the same trace id. I think the MRE illustrates what I mean: I am looking at the trace id assigned inside the endpoint across many requests, and I expect each request to have a different trace id. That is what I observe when the requests are small, or when I send the requests without adding some random sleep time.

Does that make things clearer?

jonathanberthias commented 1 week ago

The threshold comes from Uvicorn: it is the limit at which a request body is split into multiple ASGI messages. https://github.com/encode/uvicorn/blob/47304d9ae76321f0f5f649ff4f73e09b17085933/uvicorn/protocols/http/flow_control.py#L7
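To illustrate (this is a hypothetical sketch, not uvicorn's actual code; `to_asgi_messages` and `CHUNK_LIMIT` are made-up names), a body larger than that high-water mark would be delivered to the app as several `http.request` ASGI messages chained with `more_body=True`, so the instrumented middleware sees multiple receive events for one request:

```python
# Hypothetical helper (not from uvicorn): split a request body into ASGI
# "http.request" messages, assuming a 64 KiB flow-control threshold.
CHUNK_LIMIT = 65536  # mirrors uvicorn's HIGH_WATER_MARK linked above

def to_asgi_messages(body: bytes, limit: int = CHUNK_LIMIT) -> list[dict]:
    # Slice the body into at most `limit`-sized chunks (empty body -> one chunk)
    chunks = [body[i:i + limit] for i in range(0, len(body), limit)] or [b""]
    return [
        {"type": "http.request", "body": chunk, "more_body": i < len(chunks) - 1}
        for i, chunk in enumerate(chunks)
    ]

# A small body arrives as a single message; a large one is split across several
assert len(to_asgi_messages(b"a" * 1000)) == 1
assert len(to_asgi_messages(b"a" * 100_000)) == 2
```

This matches the numbers in the MRE: httpx JSON-encodes the string body, and only payloads over the threshold get split into more than one message.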

xrmx commented 1 week ago

Yeah, that's what I supposed it to be. My point is that these are all the same request

jonathanberthias commented 1 week ago

Yes, all the ASGI messages that carry the body belong to the same request, but here I am sending multiple requests concurrently, each with its own set of messages. In that case, the same trace id is used across different requests.

xrmx commented 1 week ago

Yes, all the ASGI messages that carry the body belong to the same request, but here I am sending multiple requests concurrently, each with its own set of messages. In that case, the same trace id is used across different requests.

Ah! Thanks for the explanation