jaegertracing / jaeger-ui

Web UI for Jaeger
http://jaegertracing.io/
Apache License 2.0
1.1k stars 470 forks source link

[Bug]: Critical path tracing shows the wrong critical path for async script #2371

Open magicmark opened 2 weeks ago

magicmark commented 2 weeks ago

👋 I've been on a bit of journey trying ot understand how critical path tracing works. I'm still not 100% so apologies if this is wrong, but would appreciate y'alls input :)

Full context here https://github.com/open-telemetry/opentelemetry-specification/issues/4079

Anyway, I was able to come up with the following minimal repro:

image

This visualization appears to be wrong ^

(JSON)

Per the following code (which produces this trace), /get foo3 is not blocked by /get foo2:

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())
full code (including imports) ```python import asyncio from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.trace.export import ( BatchSpanProcessor, SimpleSpanProcessor, ConsoleSpanExporter, ) trace.set_tracer_provider(TracerProvider()) trace.get_tracer_provider().add_span_processor( SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")) ) trace.get_tracer_provider().add_span_processor( SimpleSpanProcessor(ConsoleSpanExporter()) ) tracer = trace.get_tracer("async_example") async def make_fake_network_request(resource_name, length): with tracer.start_as_current_span(f'get /{resource_name}'): await asyncio.sleep(length) async def foo1(): await make_fake_network_request("foo1", 0.1) await asyncio.sleep(0.1) await foo3() async def foo2(): await make_fake_network_request("foo2", 0.2) async def foo3(): await make_fake_network_request("foo3", 0.3) @tracer.start_as_current_span("main") async def main(): await asyncio.gather(foo1(), foo2()) asyncio.run(main()) ```

I understand that the algorithm walks backwards, but i'm guessing it can only do so heuristically? Or is there something else i'm missing here? (I don't know if we provide the "previous sibling span id"? Is this an issue with the spec itself (hence the other issue)?)

Thanks!

Jaeger backend version

1.58.0

SDK

python = "^3.12"
opentelemetry-api = "^1.25.0"
opentelemetry-exporter-otlp = "^1.25.0"
opentelemetry-sdk = "^1.25.0"

Pipeline

Operating system

MacOS

Deployment model

Running jaeger-all-in-one locally

yurishkuro commented 2 weeks ago

@dosubot

yurishkuro commented 2 weeks ago

All three foo spans are siblings in your trace (which does not exactly match the code structure where, one could argue, foo3 is a child of foo1 because foo1 doesn't succeed without foo3). There is no "right" way to do critical path when you just have equal siblings, Jaeger simply shows how the execution happened. The most basic definition of critical path is:

a segment is on the CP if making the execution of that segment shorter would make the whole request shorter

I believe Jaeger's depiction meets this definition.

What were you expecting to see?

yurishkuro commented 2 weeks ago

Also, Jaeger is not claiming that foo3 is blocked by foo2, that's just what your trace is telling us has happened.

yurishkuro commented 2 weeks ago

@jkowall Dosu seems not to pick up UI issues, this would've been a good one to LLM.

magicmark commented 2 weeks ago

@yurishkuro

a segment is on the CP if making the execution of that segment shorter would make the whole request shorter

I believe Jaeger's depiction meets this definition.

In this case, making foo2() shorter will not make the whole request shorter. I think this violates the definition?

What were you expecting to see?

The true critical path is foo1() -> foo2(). That would be the critical path I as a human would highlight in the UI. (There would be a bit of unannotated unaccounted for time.)

Jaeger is not claiming that foo3 is blocked by foo2, that's just what your trace is telling us has happened.

Is this what the trace says happened? I looked at the JSON and I don't see that there's any indication that says foo3 is blocked on foo2. The only thing we know is that they're technically all sibling spans of main() in terms of nesting of annotations. (i.e. refType=CHILD_OF. No FOLLOWS_FROM.) But I'm still diving into this for the first time - could you elaborate on where in the JSON that the trace tells jaeger otherwise?

magicmark commented 2 weeks ago

Just to elaborate on this:

All three foo spans are siblings in your trace (which does not exactly match the code structure where, one could argue, foo3 is a child of foo1 because foo1 doesn't succeed without foo3).

fwiw yes I agree that in an idealized world, the foo spans might have their own explicit annotations, making this an easier problem to solve. However this is a minimal repro from an issue i'm seeing at scale, where the only thing we have guaranteed annotations for is outbound network requests, in a sea of deeply nested async code - making it difficult to know what was kicked of where by who. This minimal repro is very representative of a real world problem.

yurishkuro commented 2 weeks ago

In this case, making foo2() shorter will not make the whole request shorter. I think this violates the definition?

You are basing that on your knowledge of the code, but that's not reflected in the trace. Jaeger can only make decision based on trace information, and in that trace the program was first "doing work" in foo2, then some work in main, then foo3.

could you elaborate on where in the JSON that the trace tells jaeger otherwise?

It's in the timing of the span. First there was work in foo2, then in foo3. That's all that Jaeger knows.

As I mentioned, if you want the trace to more accurately reflect the dependencies, you should have foo3 be a child of foo1. It would then affect how the CP is displayed. When spans are just siblings, Jaeger cannot make any assumptions - the could be running in parallel, OR they could have hidden inter-dependencies (e.g. foo1 and foo3 could be true siblings and yet have a logical dependency foo3 follows-from foo1).

magicmark commented 2 weeks ago

You are basing that on your knowledge of the code, but that's not reflected in the trace.

I think that's precisely the point i'm getting at here!

Jaeger fundamentally doesn't have enough information provided in this trace to accurately confidently say "foo2 is in the critical path and i will highlight it as such".

It's in the timing of the span. First there was work in foo2, then in foo3. That's all that Jaeger knows.

I think this minimal repro shows where this assumption that this is enough to calculate CP breaks down when you have concurrent spans.

When spans are just siblings, Jaeger cannot make any assumptions

Per the above, it looks like Jaeger is making assumptions?

To clarify, I think we're in agreement that:

My point is that there doesn't seem to be enough information currently provided in the trace JSON for jaeger to say that "foo2 is in the critical path" - because as we humans can see, it isn't!

Maybe in such cases, Jaeger could color the line differently to highlight this is a heuristic/best guess? I think that would help boost confidence in using this feature (otherwise it's difficult to know when it's accurate or not)

yurishkuro commented 2 weeks ago

I'll ask again - what do you expect to see? CP is a heuristic, and yes Jaeger is making assumptions but based on the data present in the span (which eg. shows foo2 was running solo and therefor WAS on the critical path, same as foo3). If either of those weren't in the program the parent span would've been shorter.

magicmark commented 2 weeks ago

CP is a heuristic

Ah I think "CP is a heuristic" is probably the key bit of information i was missing. TIL. Purely looking at the UI this wasn't clear to me that this should be assumed to be a heurstic.

The only thing I'm able to see on this is the popover:

Screenshot 2024-07-07 at 4 41 13 PM

TBH I don't think it's currently super obvious to users that this is a heuristic.

I'll ask again - what do you expect to see?

Again:

Maybe in such cases, Jaeger could color the line differently to highlight this is a heuristic/best guess? I think that would help boost confidence in using this feature (otherwise it's difficult to know when it's accurate or not)

Additionally:

Is this documented somewhere? I took a look on the docs (https://www.jaegertracing.io/docs/1.18/frontend-ui/) and couldn't see anything related to CP (mentioning that it's a heuristic or otherwise) - maybe I couldn't find the page though?

I would either accept the status quo and:

OR maybe this solvable with a spec change + smarter SDK? https://github.com/open-telemetry/opentelemetry-specification/issues/4079 related

foo2 was running solo and therefor WAS on the critical path

I could shorten or remove foo2 altogether and it wouldn't make the request any shorter. Based on my understanding and your definition above ("if making the execution of that segment shorter would make the whole request shorter"), foo2 is not on the critical path.

yurishkuro commented 2 weeks ago

add a pointer in this page ^ documenting this feature, clarifying that it's a heuristic

we could add a note in the popup

magicmark commented 2 weeks ago

Sounds like a good idea. I would also encourage exploring the idea of coloring the CP differently in cases where there is ambiguity (such as here). In a world where there are only sequential non async spans (and maybe this is provided as an input via the SDK) then the current approach seems sound. In an a microservice environment you may have a mix of both.

yurishkuro commented 2 weeks ago

I wouldn't do different coloring. OTEL data model is fundamentally lacking certain causality information, so the algorithm is always a heuristic, since we never know the true causal graph of events.

Feel free to open a PR to add a warning to the tooltip.