open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

How can we find critically blocking distributed tracing spans? (Should we add `previous_sibling_span_id` as an optional span attribute?) #4079

Open magicmark opened 2 weeks ago

magicmark commented 2 weeks ago

When analyzing the distributed trace for a page, we often want to understand: “What is responsible for making this page slow?”

Specifically: what are the critically blocking spans? This can be quite tricky…

Example

Consider the following simplified trace for a webpage:

[trace diagram: sibling spans foo and bar under the page request; bar finishes later]

We can see here that improving the performance of foo won’t immediately fix the overall TTFB since we’d still be blocked on bar.

If we want to draw a “red line” through the trace representing the critically blocking spans, it’s easy to do so here:

[same trace with a red line drawn through the critically blocking span, bar]

All fine and dandy.

Now how about this one? Good luck :)

[trace diagram: many overlapping foo spans under a single parent span]

All of these foo spans are sibling spans. They all share the same parent (a server span of the request into services/abc).

Consider: in a Node.js or Python + asyncio environment, the above scenario is very real.

Can we solve this today?

I don't know! Hence this issue :)

Thoughts:

def foo():
  bar_results = bar()
  baz_results = baz(bar_results)
  return qux(baz_results)

^ Is this correct? As far as I can tell, the emitted spans record only start and end times; nothing encodes that baz could not start until bar returned, or that qux was blocked on baz. (Is there some feature of OpenTelemetry that I'm missing that provides this insight?)
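For concreteness, here's a minimal sketch of what recording such an ordering hint by hand might look like with today's Python API. Nothing here is a built-in feature: the `previous_sibling_span_id` attribute name is just the hypothetical one from the title, and the bar/baz/qux bodies are stubs standing in for real work.

```python
# Sketch only: manually recording the hypothetical `previous_sibling_span_id`
# attribute. Nothing here is defined by the spec or auto-populated by the SDK.
from opentelemetry import trace

tracer = trace.get_tracer("sibling_order_sketch")


def _span_id_hex(span):
    # Hex-encode the span id the way exporters usually render it.
    return format(span.get_span_context().span_id, "016x")


def bar():
    return "bar_results"


def baz(bar_results):
    return f"baz({bar_results})"


def qux(baz_results):
    return f"qux({baz_results})"


def foo():
    with tracer.start_as_current_span("foo"):
        with tracer.start_as_current_span("bar") as bar_span:
            bar_results = bar()
        with tracer.start_as_current_span("baz") as baz_span:
            # `baz` could not start until `bar` returned.
            baz_span.set_attribute("previous_sibling_span_id", _span_id_hex(bar_span))
            baz_results = baz(bar_results)
        with tracer.start_as_current_span("qux") as qux_span:
            # Likewise, `qux` was blocked on `baz`.
            qux_span.set_attribute("previous_sibling_span_id", _span_id_hex(baz_span))
            return qux(baz_results)
```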

Thanks!

magicmark commented 2 weeks ago

FWIW, assuming this functionality doesn't yet exist, I could imagine some fancy build-time, sourcemap-like solution that looks at the AST and stores the ordering relationships between sibling function calls. We would then upload this data for a web UI viewer to use to draw the red-line relationships for us.

Prior art: Uploading JS sourcemaps to make stack traces from production readable.

Is anyone aware of any projects to do this, or anything else that solves this problem?

EDIT: Maybe https://opentelemetry.io/docs/concepts/signals/traces/#span-links is what I want? Can we encode this in span links? Specifically, this could be rephrased as: “can we get the SDK to auto-populate span links data?”

Although, per https://github.com/jaegertracing/jaeger/issues/4516, even if we did that, it's unclear whether it would be useful yet with existing open-source viewers.
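For illustration, here's a rough sketch of what hand-rolling this with today's Python API might look like, under the assumption that nothing populates it automatically; the `relationship` attribute value is made up for the example.

```python
# Sketch: hand-rolled span links pointing at the previously completed sibling.
# The SDK does not populate anything like this automatically today.
import asyncio
from typing import Optional

from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext

tracer = trace.get_tracer("sibling_links_sketch")


async def fetch(name: str, blocked_on: Optional[SpanContext] = None) -> SpanContext:
    # Links are supplied at span creation time.
    links = [Link(blocked_on, {"relationship": "previous_sibling"})] if blocked_on else []
    with tracer.start_as_current_span(name, links=links) as span:
        await asyncio.sleep(0.1)
        return span.get_span_context()


async def main():
    with tracer.start_as_current_span("parent"):
        bar_ctx = await fetch("bar")            # no predecessor
        await fetch("baz", blocked_on=bar_ctx)  # causally after bar


asyncio.run(main())
```

Whether existing viewers would actually render such links usefully is, per the Jaeger issue above, a separate question.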

cartermp commented 2 weeks ago

FWIW, I think this is probably more the domain of telemetry backends than OTel itself. I'm aware of at least one (ServiceNow Observability, formerly Lightstep) that has robust critical path calculations in its trace waterfall.

As far as how a backend would do it, I suppose it depends on how pedantic you want to get. In the case of many sibling spans, there is still a critical path, even if fixing it doesn't make the wall-clock time materially faster due to siblings of a similar duration. What you'd want out of that backend is also some way to calculate siblings with a similar duration and highlight them in addition to the critical path.

magicmark commented 2 weeks ago

Thanks @cartermp!

I'd be really curious how existing UIs calculate the critical path. The current data provided in spans doesn't seem sufficient to do so accurately: we don't encode causal relationships between sibling spans, only t_start and t_end (hence why I'm wondering if this needs to be part of the spec itself).

cartermp commented 1 week ago

For non-async traces, critical paths are pretty easy to calculate (just walk backwards!) but yes, because there's nothing between peer spans saying that they're peers, you need to "draw" the whole trace and make that observation yourself.

I think the issue here is that you cannot calculate a "true and accurate" critical path. It's really that there's a critical path with a block in the middle of it made up of some sibling spans that may or may not be part of that critical path in another trace. And practically speaking, even if they were consistently placed within a trace, it's often the case that reducing the latency of one of those operations wouldn't materially impact the total latency, because a sibling of a similar duration would take its place.

My view, at least, is that this calls for two things: a best-effort critical path calculation, and a way to visualize the places within that path that are affected by siblings and so may not truly be part of the path in other traces.

svrnm commented 1 week ago

Jaeger has critical path support; you might be interested in taking a look at the implementation:

https://github.com/jaegertracing/jaeger-ui/pull/1582

Interesting side note: systemd-analyze also has critical path analysis; you might be interested in that implementation as well: https://www.freedesktop.org/software/systemd/man/latest/systemd-analyze.html

tedsuo commented 1 week ago

@magicmark can you describe a situation where sibling spans are blocking each other?

magicmark commented 1 week ago

@tedsuo

@magicmark can you describe a situation where sibling spans are blocking each other?

Consider the following (synchronous) code:

@trace
def foo():
  bar()
  baz()
  qux()

Assume that foo, bar, baz, and qux are all annotated to be traced. To directly answer your question and make sure we're on the same page, this is the base case in my mind of "blocking sibling spans".

i.e. you'd get a trace like this:

[foo..................]
   [bar]
        [baz..]
               [qux...] 

In this synchronous example, it's easy to work out the critical path and see what blocks what.

But now what about this?

@trace
async def foo():
  await asyncio.gather(bar(), baz())
  qux()

Now it starts to get a bit tricky; you'd see something like this:

[foo..................]
   [bar.......]
   [baz..]
               [qux...] 

Now imagine that bar and baz are themselves async functions with their own nested parallelism, and you get a trace like the one I posted in the OP. In a large app, it becomes very difficult to tell from the traces which functions are the best candidates for optimization.

magicmark commented 1 week ago

Sorry... I think I take this all back, actually!

As has been pointed out, Jaeger shows the critical path (marked by the black line) out of the box when I download the latest version and run it locally. I wrote some sample code to model the trace in the OP, and it totally works:

[screenshot: Jaeger trace view with the critical path marked]
Repro code fwiw, if anyone is interested:

```python
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("async_example")


async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(resource_name):
        await asyncio.sleep(length)


async def foo5():
    await make_fake_network_request("foo5", 0.12)


async def foo2():
    await make_fake_network_request("foo2", 0.17)
    await foo5()


async def foo6():
    await make_fake_network_request("foo6", 0.12)


async def foo3():
    await make_fake_network_request("foo3", 0.1)
    await foo6()


async def foo7():
    await make_fake_network_request("foo7", 0.3)


async def foo4():
    await make_fake_network_request("foo4", 0.2)
    await foo7()


async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.gather(foo2(), foo3(), foo4())


@tracer.start_as_current_span("main")
async def main():
    await asyncio.sleep(0.1)
    await foo1()


asyncio.run(main())
```

...I think this is exactly what I'm looking for? I'm still not sure exactly how this works and how it accounts for the different possible interpretations of execution order (see OP), but I'll dig in more to understand.

(Thanks @GLVSKiriti!)

magicmark commented 1 week ago

That said... this is an interesting trace, which I'll dig into more:

[screenshot: Jaeger trace view where the critical path runs through foo4 rather than foo3]

I changed the code such that foo7 is actually blocked on foo3, but the critical path still shows it as blocked on foo4 🤔

```python
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("async_example")


async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(resource_name):
        await asyncio.sleep(length)


async def foo5():
    await make_fake_network_request("foo5", 0.12)


async def foo2():
    await make_fake_network_request("foo2", 0.17)
    await foo5()


async def foo6():
    await make_fake_network_request("foo6", 0.12)


async def foo3():
    await make_fake_network_request("foo3", 0.1)
    await asyncio.sleep(0.1)
    await foo7()


async def foo7():
    await make_fake_network_request("foo7", 0.3)


async def foo4():
    await make_fake_network_request("foo4", 0.2)
    # await foo7()


async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.gather(foo2(), foo3(), foo4())


@tracer.start_as_current_span("main")
async def main():
    await asyncio.sleep(0.1)
    await foo1()


asyncio.run(main())
```
cartermp commented 1 week ago

Yeah, most critical path systems work backwards from the end of the trace and (sometimes via recursion) figure out what's most blocking. Uber has a good post on it here: https://www.uber.com/blog/crisp-critical-path-analysis-for-microservice-architectures/

However, as you're finding, where async spans or "interesting groups of siblings" are concerned, it's not always the same path from trace to trace, and there may be some inconsistencies.
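To make the walk-backwards idea concrete, here's a toy sketch of that heuristic (not Jaeger's or CRISP's actual code; the data structure is made up for the example): from a span's end, repeatedly pick the child that finished last before the current cutoff and recurse into it.

```python
# Toy walk-backwards critical-path heuristic over an already-assembled trace.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpanNode:
    name: str
    start: float
    end: float
    children: List["SpanNode"] = field(default_factory=list)


def critical_path(span: SpanNode, cutoff: Optional[float] = None) -> List[str]:
    cutoff = span.end if cutoff is None else cutoff
    path = [span.name]
    while True:
        # Among children that finished before the cutoff, assume the one that
        # finished last was the blocker, then continue from its start time.
        candidates = [c for c in span.children if c.end <= cutoff]
        if not candidates:
            return path
        blocker = max(candidates, key=lambda c: c.end)
        path.extend(critical_path(blocker))
        cutoff = blocker.start


# The OP's first example: foo and bar are siblings, bar finishes later,
# so the path runs through bar.
root = SpanNode("request", 0.0, 10.0, [SpanNode("foo", 0.0, 4.0), SpanNode("bar", 0.0, 9.0)])
print(critical_path(root))  # ['request', 'bar']
```

Because it tie-breaks on timestamps alone, this is exactly where it can pick the "wrong" sibling for async traces like the ones above.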

magicmark commented 1 week ago

@cartermp thanks! I haven't fully dug into it yet, but I'm wondering if I'm just running into what I originally suspected, which is that the trace data fundamentally doesn't provide enough information to accurately work out the critical path.

Working backwards makes sense, but in an async world you effectively have to "guess" which previous span was the blocking one? And sometimes we get it wrong, as seen above? (Minor discrepancies between different traces for the same page aside, this seems like a reproducible issue.)

magicmark commented 1 week ago

To confirm, this is my minimal repro:

[screenshot: Jaeger's critical path shows get /foo3 as blocked by get /foo2]

This visualization is wrong ^

get /foo3 is not blocked by get /foo2.

Produced by this code:

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())
Full code (including imports):

```python
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("async_example")


async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)


async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()


async def foo2():
    await make_fake_network_request("foo2", 0.2)


async def foo3():
    await make_fake_network_request("foo3", 0.3)


@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())


asyncio.run(main())
```

^ It seems that when the algorithm walks backwards, it can only do so heuristically? In this case, this highlights the wrong path.

Hence this GitHub issue on the spec itself: I don't think we provide enough data to do this reliably?

(I could imagine that by providing extra data in spans (specifically previous_sibling_span_id) and making the SDK do this work, this visualization would be possible.)
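For what it's worth, here's a rough sketch of how that could be approximated today without a spec change: a custom SpanProcessor that stamps each newly started span with the id of the most recently ended span under the same parent. This is purely a heuristic. It keys only on the shared parent, so for truly concurrent siblings (like the repro above) it can still pick the wrong predecessor; real causality would need per-task bookkeeping or help from the instrumented code itself.

```python
# Heuristic sketch, not a spec feature: stamp each new span with the most
# recently *ended* sibling under the same parent. The attribute name is the
# hypothetical one from this issue. (No locking or eviction; sketch only.)
from typing import Dict, Optional

from opentelemetry.context import Context
from opentelemetry.sdk.trace import ReadableSpan, Span, SpanProcessor


class PreviousSiblingProcessor(SpanProcessor):
    def __init__(self) -> None:
        self._last_ended_by_parent: Dict[int, int] = {}

    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        parent = span.parent  # SpanContext of the parent, or None for root spans
        if parent is None:
            return
        prev = self._last_ended_by_parent.get(parent.span_id)
        if prev is not None:
            span.set_attribute("previous_sibling_span_id", format(prev, "016x"))

    def on_end(self, span: ReadableSpan) -> None:
        parent = span.parent
        if parent is not None:
            self._last_ended_by_parent[parent.span_id] = span.context.span_id
```

It would be registered like the other processors in the repro code, e.g. `trace.get_tracer_provider().add_span_processor(PreviousSiblingProcessor())`.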