Spans and flame graphs - GitHub issues

emilk commented 8 months ago

Currently all our data is associated with a single instant in time: each datum is an event.

There are, however, many things that require data to span a time range, such as audio and video.

Spans are also useful for flame graphs, which are a way to visualize a call graph.

Such a flame graph is useful for profiling, but also for observability, i.e. understanding how the pieces of a system are connected.

Implementation

An easy way to implement this is to use a special enum Span { Begin, End } component.
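To make the Begin/End idea concrete, here is a minimal Python sketch of how such markers could be paired back into intervals. The names `Span`, `pair_spans` and the `(time, name, marker)` tuple layout are made up for illustration and are not part of any existing API. It assumes the markers come from a single thread and are properly nested.

```python
from enum import Enum


class Span(Enum):
    """Hypothetical marker component: does a log event open or close a span?"""
    BEGIN = "begin"
    END = "end"


def pair_spans(events):
    """Pair Begin/End markers into (name, start, end) intervals.

    `events` is an iterable of (time, name, Span) tuples sorted by time.
    Assumes proper nesting, so a stack is enough to match markers.
    A Begin without a matching End yields end=None (a still-open span).
    """
    stack = []
    intervals = []
    for time, name, marker in events:
        if marker is Span.BEGIN:
            stack.append((name, time))
        else:
            open_name, start = stack.pop()
            assert open_name == name, "spans must be properly nested"
            intervals.append((name, start, time))
    # Anything left on the stack was never closed.
    intervals.extend((name, start, None) for name, start in stack)
    return intervals


events = [
    (0, "frame", Span.BEGIN),
    (2, "decode", Span.BEGIN),
    (5, "decode", Span.END),
    (9, "frame", Span.END),
]
intervals = pair_spans(events)
```

Inner spans close first, so they come out of `pair_spans` before the spans that enclose them.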

We then need a special time range query that is aware of spans (e.g. querying for data in the time range [#10, #20] must include a span that covers [#0, #30]).

Together with a special Flame Graph Space View we have a pretty good start.
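The span-aware range query described above boils down to an interval overlap test. A rough sketch, assuming spans are already available as `(name, start, end)` tuples (a hypothetical layout), and treating a missing start or end as unbounded:

```python
def spans_overlapping(spans, query_start, query_end):
    """Return the names of spans that overlap the closed range [query_start, query_end].

    `spans` is an iterable of (name, start, end) tuples.
    A start or end of None means the span is unbounded on that side.
    """
    hits = []
    for name, start, end in spans:
        starts_before_query_ends = start is None or start <= query_end
        ends_after_query_starts = end is None or end >= query_start
        if starts_before_query_ends and ends_after_query_starts:
            hits.append(name)
    return hits


spans = [
    ("outer", 0, 30),     # covers the whole query range
    ("early", 0, 5),      # ends before the query starts
    ("inside", 12, 15),   # fully inside the query range
    ("open", 18, None),   # started inside, not yet ended
]
hits = spans_overlapping(spans, 10, 20)
```

Note that "outer" is returned even though neither of its endpoints lies inside [#10, #20], which is exactly the case a plain point query would miss.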

Threads and processes

For multi-threaded or multi-process data we must have one flame graph per thread:

[Screenshot: puffin_egui]

We should also be able to record relationships between threads. For instance, we want to be able to see that thread A is blocked waiting on threads B and C (see also https://github.com/EmbarkStudios/puffin/issues/174).

This also means log events should come with a ProcessId and ThreadId component.
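As an illustration of what attaching these IDs could look like, here is a small Python sketch using only the standard library. `LogEvent` and `group_by_thread` are hypothetical names, not existing components, and a real implementation would likely store the IDs differently.

```python
import os
import threading
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    """Hypothetical log event that captures where it was logged from."""
    message: str
    process_id: int = field(default_factory=os.getpid)
    thread_id: int = field(default_factory=threading.get_ident)


def group_by_thread(events):
    """Group events into one lane per (process, thread), e.g. one flame graph each."""
    lanes = {}
    for event in events:
        lanes.setdefault((event.process_id, event.thread_id), []).append(event)
    return lanes


collected = []


def worker():
    collected.append(LogEvent("from worker"))


thread = threading.Thread(target=worker)
thread.start()
thread.join()
collected.append(LogEvent("from main"))

lanes = group_by_thread(collected)
```

One caveat with this sketch: Python's `threading.get_ident()` values may be reused after a thread exits, so a real component would probably need a more stable identifier.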

API

For Python, a with block makes sense, as does a function decorator:

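import rerun as rr

# Proposed API (rr.span does not exist yet): as a decorator, it wraps each call in a span.
@rr.span
def my_function(images):
    for image in images:
        # As a context manager, it opens a nested span per image.
        with rr.span(f"image {image.name}"):
            process(image)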

…with optional recording argument
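For reference, one way a `span` helper could support both the bare-decorator form and the with form is sketched below. This is a stand-alone illustration, not the Rerun API: `span`, `_Span` and `DEFAULT_RECORDING` are hypothetical, and the "recording" is just a Python list of `(name, start, end)` tuples.

```python
import functools
import time

DEFAULT_RECORDING = []  # stand-in for a recording: a list of (name, start, end)


class _Span:
    """Context manager that appends one (name, start, end) tuple on exit."""

    def __init__(self, name, recording=None):
        self.name = name
        self.recording = DEFAULT_RECORDING if recording is None else recording

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.recording.append((self.name, self.start, time.perf_counter()))
        return False  # never swallow exceptions


def span(target, recording=None):
    """Use as `@span` on a function, or as `with span("name"):`."""
    if callable(target):
        # Bare decorator: wrap every call in a span named after the function.
        @functools.wraps(target)
        def wrapper(*args, **kwargs):
            with _Span(target.__name__, recording):
                return target(*args, **kwargs)
        return wrapper
    return _Span(target, recording)


@span
def my_function(images):
    for image in images:
        with span(f"image {image}"):
            pass  # real work would go here


my_function(["a", "b"])
names = [name for name, _, _ in DEFAULT_RECORDING]
```

Spans are recorded when they close, so the inner per-image spans are appended before the enclosing `my_function` span.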

In Rust and C++ we would need to use macros, similar to e.g. puffin and loguru.

> Implementation
>
> An easy way to implement this is to use a special enum Span { Begin, End } component.
>
> We then need a special time range query that is aware of spans (e.g. querying for data in the time range [#10, #20] must include a span that covers [#0, #30]).
>
> Together with a special Flame Graph Space View we have a pretty good start.

I'm not sure I understand why we need to introduce a new/special component for this?

Since the timepoint we have today is effectively a start timepoint, an alternative implementation I had in mind was to introduce a second, optional timepoint for every log event, which specifies the end timepoint of the event (which therefore becomes a span rather than an event). If the end timepoint isn't specified, we only look at the start timepoint and consider the event instantaneous, as we do today. Otherwise it's a span.

This quite naturally allows a span to cover a different extent on each timeline (e.g. "this event spanned 278ms of wall-clock time (log_time), 90 simulation ticks (sim_tick) and was instantaneous on the frame timeline (frame_nr)"). Then I don't think we need to change anything query-wise? I haven't thought about it enough to be sure, though.
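A small sketch of this alternative, assuming a timepoint is modeled as a mapping from timeline name to an integer time (with log_time in milliseconds). `TimedEvent` is a hypothetical name used only for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class TimedEvent:
    """Hypothetical log event with a start timepoint and an optional end timepoint."""
    data: str
    start: dict                               # timeline name -> start time
    end: dict = field(default_factory=dict)   # timeline name -> end time (optional per timeline)

    def duration(self, timeline):
        """Extent on `timeline`; 0 means the event is instantaneous there."""
        begin = self.start[timeline]
        return self.end.get(timeline, begin) - begin

    def is_span(self):
        """An event is a span if it has a non-zero extent on any timeline."""
        return any(self.duration(timeline) > 0 for timeline in self.start)


event = TimedEvent(
    data="physics step",
    start={"log_time": 1_000, "sim_tick": 10, "frame_nr": 4},
    end={"log_time": 1_278, "sim_tick": 100},  # no end on frame_nr
)
durations = {timeline: event.duration(timeline) for timeline in event.start}
```

Here the same event lasts 278 on log_time and 90 on sim_tick, and is instantaneous on frame_nr because no end was given for that timeline.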

emilk commented 8 months ago

That is another way of implementing it for sure, but it is quite useful to be able to distinguish an event from a span, and it is also useful to be able to express half-open spans (spans with just a start or just an end).

I envision a flame-graph-like view where log events (e.g. text and images) are shown as single points inside the span that contains them.
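Placing an event inside "the span that contains it" could be done by picking the shortest span whose range includes the event's time. A minimal sketch, assuming closed spans given as `(name, start, end)` tuples (a hypothetical layout) that are properly nested:

```python
def innermost_span(spans, event_time):
    """Return the name of the shortest span containing event_time, or None.

    `spans` is an iterable of (name, start, end) tuples with both ends set.
    With properly nested spans, the shortest containing span is the innermost one.
    """
    containing = [
        (end - start, name)
        for name, start, end in spans
        if start <= event_time <= end
    ]
    return min(containing)[1] if containing else None


spans = [("frame", 0, 9), ("decode", 2, 5)]
owner_of_t3 = innermost_span(spans, 3)
owner_of_t7 = innermost_span(spans, 7)
owner_of_t12 = innermost_span(spans, 12)
```

An event at time 3 lands in "decode", one at time 7 falls back to the enclosing "frame", and one at time 12 belongs to no span.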

nikolausWest commented 8 months ago

Maybe we should separate these concepts as:

- Event: data + a time point
- Duration event: data + a start and end time
- Span: an operation (unit of work) + a start and end time
  - A hierarchical set of spans makes up a trace
    - A trace can be visualized as a flame graph
  - Multiple events can be produced within a span
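To illustrate how a trace (a hierarchical set of spans) maps onto a flame graph, the sketch below assigns each span a row depth equal to the number of spans still open when it starts. The `Span` dataclass and `flame_graph_rows` are hypothetical, and the sketch assumes all spans come from one thread and are properly nested.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span: a unit of work with a start and end time."""
    name: str
    start: float
    end: float


def flame_graph_rows(spans):
    """Return (depth, name) per span, where depth is its row in the flame graph."""
    rows = []
    open_stack = []
    # Sort by start time; for equal starts, put the longer (outer) span first.
    for span in sorted(spans, key=lambda s: (s.start, -s.end)):
        # Drop spans that have already ended before this one starts.
        while open_stack and open_stack[-1].end <= span.start:
            open_stack.pop()
        rows.append((len(open_stack), span.name))
        open_stack.append(span)
    return rows


trace = [
    Span("frame", 0, 9),
    Span("decode", 2, 5),
    Span("render", 5, 8),
    Span("upload", 6, 7),
]
rows = flame_graph_rows(trace)
```

In this trace, "decode" and "render" sit side by side on row 1 under "frame", and "upload" sits on row 2 under "render".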

rerun-io / rerun

Spans and flame graphs #4631
