subscriber: Enable `Layer`s to read a recently-closed span

davidbarsky commented 4 years ago

(This is followup from #420)

Layer::on_close should be able to read the just closed span(s). At the moment, it's not possible to see a fully-completed span & export span's data, as the registry is not guaranteed to have the just-closed span in question.

davidbarsky commented 4 years ago

I know that @hawkw had a lot of opinions and thoughts as to how this should be implemented; far more than me.

hawkw commented 4 years ago

Having given this some thought, it turns out to be a surprisingly tricky problem. Here's some of what I've been thinking:

Goals

The specific functionality we want here is that when a span closes, the Layers handling an on_close callback should be able to access that span's data before it is removed from the registry.

Some requirements:

"Spans should close when they close": What I mean by this is: if a parent span is closed because a child closed, decrementing its ref count, the closure should occur as close as possible to when the child closed. If an on_close method records a timestamp for when a span closed, those timestamps should be accurate.
Spans close in order: the on_close method should not be called for a parent span until the on_close callback for its parent has been completed.

There are also some non-functional requirements:

User code (i.e. Layer implementations) should not have to be aware of this. i.e., there should not be a special Context method that layers have to call to say "I'm done handling an on_close" or anything like that
Ideally we should be able to implement this just in tracing-subscriber. We shouldn't have to modify the Subscriber trait or anything else in tracing-core unless absolutely necessary.
Add as little overhead as possible

Non-goals

The closed span does not (and probably should not) be visible to threads other than the one where the close occurred.
While the close event needs to be processed immediately, storage used by the closed span does not need to be freed immediately. It would be okay to defer removing the slab slot used by a closed span, as long as it's released by the time the next new span is created.

Potential Solutions

Ideally, we would have some method on Registry which is called when all on_close callbacks are done, indicating that a span closure has been processed. Then, the span can be dropped. There are two major challenges to this:

How would we expose this method? In general, layers are written against a S: Subscriber. In some cases, the subscriber is required to implement LookupSpan, but that bound is introduced by the type implementing Layer, and the Layered struct cannot expect it. However, if we don't want user Layers to have to manually call this method, Layered would need to do it automatically. If Rust had specialization, we could add a different implementation when the root subscriber is a Registry or implements some trait, but this is not possible today.
How do we determine when all layers have finished processing the close? Either it needs to be called when the stack's Subscriber::try_close impl is complete, or, we need some way of knowing which layer is the last in the stack.

Here are some potential approaches:

The One With The `tracing-core` Change

One solution is to add a new method to the Subscriber in tracing-core that is called when a span close is complete — something like close_finished or similar. Then, we could change Dispatch::try_close to something like:

    pub fn try_close(&self, id: span::Id) -> bool {
        let closed = self.subscriber.try_close(id.clone());
        if closed {
            self.subscriber.finish_close(id);
        }
        closed
    }

This way, the method would always be called when a close is done, without any complicated or slow counting of layers.

In order to avoid a breaking change, the close_finish method would need a default implementation that does nothing. Subscriber implementations would opt in to running behavior in this callback.

Pros

fairly simple
low overhead
adds a new general purpose API that other subscriber implementations could use

Cons

requires a tracing-core change. we should avoid this if at all possible.
adds another close-related method to the already-complex Subscriber trait.
adds another default impl to Subscriber. yay, more stuff to forget to override!

The One With The `downcast_ref`

To avoid a tracing_core change, we could add behaviour to the Layered type instead. It could call downcast_ref on the inner subscriber after a close, and if it is able to downcast that subscriber to a Registry, calls a special method on Registry.

We would need to figure out a solution for determining which layer is the "last". One potential solution is having a counter on the registry which is incremented at the start of every Layered::try_close call, and decremented when that call finishes. When the count reaches 0 again, all the layers have processed the close. Registry could expose private methods for incrementing/decrementing this ref count (via a drop guard?),

Pros

invisible to user code
no tracing-core change

Cons

specific to Registry
downcasting in every call to Layered::try_close adds overhead
counting calls adds some overhead
not applicable to other subscribers (which may not be a con)
special-cases Registry in Layered (although the code could at least be feature flagged)

The One With The `downcast_ref` Misuse

As a performance optimization, the previous approach could be modified slightly by adding a newtype that represents a function called to begin/end a close. The Registry could override downcast_raw to allow "downcasting" to that newtype, and return a function pointer. Rather than having Layered perform this downcasting on every close call, it could do this when it is initially wrapping Subscriber, by trying to downcast the wrapped subscriber to get the function pointer. Then, Layered could store the function pointer internally, and call it on every try_close.

This has most of the same pros and cons as the previous solution, with the following differences:

Pros

doesn't require downcasting on every call

Cons

probably involves some scary-looking unsafe code
definitely an abuse of the downcast_ref API, but we're the maintainers and we're allowed to do Bad Things that you definitely shouldn't try at home. depending on your perspective, this might be a pro ;)

I'll keep thinking about this, and I may come up with more proposed solutions, but here's something to start with.

hawkw commented 4 years ago

It's probably best to start with a solution that doesn't require a tracing-core change. If there ends up being a need for this type of thing outside of Registry, then we can consider adding it to core.

davidbarsky commented 4 years ago

Closed via https://github.com/tokio-rs/tracing/commit/5d2ebc6ed7fb2976258f895f49f477f099258fcd.

tokio-rs / tracing