tokio-rs / tracing

Application level tracing for Rust.
https://tracing.rs
MIT License
5.15k stars 673 forks source link

core: fix missed `register_callsite` error #2938

Open hds opened 2 months ago

hds commented 2 months ago

Note: This is a change against the v0.1.x branch.

Motivation

There are 2 triggers which will cause a subscriber to receive a call to Subscriber::register_callsite for a specific callsite.

  1. The first time the event or span at that callsite is executed.
  2. When a new subscriber is added or removed (for example, calls to set_default or with_default)

It is trigger (2) that will cause a new subscriber to receive Subscriber::register_callsite for all the callsites which had already been registered before it became active.

When a callsite is registered for trigger (1), the callsite starts in state UNREGISTERED.

The first thread to encounter the callsite will transition it to REGISTERING and determine the overall interest for the callsite by registering with all known dispatchers (which will call into Subscriber::register_callsite).

Once that is complete, the callsite is added to the list of all known callsites and its state is transitioned to REGISTERED.

is (re)built for all known dispatchers. The callsite starts in state UNREGISTERED. The This calls down into Subscriber::register_callsite for each subscriber. Once that is complete, the callsite is added to the global list of known callsites.

While the callsite interest is being rebuilt, other threads that encounter the callsite will be given Interest::sometimes() until the registration is complete. However, if a new subscriber is added during this window, all the interest for all callsites will be rebuilt, but because the new callsite (in state REGISTERING) won't be included because it isn't yet in the global list of callsites.

This can cause a case where that new subscriber being added won't receive Subscriber::register_callsite before it receives the subsequent call to Subscriber::event or Subscriber::new_span.

The documentation on Registering Callsites is not very explicit on this point, but it does suggest that Subscriber::register_callsite will be called before the call to either Subscriber::event or Subscriber::new_span, and the current behavior can break this implicit contract.

Solution

This change swaps the order of rebuilding the callsite interest and adding the callsite to the global list so that the callsite gets pushed first, avoiding this window in which a subscriber won't get a call to register_callsite.

As such, a callsite may have its interest read before it is set. In this case, the existing implementation will return Interest::sometimes() for the DefaultCallsite implementation. Other implementations (outside of the tracing project) may perform this differently, but in this case, there is no documented guarantee regarding the ordering.

A regression test is included which provokes the race condition 100% of the time before the changes in this fix.

Fixes: #2743

hds commented 2 months ago

@hawkw I'd love to get your opinion on the change I've made in the callsite module. It seems fine for me (and certainly no tests failing), but I think you're probably the only person in the world who would know.

The test in tracing can probably go, and I might need another test in tracing-core to cover the dyn Callsite case. Also, interestingly, I couldn't reproduce this issue on master.

hawkw commented 2 months ago

@hawkw I'd love to get your opinion on the change I've made in the callsite module. It seems fine for me (and certainly no tests failing), but I think you're probably the only person in the world who would know.

Thanks for the ping! I'll try to give it a look today or tomorrow.

mladedav commented 2 months ago

@hds you mentioned the docs not being explicit on the topic, could you maybe clarify them? You just might be the second most knowledgeable person on the topic now.