Client never calls `___tracy_shutdown_profiler`

tracy_client::Client doesn't provide any way to call ___tracy_shutdown_profiler, which leads to incomplete traces and complaints from the Tracy GUI:

I gather from other issues that there was previously some reference counting magic somewhere but it got ripped out, so I can believe it's difficult to tell when it is safe to call the shutdown function. But it does seem like not cleaning things up causes problems.

I'd love to hear more about the background than I could glean from reading tracy-client/src/state.rs.

https://github.com/nagisa/rust_tracy_client/blob/53fa375971520bfd2e9c8fd80f292ed64d15c263/tracy-client/src/state.rs#L27-L28

is the relevant comment. I don't recall seeing this message before, but since you mention problems, I guess a question worth asking is whether there are truly missing source locations, unresolved stack frames, etc, that you're seeing?

Regardless it seems quite plausible to pursue implementing a dedicated lifetime management thread if only to get rid of this message. I can’t imagine an use-case for setting up and tearing down the client frequently enough that the thread setup costs would become unbearable.

I don't know, exactly. I was profiling wgpu's "water" example, which was reaching 700FPS on my system and spewing out quite a lot of data. In the end more than half of the profile was received by Tracy after I hit ESC to exit the demo.

Looking at the Tracy v0.10 sources (tracy-client-sys v0.22), it seems like none of this should matter if you don't select the delayed-init and manual-lifetime features. Without those, it seems like Tracy v0.10 will take care of all this initialization itself. Would it make sense for tracy-client to simply not request those features?

https://github.com/nagisa/rust_tracy_client/issues/29 is the relevant reading for why manual-lifetime is required.

I see. One last question: state.rs says:

Alas, Tracy's extensive use of thread-local storage presents us with another problem – we must start up and shut down the client within the same thread.

How does this requirement arise? I see that tracy::ShutdownProfiler does use some thread_local variables, but not in a way that would cause a problem if StartupProfiler and ShutdownProfiler were called on different threads.

I don't exactly remember all the details since it was all implemented almost two years ago at this point, but my memory suggests this problem/requirement was established experimentally – originally we had implemented proper reference counting and would properly terminate when the counter reached zero and all that, yet things would still explode. Quite likely somewhere in the rpmalloc allocator that it uses. I also mildly remember repeat initialization (i.e. initialize, use, deinitialize, initialize, use…) being a problematic case.

It would have been a good idea for the past me to create a commit with a reproduction and reference it in the comment in question. Alas.

(For the non-manual-lifetime variant this is not a problem, because initialization happens on the main thread in life-before-main, and deinitialization in an on_exit handler which also runs on the main thread)

Either way lets keep this issue open, maybe somebody will come up with a neat solution here.

Ok, here are options I've thought about and why they don't work.

Use a global, atomic is_stopping flag.
- Once a tracy call has started, it's possible to begin a shutdown race since the check for shutting down will have already occurred. Thus, any form of stopping flag is fundamentally broken.
Augment the prior strategy with reference counting and make every tracy API increment the ref count before the call and decrement after.
- This will have garbage performance because every thread will be contendening to write the one cache line used for the atomic add/sub.
Implement some form of reference counting at the Client creation/destruction level.
- Doesn't work because tracing::subscriber::set_global_default keeps a ref to the client forever for example. Thus, the notion of having access to tracy using the lifetime of a rust struct is broken.

A plausible solution looks like this:

Initialize tracy on the main thread pre-main using https://crates.io/crates/ctor.
Keep a global reference count of non-main threads that have accessed tracy.
- Use a thread local for this with a destructor that decrements the ref count. Note that not all platforms support TLS destructors, but oh well.
- When the count reaches zero, send a condvar notification.
Shutdown tracy on the main thread using a ctor destructor.
- Open a global mutex lock to wait for the other threads to notify it that they've all exited.

The issue here is that I'm pretty sure we'll deadlock waiting for a daemon thread since there's no requirement to eventually terminate non-main threads.

Thus, an actually working solution moves away from the client model:

Every tracy API call with need to use an atomic "active" variable (basically a +1 only ref count).
The ref count is stored in a thread local Arc which is registered with the global active thread list (using a normal Mutex and Vec since this happens once on thread creation/destruction).
On shutdown, we flip a global atomic shutdown switch which each API call checks.
- The shutdown process scans the active thread list, spin looping until every thread has exited its tracy API call.

A fully flexible solution augments the prior one:

Spawn a background worker thread on which to do tracy startups and shutdowns.
- This thread is an actor, using a globally accessible message channel to receive startup and shutdown requests.
- Both start and stop messages take in an Arc-ed mutex to support waiting for completion of the request. (Though IIRC there's also a crate that supports one-time usage channels, so you could supply it with the Sender side of the once channel.)
Add a feature flag to start tracy on the first attempt to use it instead of before main.
Offer an API to explicitly shutdown tracy but also do so after main as usual.
Instead of a global shutdown flag, use a global state flag: enum State { Starting, Running, Stopping, Stopped }. Spin loop when trying to move to a certain state but currently in Running, Stopping.

I'm going to do some benchmarking of single-threaded atomics and thread locals to see what the costs would be. Currently, it's 1 thread_local access, 1 atomic read, and 2 atomic writes (all effectively single-threaded) per tracy API call.

If that turns out to be slow, I'd vote for just accepting that we'll segfault if people have background threads using tracy after main cuz this is just a pain. One sad way to resolve things is to provide an unsafe shutdown function and just tell people to call it at the end of main, but this won't even work for apps like bevy which don't have access to main.

Note that this bug must be fixed because as https://github.com/nagisa/rust_tracy_client/issues/80 showed the traces will be in a broken state if you don't shut down properly, so this library is useless for non-long running applications as is.

Thus, an actually working solution moves away from the client model

Hm, given that we’d probably like to support both manual-lifetime and non-manual-lifetime modes of operation in tracy-client, I suspect that we will have to keep Client abstraction around regardless. Outside of tracing-tracy, it is perfectly plausible for the users to ensure that they cleanly terminate threads and terminate the the manually initialized client. In the life-before-main mode of operation we can drop any lifetime accounting entirely (just like we do in not(enable) configuration.)

Every tracy API call with need to use an atomic "active" variable (basically a +1 only ref count).

The ref count is stored in a thread local Arc which is registered with the global active thread list (using a normal Mutex and Vec since this happens once on thread creation/destruction).

On shutdown, we flip a global atomic shutdown switch which each API call checks.

The shutdown process scans the active thread list, spin looping until every thread has exited its tracy API call.

One detail is that we will need to make sure that those tracy API calls also spin loop (or handle the situation in some other way) if the shutdown switch has been flipped.

Some of these ideas evoke in my brain a pattern match with the lock-free memory reclamation solution space. One paper on the topic that I read had an interesting alternative to flags that I think could be a potential option here as well – scanning stacks for pointers from a thread responsible for the memory reclamation.

In our case we could have the wrapper around __tracy_shutdown grab call stacks for all of the threads and check whether any of them are in a __tracy_* call. Shutdown, after all, is not timing sensitive to the nanosecond scale. This could replace the per-thread "inside tracy" flag and possibly reduce the overhead of the hot path calls.

We could also set up a SIGSEGV handler and just stop the threads that have SIGSEGVd during a shutdown… This wouldn’t be great for manual-lifetime use though, and I worry that just :see_no_evil:-ing this will lead to other problems (such as some sort of deadlock inside the tracy guts.)

If that turns out to be slow, I'd vote for just accepting that we'll segfault if people have background threads using tracy after main cuz this is just a pain.

Fair. A big downside is that it can lead to non-deterministic loss of profile data just as well (though I guess that is still a better place to be than a guaranteed loss of profile data :D)

I would actually take a initial PR to make manual-lifetime mode optional right now, and then we can see (or not!) if we want to bother with implementing something more fancy on top.

It would be really nice to uncover the original cause of the crashes when starting and shutting down on different threads. If the Tracy maintainer did not intend to be imposing that requirement (it's not documented), then they might take a bug report or PR.

I'd certainly want confirmation that the requirement was intentional before implementing a complex workaround, rather than fixing the bug at the source.

(Confession: I'm talking big but can't commit my own time to this. Obnoxious.)

like to support both manual-lifetime and non-manual-lifetime modes of operation

Agreed.

I suspect that we will have to keep Client abstraction around regardless.

My argument is more that we should use top-level functions instead of putting them in a Client since the lifetime of Client can't be tied to that of tracy once you're allowed to shut it down.

those tracy API calls also spin loop

Right, the idea is that they would try to start tracy and end up waiting for it to shut down first.

lock-free memory reclamation solution space

I think the main problem here is that we can only have one instance of tracy around. Otherwise yeah it would be really nice to do epoch garbage collection.

In our case we could have the wrapper around __tracy_shutdown grab call stacks for all of the threads

This sounds pretty cool, but I have no idea how you would implement it practically. Assuming there's an API to hook into thread startup so we can register the stack address (couldn't find one with a quick search), we still need some way to know where the stack ends which means we probably need to freeze the remote thread and grab its rbp pointer (no idea what APIs we could use to do that from within a process without ptrace). Then we also need to ensure all of our tracy-client APIs are inline(never) and use linker shenanigans to get the start and end addresses of all our API functions. Once we know a thread's rip register is in one of those ranges, we need some heuristic to run it for some arbitrary amount of time hoping it will exit the tracy APIs. Oh and presumably we need to not pause the tracy threads because some APIs could wait for a response and we'd deadlock.

Basically this sounds clever, but extremely difficult to implement without operating system support. Or maybe I'm overcomplicating things?

I would actually take a initial PR to make manual-lifetime mode optional right now, and then we can see (or not!) if we want to bother with implementing something more fancy on top.

I'd like to do some benchmarks first, but this might be the way to go.

I'd certainly want confirmation that the requirement was intentional before implementing a complex workaround, rather than fixing the bug at the source.

Yeah, that's a good idea, though honestly I don't think it helps us all that much. The fundamental problem here is the combination of two things:

People can store things that use tracy inside statics (whose destructors aren't run on exit) which means we can't use an Arc to know when to shut down.
You can shut down while other threads want to use tracy APIs.

Being able to shut down in any thread would make things a little simpler, but not by much unfortunately (AFAIK).

Finally got around to benchmarking things. It looks like we'll be adding (optimistically) 10-15ns per call:

Total: 887459712
Time: 1.231335103s
Time per iter: 12ns

--- BASELINE ---

Total: 887459712
Time: 60.816923ms
Time per iter: 0ns

Overhead: 1.17051818s, per iter: 11ns
De-register thread

I also benchmarked the cost of span_alloc and it's ~7µs with stack_depth=32 and ~280ns with no stack traces. The cost of exiting a span is ~100ns.

Those numbers seem kinda bad to me. For span exits, that means we'll be paying a 10% penalty. Not great.

Also, it occurred to me that we can still break things in weird ways even if we completely handle startup and shutdown correctly. For example, suppose the main thread is exiting so it shuts down tracy, but a worker thread is in the middle of a span. When the span exists, tracy will get garbage because it'll be started up again and see a span that doesn't exist. I tested what actually happens when you do this and tracy freezes and then segfaults. So we didn't actually fix anything. :confused: We could handle this by including a global shutdown count and having each thread record the last one it saw to ignore invalid span exits. But how do you do this for arbitrary user code we don't control? Might be doable, but it sounds complicated and slow.

I've come to the conclusion that user code which continues executing after main is fundamentally broken and not our problem. Here's what I propose:

Default to the standard tracy initialization.
Get rid of all the client abstractions and make users call the unsafe startup and shutdown functions themselves when the manual lifetimes feature is enabled. Basically startup and shutdown is not our problem and we tell users they have to manage tracy startup/shutdown themselves if they run code after main. All worded more nicely obviously. :grin:

Thoughts?

Appendix: source code dump

use std::{
    ops::Div,
    sync::{
        atomic::{AtomicU32, Ordering::Relaxed},
        Arc,
    },
    time::Instant,
};

struct TlsState {
    is_using_api: Arc<AtomicU32>,
}

impl Drop for TlsState {
    fn drop(&mut self) {
        println!("De-register thread");
    }
}

static GLOBAL_STATUS: AtomicU32 = AtomicU32::new(0);
thread_local! {
    static STATE: TlsState = {
        let refs = Arc::new(AtomicU32::new(0));
        std::hint::black_box(refs.clone()); // Register with global state
        TlsState { is_using_api: refs }
    };
}

fn main() {
    let iters = 100_000_000;

    let start = Instant::now();
    let mut total = 0u32;
    for i in 0..iters {
        std::hint::black_box(GLOBAL_STATUS.load(Relaxed)); // Ensure tracy is initialized
        total = total.wrapping_add(STATE.with(|s| {
            s.is_using_api.fetch_add(1, Relaxed); // Incr ref count
            let result = std::hint::black_box(i);
            s.is_using_api.fetch_sub(1, Relaxed); // Decr ref count
            result
        }));
    }
    let atomic_time = start.elapsed();
    println!(
        "Total: {total}\nTime: {:?}\nTime per iter: {:?}",
        atomic_time,
        atomic_time.div(iters)
    );

    let start = Instant::now();
    let mut total = 0u32;
    for i in 0..iters {
        total = total.wrapping_add(std::hint::black_box(i));
    }
    let baseline_time = start.elapsed();
    println!("\n--- BASELINE ---\n");
    println!(
        "Total: {total}\nTime: {:?}\nTime per iter: {:?}",
        baseline_time,
        baseline_time.div(iters)
    );

    let overhead = atomic_time - baseline_time;
    println!(
        "\nOverhead: {:?}, per iter: {:?}",
        overhead,
        overhead.div(iters)
    );
}

Subject: [PATCH] Benchmarking
---
Index: tracing-tracy/src/lib.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/tracing-tracy/src/lib.rs b/tracing-tracy/src/lib.rs
--- a/tracing-tracy/src/lib.rs  (revision 20b608cff68edbd8ed4039f97388be696e3211e0)
+++ b/tracing-tracy/src/lib.rs  (date 1705193472169)
@@ -49,7 +49,9 @@
 #![doc = include_str!("../FEATURES.mkd")]
 #![cfg_attr(tracing_tracy_docs, feature(doc_auto_cfg))]

+use std::cell::RefCell;
 use std::sync::atomic::{AtomicUsize, Ordering};
+use std::time::{Duration, Instant};
 use std::{fmt::Write, mem};
 use tracing_core::{
     field::{Field, Visit},
@@ -240,6 +242,7 @@

             event.record(&mut visitor);
             if !visitor.first {
+                let start = Instant::now();
                 self.client.message(
                     self.truncate_to_length(
                         (u16::MAX - 1).into(),
@@ -248,6 +251,32 @@
                     ),
                     self.stack_depth,
                 );
+                let time = start.elapsed();
+
+                #[derive(Default)]
+                struct Stats {
+                    total_duration: Duration,
+                    total_calls: u32,
+                }
+
+                impl Drop for Stats {
+                    fn drop(&mut self) {
+                        println!(
+                            "message durations: {:?}",
+                            self.total_duration / self.total_calls
+                        );
+                    }
+                }
+
+                thread_local! {
+                    static STATS: RefCell<Stats> = RefCell::new(Stats::default());
+                }
+
+                STATS.with(|s| {
+                    let mut s = s.borrow_mut();
+                    s.total_duration = s.total_duration.checked_add(time).unwrap();
+                    s.total_calls += 1;
+                });
             }
             if visitor.frame_mark {
                 self.client.frame_mark();
@@ -265,21 +294,47 @@
             let file = metadata.file().unwrap_or("<not available>");
             let line = metadata.line().unwrap_or(0);
             let span = |name: &str| {
-                (
-                    self.client.clone().span_alloc(
-                        Some(self.truncate_span_to_length(
-                            name,
-                            file,
-                            "",
-                            "span information is too long and was truncated",
-                        )),
-                        "",
-                        file,
-                        line,
-                        self.stack_depth,
-                    ),
-                    id.into_u64(),
-                )
+                let start = Instant::now();
+                let zzz = self.client.clone().span_alloc(
+                    Some(self.truncate_span_to_length(
+                        name,
+                        file,
+                        "",
+                        "span information is too long and was truncated",
+                    )),
+                    "",
+                    file,
+                    line,
+                    self.stack_depth,
+                );
+                let time = start.elapsed();
+
+                #[derive(Default)]
+                struct Stats {
+                    total_duration: Duration,
+                    total_calls: u32,
+                }
+
+                impl Drop for Stats {
+                    fn drop(&mut self) {
+                        println!(
+                            "span_alloc durations: {:?}",
+                            self.total_duration / self.total_calls
+                        );
+                    }
+                }
+
+                thread_local! {
+                    static STATS: RefCell<Stats> = RefCell::new(Stats::default());
+                }
+
+                STATS.with(|s| {
+                    let mut s = s.borrow_mut();
+                    s.total_duration = s.total_duration.checked_add(time).unwrap();
+                    s.total_calls += 1;
+                });
+
+                (zzz, id.into_u64())
             };

             match fields {
@@ -319,7 +374,34 @@
                     self.stack_depth,
                 );
             }
+            let start = Instant::now();
             drop(span);
+            let time = start.elapsed();
+
+            #[derive(Default)]
+            struct Stats {
+                total_duration: Duration,
+                total_calls: u32,
+            }
+
+            impl Drop for Stats {
+                fn drop(&mut self) {
+                    println!(
+                        "Span::drop durations: {:?}",
+                        self.total_duration / self.total_calls
+                    );
+                }
+            }
+
+            thread_local! {
+                static STATS: RefCell<Stats> = RefCell::new(Stats::default());
+            }
+
+            STATS.with(|s| {
+                let mut s = s.borrow_mut();
+                s.total_duration = s.total_duration.checked_add(time).unwrap();
+                s.total_calls += 1;
+            });
         } else {
             self.client.color_message(
                 "Exiting a tracing span, but got nothing on the tracy span stack!",

Subject: [PATCH] Spans
---
Index: tracing-tracy/src/lib.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/tracing-tracy/src/lib.rs b/tracing-tracy/src/lib.rs
--- a/tracing-tracy/src/lib.rs  (revision 20b608cff68edbd8ed4039f97388be696e3211e0)
+++ b/tracing-tracy/src/lib.rs  (date 1705194459852)
@@ -67,6 +67,7 @@
 use utils::{StrCache, StrCacheGuard, VecCell};

 pub use client;
+use client::sys::___tracy_shutdown_profiler;

 thread_local! {
     /// A stack of spans currently active on the current thread.
@@ -319,7 +320,11 @@
                     self.stack_depth,
                 );
             }
+            dbg!("here");
             drop(span);
+            unsafe {
+                ___tracy_shutdown_profiler();
+            }
         } else {
             self.client.color_message(
                 "Exiting a tracing span, but got nothing on the tracy span stack!",

    let gen_span = tracing::span!(tracing::Level::TRACE, "dir_gen");
    let gen_span2 = tracing::span!(tracing::Level::TRACE, "dir_gen");
    let gen_span = gen_span.enter();
    let gen_span2 = gen_span2.enter();
    println!("yo");
    drop(gen_span2);
    drop(gen_span);
    std::process::exit(0);

I’m not sure why the insistence on removing the Client wrappers. What’s wrong with an approach like e.g. what I drafted up in #92? Especially since it allows us to keep the uniform "safe" API regardless of manual vs. fancy-life-before-main-shenanigans modes.

Anyway this feels like its just gonna remain one of those things that will keep being a pain point no matter how we serve the dish. That’s just the life you gotta live when integrating with external projects without much influence on their behaviour or direction :) Sounds to me like I’ll just end up being forced to create a pinned issue to stop people from reporting about occasional crashes in their instrumented code. Perhaps that’s good enough?

I’m not sure why the insistence on removing the Client wrappers. What’s wrong with an approach like e.g. what I drafted up in https://github.com/nagisa/rust_tracy_client/pull/92?

Nothing wrong with #92, I think I'm just not seeing the purpose of Client: why not have devs call startup when they need it? Or is the idea that people want the lazy initialization? That's perfectly valid, but I thought the reason for introducing Client was to avoid segfaults not delay initialization.

Anyway this feels like its just gonna remain one of those things that will keep being a pain point no matter how we serve the dish.

Yup, agreed lol.

Sounds to me like I’ll just end up being forced to create a pinned issue to stop people from reporting about occasional crashes in their instrumented code. Perhaps that’s good enough?

Yeah, it kind of has to be. Though we can maybe also document this stuff in the README?

With the most recent releases you should get the more usual tracy experience out of the box. I imagine that if the default configuration is not enough to get rid of this message, then enabling the flush-on-exit feature should definitely take care of it (as long as you don't tend to have random stray detached threads that are trying to talk tracy.)

@nagisa thanks for working with me through all this! I'm really happy with where we're at. :)

nagisa / rust_tracy_client

Client never calls `___tracy_shutdown_profiler` #78

Appendix: source code dump