Rust native re-implementation of parts of Tracy #109

jasontt commented 3 months ago

As mentioned in my TRACY_NO_VERIFY PR, I have been exploring alternative methods for writing Spans to Tracy, other than via its C API and the C bindings provided by this library.

My current approach has been to add Tracy as a git submodule and use cxx to create a new C API from Tracy's public C++ API. My main motivation is to address a couple of drawbacks of the current approach:

  1. We pay for a function indirection in every Span because nothing is inlined
  2. Tracy's C API is fundamentally not as safe/sound/performant/ergonomic as what we could build in Rust
     a. Profiling is a canonical use case for non-temporal writes, which Tracy does not currently make use of
     b. Rust can convert thread-local statics to regular statics when it knows there is no threading (this will not work over FFI)
     c. As discussed in the aforementioned PR, we could adjust our own spans to only commit when they are dropped instead of performing two separate operations
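
To make point 2c concrete, here is a minimal sketch of a span that commits only when it is dropped. Everything in it is hypothetical and is not part of tracy-client or Tracy: the `Span` guard, the `SpanEvent` record, the thread-local `EVENTS` buffer and `drain_events` are names invented for illustration, and `std::time::Instant` stands in for whatever timestamp source a real client would use.

```rust
use std::cell::RefCell;
use std::time::Instant;

/// One completed span. Begin and end are recorded together in a single write.
#[derive(Debug, Clone)]
pub struct SpanEvent {
    pub name: &'static str,
    pub begin: Instant,
    pub end: Instant,
}

thread_local! {
    // Hypothetical per-thread buffer. A real client would hand these
    // events to the profiler instead of keeping them in a Vec.
    static EVENTS: RefCell<Vec<SpanEvent>> = RefCell::new(Vec::new());
}

/// Guard that keeps the start time on the stack and writes nothing until dropped.
pub struct Span {
    name: &'static str,
    begin: Instant,
}

impl Span {
    #[inline(always)]
    pub fn enter(name: &'static str) -> Self {
        Span { name, begin: Instant::now() }
    }
}

impl Drop for Span {
    #[inline(always)]
    fn drop(&mut self) {
        // The only buffer write for this span happens here.
        let event = SpanEvent {
            name: self.name,
            begin: self.begin,
            end: Instant::now(),
        };
        EVENTS.with(|events| events.borrow_mut().push(event));
    }
}

/// Takes every event recorded so far on the current thread.
pub fn drain_events() -> Vec<SpanEvent> {
    EVENTS.with(|events| std::mem::take(&mut *events.borrow_mut()))
}
```

One consequence of this design is that nested spans are committed in the order they end, so an inner span lands in the buffer before its parent. Whatever consumes the buffer would need to account for that.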

While it is perfectly viable for me to continue with this approach, it may be nice to centralise the binding of Tracy in tracy-client-sys. I can also imagine users wanting to mix and match features from tracy-client/tracy-client-sys with Span logic I've built in my own crate, and doing this without a common dependency on Tracy feels fragile. So I'm wondering whether this project has any opinions on the matter. Perhaps we could separate libTracyClient into a new crate?

nagisa commented 3 months ago

My immediate observation is that inlining with LTO is going to work just fine. Even if there's use of dynamic dispatch for any reason, devirtualization can help in some cases. The major cost centre is rather the allocation/handling of srclocs…

I wouldn't mind experiments trying to bind the C++ API instead, though. If nothing else, the C++ API is more of a first class citizen, but I don't expect there to be any meaningful performance implications here.

jasontt commented 3 months ago

So to be clear, the Tracy C API is not inlined into our spans regardless of your Cargo profile settings. You can objdump the resulting binary and clearly see the jumps in the assembly. As far as I can tell this is expected when calling functions via extern "C" in Rust, but I would be very happy to be proven otherwise. If I add this function to tracy-client/benches/ you can see this clearly, along with some other interesting things of note.

pub fn single_span() {
    let _ = tracy_client::span!("single_span", 0);
        push rax
        mov rax, qword ptr [rip + client::single_span::LOC+48]
        cmp rax, 2
        jne .LBB245_1
.LBB245_2:
        lea rdi, [rip + client::single_span::LOC+16]
        mov esi, 1
        call qword ptr [rip + ___tracy_emit_zone_begin@GOTPCREL]
        mov rdi, rax
        pop rax
        jmp qword ptr [rip + ___tracy_emit_zone_end@GOTPCREL]
.LBB245_1:
        lea rdi, [rip + client::single_span::LOC]
        mov rsi, rdi
        call once_cell::imp::OnceCell<T>::initialize
        jmp .LBB245_2

So what's going on here:

  1. For 0 callstack depth there is no allocation after the first call to OnceCell<T>::initialize; we get an extra cmp and jne but nothing major
  2. We can see the call/jmp into Tracy's C API
  3. In terms of the code we generate ourselves, there is not much we could remove
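
The cmp/jne pair in point 1 is the fast-path check of a lazily initialised static. The sketch below reproduces the same shape with `std::sync::OnceLock` as a stand-in for the once_cell type seen in the listing. The `SpanLocation` struct, the `INIT_CALLS` counter and `single_span_location` are hypothetical and are not tracy-client's actual types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical source-location record. The owned String is the part
/// that needs a runtime allocation on first use.
#[derive(Debug)]
pub struct SpanLocation {
    pub name: &'static str,
    pub function: String,
    pub line: u32,
}

/// Counts how often the slow path runs. For demonstration only.
pub static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

pub fn single_span_location() -> &'static SpanLocation {
    static LOC: OnceLock<SpanLocation> = OnceLock::new();
    // Fast path: one load and compare of the cell's state.
    // Slow path: taken once, runs the closure and allocates.
    LOC.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::Relaxed);
        SpanLocation {
            name: "single_span",
            function: format!("{}::single_span", module_path!()),
            line: line!(),
        }
    })
}
```

Calling `single_span_location()` repeatedly returns the same reference, and the closure runs only on the first call, which matches the "no allocation after the first call" observation.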

This may well be different for some of the other Spans (non-zero callstack depth etc.), as I have not looked into these extensively. As far as our fastest spans are concerned, though, the only optimisations we can really make would be to:

  1. make the LOC entirely static, albeit this is very minor
  2. optimise our usage of Tracy / Tracy itself
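
As a rough illustration of option 1, a location built only from compile-time constants can live in a plain `static`, so no initialisation check is needed at all. This is a sketch under assumptions: `StaticLocation`, the `static_location!` macro and `static_single_span_location` are hypothetical names, and the layout Tracy expects for source locations (for example NUL-terminated strings) is not modelled here.

```rust
/// Hypothetical source-location record made only of compile-time constants.
#[derive(Debug)]
pub struct StaticLocation {
    pub name: &'static str,
    pub file: &'static str,
    pub line: u32,
}

/// Expands to a reference to a `static` that is fully built at compile time,
/// so there is no first-use check on the hot path.
macro_rules! static_location {
    ($name:literal) => {{
        static LOC: StaticLocation = StaticLocation {
            name: $name,
            file: file!(),
            line: line!(),
        };
        &LOC
    }};
}

pub fn static_single_span_location() -> &'static StaticLocation {
    static_location!("single_span")
}
```

The trade-off is that every field has to be known at compile time, so anything currently computed at runtime would need a constant replacement.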

I've done some of my own experiments into 2., which I will aim to report back on here once I've worked through my ideas a bit more. At the moment I have some code that does all spans in native Rust and dumps them into the Tracy client at program close, but I'm not sure if/how that fits in with this crate at this stage.

Ralith commented 3 months ago

From an old blog post it sounds like cross-language LTO should be feasible but may be a bit fragile. Perhaps some adjustment is needed to how the C/C++ code is built?

jasontt commented 3 months ago

It’s definitely my understanding that it could/should be possible, although it still won’t allow for different build-time optimisations in LLVM. That might be okay, though?

Definitely worth an experiment.

Something to convert Intel PT traces to Tracy could also be worth a shot. I wanted to compare this as an alternative, but the existing tools are far too sparsely documented, particularly when working in Rust.

jasontt commented 3 months ago

I managed to get cross-language LTO working, although in a tight loop we really don't gain much.

Experiment commit is here

Without cross-language LTO (the calls into the C API remain):

0000000000104ef0 <single_span>:
  104ef0:       50                      push   %rax
  104ef1:       48 8b 05 c8 e8 18 00    mov    0x18e8c8(%rip),%rax        # 2937c0 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x30>
  104ef8:       48 83 f8 02             cmp    $0x2,%rax
  104efc:       75 1c                   jne    104f1a <single_span+0x2a>
  104efe:       48 8d 3d 9b e8 18 00    lea    0x18e89b(%rip),%rdi        # 2937a0 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x10>
  104f05:       be 01 00 00 00          mov    $0x1,%esi
  104f0a:       67 e8 90 27 f8 ff       addr32 call 876a0 <___tracy_emit_zone_begin>
  104f10:       48 89 c7                mov    %rax,%rdi
  104f13:       58                      pop    %rax
  104f14:       e9 a7 2c f8 ff          jmp    87bc0 <___tracy_emit_zone_end>
  104f19:       90                      nop
  104f1a:       48 8d 3d 6f e8 18 00    lea    0x18e86f(%rip),%rdi        # 293790 <_ZN6client11single_span3LOC17h197cedbae853ff01E>
  104f21:       48 89 fe                mov    %rdi,%rsi
  104f24:       e8 07 03 ff ff          call   f5230 <_ZN9once_cell3imp17OnceCell$LT$T$GT$10initialize17h8ce5003803f1491dE>
  104f29:       eb d3                   jmp    104efe <single_span+0xe>

With cross-language LTO (the Tracy calls are inlined):

00000000000d2670 <single_span>:
   d2670:       41 57                   push   %r15
   d2672:       41 56                   push   %r14
   d2674:       41 54                   push   %r12
   d2676:       53                      push   %rbx
   d2677:       50                      push   %rax
   d2678:       48 8b 05 09 10 1c 00    mov    0x1c1009(%rip),%rax        # 293688 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x30>
   d267f:       48 83 f8 02             cmp    $0x2,%rax
   d2683:       0f 85 bf 00 00 00       jne    d2748 <single_span+0xd8>
   d2689:       f0 ff 05 48 15 1c 00    lock incl 0x1c1548(%rip)        # 293bd8 <_ZN5tracyL10s_profilerE.llvm.9957044460052729621+0x58>
   d2690:       e8 2b 2a 1b 00          call   2850c0 <_ZTHN5tracy7s_tokenE>
   d2695:       64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
   d269c:       00 00 
   d269e:       4c 8d b8 e0 ff ff ff    lea    -0x20(%rax),%r15
   d26a5:       64 48 8b 1c 25 e0 ff    mov    %fs:0xffffffffffffffe0,%rbx
   d26ac:       ff ff 
   d26ae:       4c 8b 73 28             mov    0x28(%rbx),%r14
   d26b2:       4d 89 f4                mov    %r14,%r12
   d26b5:       49 81 e4 ff ff 00 00    and    $0xffff,%r12
   d26bc:       75 0b                   jne    d26c9 <single_span+0x59>
   d26be:       48 89 df                mov    %rbx,%rdi
   d26c1:       4c 89 f6                mov    %r14,%rsi
   d26c4:       e8 67 07 1b 00          call   282e30 <_ZN5tracy10moodycamel15ConcurrentQueueINS_9QueueItemENS0_28ConcurrentQueueDefaultTraitsEE16ExplicitProducer19enqueue_begin_allocEm>
   d26c9:       48 8b 4b 48             mov    0x48(%rbx),%rcx
   d26cd:       41 c1 e4 05             shl    $0x5,%r12d
   d26d1:       42 c6 04 21 0f          movb   $0xf,(%rcx,%r12,1)
   d26d6:       0f 31                   rdtsc
   d26d8:       48 c1 e2 20             shl    $0x20,%rdx
   d26dc:       48 01 c2                add    %rax,%rdx
   d26df:       4a 89 54 21 01          mov    %rdx,0x1(%rcx,%r12,1)
   d26e4:       48 8d 05 7d 0f 1c 00    lea    0x1c0f7d(%rip),%rax        # 293668 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x10>
   d26eb:       4a 89 44 21 09          mov    %rax,0x9(%rcx,%r12,1)
   d26f0:       49 ff c6                inc    %r14
   d26f3:       4c 89 73 28             mov    %r14,0x28(%rbx)
   d26f7:       e8 c4 29 1b 00          call   2850c0 <_ZTHN5tracy7s_tokenE>
   d26fc:       49 8b 1f                mov    (%r15),%rbx
   d26ff:       4c 8b 73 28             mov    0x28(%rbx),%r14
   d2703:       4d 89 f7                mov    %r14,%r15
   d2706:       49 81 e7 ff ff 00 00    and    $0xffff,%r15
   d270d:       75 0b                   jne    d271a <single_span+0xaa>
   d270f:       48 89 df                mov    %rbx,%rdi
   d2712:       4c 89 f6                mov    %r14,%rsi
   d2715:       e8 16 07 1b 00          call   282e30 <_ZN5tracy10moodycamel15ConcurrentQueueINS_9QueueItemENS0_28ConcurrentQueueDefaultTraitsEE16ExplicitProducer19enqueue_begin_allocEm>
   d271a:       48 8b 4b 48             mov    0x48(%rbx),%rcx
   d271e:       41 c1 e7 05             shl    $0x5,%r15d
   d2722:       42 c6 04 39 11          movb   $0x11,(%rcx,%r15,1)
   d2727:       0f 31                   rdtsc
   d2729:       48 c1 e2 20             shl    $0x20,%rdx
   d272d:       48 01 c2                add    %rax,%rdx
   d2730:       4a 89 54 39 01          mov    %rdx,0x1(%rcx,%r15,1)
   d2735:       49 ff c6                inc    %r14
   d2738:       4c 89 73 28             mov    %r14,0x28(%rbx)
   d273c:       48 83 c4 08             add    $0x8,%rsp
   d2740:       5b                      pop    %rbx
   d2741:       41 5c                   pop    %r12
   d2743:       41 5e                   pop    %r14
   d2745:       41 5f                   pop    %r15
   d2747:       c3                      ret
   d2748:       48 8d 3d 09 0f 1c 00    lea    0x1c0f09(%rip),%rdi        # 293658 <_ZN6client11single_span3LOC17h197cedbae853ff01E>
   d274f:       48 89 fe                mov    %rdi,%rsi
   d2752:       e8 59 02 ff ff          call   c29b0 <_ZN9once_cell3imp17OnceCell$LT$T$GT$10initialize17h8ce5003803f1491dE>
   d2757:       e9 2d ff ff ff          jmp    d2689 <single_span+0x19>

nagisa commented 3 months ago

Right, I don't think there's much to change here. CC/CXX are already configurable by users, and CXXFLAGS for tracy-client-sys can be passed through TRACY_CLIENT_SYS_CXXFLAGS. The LTO configuration on the Rust side ought to be set by the project that uses Tracy, rather than within the Tracy bindings here.

Not sure why anyone would go through all this effort, given the negligible improvements.