Closed jasontt closed 3 months ago
My immediate observation is that inlining with LTO is going to work just fine. Even if dynamic dispatch is used for any reason, devirtualization can help in some cases. The major cost centre is rather the allocation/handling of srclocs…
I wouldn't mind experiments trying to bind the C++ API instead, though. If nothing else, the C++ API is more of a first class citizen, but I don't expect there to be any meaningful performance implications here.
So to be clear, the Tracy C API is not inlined into our spans, regardless of your Cargo profile settings. You can objdump the resulting binary and clearly see the jumps in the assembly. As far as I can tell, this seems to be expected when calling functions via extern "C" in Rust, but I would be very happy to be proven otherwise.
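A minimal sketch of why this is the expected behaviour, using the C library's rand as a stand-in for the Tracy C API (the wrapper name c_rand is hypothetical). rustc only ever sees the declaration inside the extern block; the body lives in a separately compiled object, so without cross-language LTO there is nothing available for the optimiser to inline and the call stays a call.

```rust
use std::os::raw::c_int;

// Declaration only: the body of `rand` is compiled separately as part of the
// C library, so rustc has no function body to inline at the call site.
unsafe extern "C" {
    fn rand() -> c_int;
}

/// Hypothetical thin wrapper, playing the role that the span code plays
/// around the Tracy C functions.
pub fn c_rand() -> i32 {
    // SAFETY: `rand` takes no arguments and has no preconditions; it is only
    // called from a single thread here.
    unsafe { rand() }
}
```

Disassembling a release build of c_rand should show a call (or tail jump) to rand rather than its body, in the same way single_span calls into the Tracy symbols.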
If I add this function to tracy-client/benches/client.rs, you can see this clearly, along with some other interesting things of note.
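#[no_mangle]
pub fn single_span() {
    // Binding to `_` drops the span immediately, so the zone begin and
    // zone end are emitted back to back.
    let _ = tracy_client::span!("single_span", 0);
}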
single_span:
push rax
mov rax, qword ptr [rip + client::single_span::LOC+48]
cmp rax, 2
jne .LBB245_1
.LBB245_2:
lea rdi, [rip + client::single_span::LOC+16]
mov esi, 1
call qword ptr [rip + ___tracy_emit_zone_begin@GOTPCREL]
mov rdi, rax
pop rax
jmp qword ptr [rip + ___tracy_emit_zone_end@GOTPCREL]
.LBB245_1:
lea rdi, [rip + client::single_span::LOC]
mov rsi, rdi
call once_cell::imp::OnceCell<T>::initialize
jmp .LBB245_2
So what's going on here:
OnceCell<T>::initialize, we get an extra cmp and jne but nothing major
This very well may be different for some of the other spans (non-zero callstack depth etc.), as I have not looked into these extensively, but as far as our fastest spans are concerned the only optimisations that we can really make would be to:
I've done some of my own experiments into 2., which I will aim to report back on here once I've bottomed out my ideas a bit more. I have some code at the moment that records all spans in native Rust and dumps them into the Tracy client at program close, but I'm not sure if/how that fits in with this crate at this stage.
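For reference, the cmp/jne pair ahead of OnceCell<T>::initialize in the listing above is the usual cost of a lazily initialised static. Here is a rough sketch of that pattern, assuming std's OnceLock as a stand-in for once_cell and a hypothetical SpanLocation type (not the crate's real one):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the per-call-site location data.
pub struct SpanLocation {
    pub name: &'static str,
    pub line: u32,
}

/// Counts how often the slow initialisation path runs (illustration only).
pub static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

pub fn call_site_location() -> &'static SpanLocation {
    static LOC: OnceLock<SpanLocation> = OnceLock::new();
    // Fast path: a state check and a branch. The closure only runs the
    // first time this call site is reached.
    LOC.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        SpanLocation { name: "single_span", line: 0 }
    })
}
```

Every call after the first pays only for the state check, which is why it shows up as just a compare and a branch on the hot path.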
From an old blog post it sounds like cross-language LTO should be feasible but may be a bit fragile. Perhaps some adjustment is needed to how the C/C++ code is built?
It’s definitely my understanding that it could/should be possible, although it still won’t facilitate different build-time optimisations in LLVM, but that might be okay?
Definitely worth an experiment.
Something to convert Intel PT to Tracy could also be worth a shot - I wanted to compare this as an alternative, but the existing tools are far too sparsely documented, particularly when working in Rust.
I managed to get cross-language LTO working, although in a tight loop we really don't gain much.
Experiment commit is here
Before:
0000000000104ef0 <single_span>:
104ef0: 50 push %rax
104ef1: 48 8b 05 c8 e8 18 00 mov 0x18e8c8(%rip),%rax # 2937c0 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x30>
104ef8: 48 83 f8 02 cmp $0x2,%rax
104efc: 75 1c jne 104f1a <single_span+0x2a>
104efe: 48 8d 3d 9b e8 18 00 lea 0x18e89b(%rip),%rdi # 2937a0 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x10>
104f05: be 01 00 00 00 mov $0x1,%esi
104f0a: 67 e8 90 27 f8 ff addr32 call 876a0 <___tracy_emit_zone_begin>
104f10: 48 89 c7 mov %rax,%rdi
104f13: 58 pop %rax
104f14: e9 a7 2c f8 ff jmp 87bc0 <___tracy_emit_zone_end>
104f19: 90 nop
104f1a: 48 8d 3d 6f e8 18 00 lea 0x18e86f(%rip),%rdi # 293790 <_ZN6client11single_span3LOC17h197cedbae853ff01E>
104f21: 48 89 fe mov %rdi,%rsi
104f24: e8 07 03 ff ff call f5230 <_ZN9once_cell3imp17OnceCell$LT$T$GT$10initialize17h8ce5003803f1491dE>
104f29: eb d3 jmp 104efe <single_span+0xe>
After:
00000000000d2670 <single_span>:
d2670: 41 57 push %r15
d2672: 41 56 push %r14
d2674: 41 54 push %r12
d2676: 53 push %rbx
d2677: 50 push %rax
d2678: 48 8b 05 09 10 1c 00 mov 0x1c1009(%rip),%rax # 293688 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x30>
d267f: 48 83 f8 02 cmp $0x2,%rax
d2683: 0f 85 bf 00 00 00 jne d2748 <single_span+0xd8>
d2689: f0 ff 05 48 15 1c 00 lock incl 0x1c1548(%rip) # 293bd8 <_ZN5tracyL10s_profilerE.llvm.9957044460052729621+0x58>
d2690: e8 2b 2a 1b 00 call 2850c0 <_ZTHN5tracy7s_tokenE>
d2695: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
d269c: 00 00
d269e: 4c 8d b8 e0 ff ff ff lea -0x20(%rax),%r15
d26a5: 64 48 8b 1c 25 e0 ff mov %fs:0xffffffffffffffe0,%rbx
d26ac: ff ff
d26ae: 4c 8b 73 28 mov 0x28(%rbx),%r14
d26b2: 4d 89 f4 mov %r14,%r12
d26b5: 49 81 e4 ff ff 00 00 and $0xffff,%r12
d26bc: 75 0b jne d26c9 <single_span+0x59>
d26be: 48 89 df mov %rbx,%rdi
d26c1: 4c 89 f6 mov %r14,%rsi
d26c4: e8 67 07 1b 00 call 282e30 <_ZN5tracy10moodycamel15ConcurrentQueueINS_9QueueItemENS0_28ConcurrentQueueDefaultTraitsEE16ExplicitProducer19enqueue_begin_allocEm>
d26c9: 48 8b 4b 48 mov 0x48(%rbx),%rcx
d26cd: 41 c1 e4 05 shl $0x5,%r12d
d26d1: 42 c6 04 21 0f movb $0xf,(%rcx,%r12,1)
d26d6: 0f 31 rdtsc
d26d8: 48 c1 e2 20 shl $0x20,%rdx
d26dc: 48 01 c2 add %rax,%rdx
d26df: 4a 89 54 21 01 mov %rdx,0x1(%rcx,%r12,1)
d26e4: 48 8d 05 7d 0f 1c 00 lea 0x1c0f7d(%rip),%rax # 293668 <_ZN6client11single_span3LOC17h197cedbae853ff01E+0x10>
d26eb: 4a 89 44 21 09 mov %rax,0x9(%rcx,%r12,1)
d26f0: 49 ff c6 inc %r14
d26f3: 4c 89 73 28 mov %r14,0x28(%rbx)
d26f7: e8 c4 29 1b 00 call 2850c0 <_ZTHN5tracy7s_tokenE>
d26fc: 49 8b 1f mov (%r15),%rbx
d26ff: 4c 8b 73 28 mov 0x28(%rbx),%r14
d2703: 4d 89 f7 mov %r14,%r15
d2706: 49 81 e7 ff ff 00 00 and $0xffff,%r15
d270d: 75 0b jne d271a <single_span+0xaa>
d270f: 48 89 df mov %rbx,%rdi
d2712: 4c 89 f6 mov %r14,%rsi
d2715: e8 16 07 1b 00 call 282e30 <_ZN5tracy10moodycamel15ConcurrentQueueINS_9QueueItemENS0_28ConcurrentQueueDefaultTraitsEE16ExplicitProducer19enqueue_begin_allocEm>
d271a: 48 8b 4b 48 mov 0x48(%rbx),%rcx
d271e: 41 c1 e7 05 shl $0x5,%r15d
d2722: 42 c6 04 39 11 movb $0x11,(%rcx,%r15,1)
d2727: 0f 31 rdtsc
d2729: 48 c1 e2 20 shl $0x20,%rdx
d272d: 48 01 c2 add %rax,%rdx
d2730: 4a 89 54 39 01 mov %rdx,0x1(%rcx,%r15,1)
d2735: 49 ff c6 inc %r14
d2738: 4c 89 73 28 mov %r14,0x28(%rbx)
d273c: 48 83 c4 08 add $0x8,%rsp
d2740: 5b pop %rbx
d2741: 41 5c pop %r12
d2743: 41 5e pop %r14
d2745: 41 5f pop %r15
d2747: c3 ret
d2748: 48 8d 3d 09 0f 1c 00 lea 0x1c0f09(%rip),%rdi # 293658 <_ZN6client11single_span3LOC17h197cedbae853ff01E>
d274f: 48 89 fe mov %rdi,%rsi
d2752: e8 59 02 ff ff call c29b0 <_ZN9once_cell3imp17OnceCell$LT$T$GT$10initialize17h8ce5003803f1491dE>
d2757: e9 2d ff ff ff jmp d2689 <single_span+0x19>
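To make the "After" listing easier to read, here is a rough Rust model of what the inlined code appears to do for each of the two items it writes. It assumes only what is visible in the disassembly: 32-byte slots (shl $0x5), a fresh block whenever the index masked with 0xffff is zero (the enqueue_begin_alloc call), a tag byte at offset 0, the rdtsc value at offset 1 and, for the first item, the LOC address at offset 9. The Producer type and its methods are hypothetical and are not Tracy's implementation.

```rust
const SLOT_SIZE: usize = 32; // `shl $0x5` in the listing
const BLOCK_SLOTS: usize = 1 << 16; // `and $0xffff` in the listing

/// Hypothetical single-threaded model of the per-thread producer.
pub struct Producer {
    pub tail: u64,
    pub block: Vec<u8>,
    pub blocks_allocated: usize,
}

impl Producer {
    pub fn new() -> Self {
        Producer { tail: 0, block: Vec::new(), blocks_allocated: 0 }
    }

    /// Writes one item: tag, timestamp and an optional source location address.
    pub fn push(&mut self, tag: u8, time: u64, srcloc: Option<u64>) {
        let slot = (self.tail & 0xffff) as usize;
        if slot == 0 {
            // Slow path, taken once per 65536 items. A real queue would keep
            // the old block for the consumer; this sketch just replaces it.
            self.block = vec![0u8; SLOT_SIZE * BLOCK_SLOTS];
            self.blocks_allocated += 1;
        }
        let off = slot * SLOT_SIZE;
        self.block[off] = tag;
        self.block[off + 1..off + 9].copy_from_slice(&time.to_ne_bytes());
        if let Some(loc) = srcloc {
            self.block[off + 9..off + 17].copy_from_slice(&loc.to_ne_bytes());
        }
        self.tail += 1;
    }

    /// Reads back an unaligned u64 at a byte offset within the current block.
    pub fn read_u64(&self, off: usize) -> u64 {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.block[off..off + 8]);
        u64::from_ne_bytes(bytes)
    }
}
```

In this model the fast path is a mask, a branch and a few stores, which matches the shape of the inlined code: the per-span work was already small, so removing two calls gains little in a tight loop.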
Right, I don't think there's much to change here. The CC/CXX are already configurable by the users, and CXXFLAGS for tracy-sys can be passed through TRACY_CLIENT_SYS_CXXFLAGS. The LTO configuration on the Rust side ought to be configured by the project that uses tracy, rather than within the tracy bindings here.
Not sure why anyone would go through all this effort, given the negligible improvements.
As mentioned in my TRACY_NO_VERIFY PR - I have been exploring alternate methods for writing spans to Tracy other than via their C API and the C bindings provided by this library.
My current approach to this has been to git-submodule Tracy and use cxx to create a new C API from Tracy's public C++ API. My main motivation for this is to address a couple of drawbacks of the current approach:
Now, while it is perfectly viable for me to continue with this approach, it may be nice to centralise the binding of Tracy in tracy-client-sys. Additionally, I can imagine users potentially wanting to mix and match features from tracy-client/tracy-client-sys with span logic I've built in my own crate; doing this without a common dependency on Tracy feels fragile. Thus I'm wondering if this project has any opinions on this matter? Perhaps we could separate libTracyClient into a new crate?