judofyr / spice

Fine-grained parallelism with sub-nanosecond overhead in Zig
BSD Zero Clause License
748 stars, 13 forks

Rust baseline compiled assembly difference #6

Open rpjohnst opened 3 months ago

rpjohnst commented 3 months ago

From the README:

(It's not entirely clear why the Zig baseline implementation is twice as fast as the Rust implementation. The compiled assembly (godbolt) shows that Rust saves five registers on the stack while Zig only saves three, but why? For the purpose of this benchmark it shouldn't matter since we're only comparing against the baseline of each language.)

The difference is that the linked Zig program produces an internal LLVM function, which can call itself directly, while the Rust program produces a non-internal LLVM function, which calls itself through the GOT. If you mark the Rust function non-pub and call it from a pub function (like the Zig main), you will get essentially the same assembly: https://godbolt.org/z/x73v9zKb9
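To illustrate the shape of that change, here is a minimal sketch. It is not the exact code behind the Godbolt link: the `Node` layout and the names `sum_inner` and `sum` are assumptions made for illustration, loosely modeled on a recursive tree sum. The recursive function is private, so the compiler is free to give it internal linkage and have the recursion call it directly, while a thin `pub` wrapper remains the only exported entry point.

```rust
/// Hypothetical binary tree node (names and layout are illustrative only).
pub struct Node {
    pub value: i64,
    pub left: Option<Box<Node>>,
    pub right: Option<Box<Node>>,
}

// Private recursive worker. Because it is not `pub`, it does not need to be
// exported from the crate, so the recursive calls can target it directly.
fn sum_inner(node: &Node) -> i64 {
    let mut total = node.value;
    if let Some(left) = &node.left {
        total += sum_inner(left);
    }
    if let Some(right) = &node.right {
        total += sum_inner(right);
    }
    total
}

// Public entry point, playing the role of the Zig `main` in the comparison:
// it is the only exported symbol and simply forwards to the private worker.
pub fn sum(node: &Node) -> i64 {
    sum_inner(node)
}
```

Whether the private function really ends up internal, or gets inlined into the wrapper, still depends on the crate type and optimization level, so it is worth checking the generated assembly rather than assuming.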

judofyr commented 3 months ago

Oooh, that's interesting! Let's see if this has an impact on the benchmark results as well. Initially I had the function in the same file as the benchmark, but I can't remember if it was pub or not. I'll see if I get different results by putting it directly inside the benchmark file and marking it non-pub.

I guess the overhead between pub and non-pub is very small since they still follow the optimized calling convention? Is there any documentation around this somewhere I could read up on?

rpjohnst commented 3 months ago

It looks like even pub items can become internal when the final binary artifact is linked the right way. Godbolt probably just happens to be configured to produce an artifact type meant to be dynamically link(able/ed). For example, on my machine (macOS) a pub function in a binary/executable crate type gets generated as internal.

I would not expect to find a lot of docs on this, because rustc seems to just choose the best linkage for the artifact type it's generating, and this can change wildly based on all kinds of factors: which OS, where in the dependency graph the function is compiled (which itself depends on whether it is generic, inlinable, the optimization level, flags like -Z share-generics, etc.), static vs dynamic linking, and so on.