Open Voultapher opened 2 years ago
I looked into that in the past, but it doesn't produce good results on my own system. I'm not quite sure whether the code is at fault or the compiler.
Is there any definite consensus on the right way to perform branchless swaps?
I'm not sure there is consensus, but I saw a very significant speedup with cmov vs setl/ge code on Zen3, Broadwell, Skylake and on Firestorm (M1) LLVM was already producing csel code for both versions. How did you test it? Because I notice you no_inline the comparison function and that has disastrous effects on performance. Here the difference to languages with template instantiation / monomorphization is the most acute. If I understand correctly you pull in everything into the header, akin to header only libraries. But even then LTO should level the playing field I guess.
When it comes to performance testing I always uncomment this line in bench.c
//#define cmp(a,b) (*(a) > *(b)) // uncomment for fast primitive comparisons
That allows a fair comparison against c++ sorts.
I took a closer look at this. As far as I can tell, overall branchless swap performance is worse for gcc and clang on my hardware.
Ideally, you get that cmov without too much hassle. The current branchless compilation situation is a royal mess.
In addition, clang performs horribly on most of my core algorithms, some code running 2x slower. Hopefully it's a simple fix.
@Voultapher
https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md
Just saw your benchmark. I've recently released a fluxsort and quadsort update with compile-time optimizations for clang. Overall, quadsort should be the fastest sort for random when compiled with clang -O3 for smaller ranges.
I also added the quadsort_prim() and fluxsort_prim() functions so it's possible to benchmark 32/64 bit primitive integers and C strings with the same binary. The bench.c file contains an example for sorting C strings.
Pretty good overall performance for ipn_stable, is the performance on rand % 2
purely from an exponential search in a galloping merge?
I ran a benchmark of my own using rhsort's benchmark compiled with clang -O3. This suggests most of the performance gain on rust is from branchless ternary operations, though timsort does quite well on long variable runs.
The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3, this will of course depend on the the hardware, but cmov seems to yield better results than setl/setg style code that is currently being produced. Probably helped by doing 8 instead of 10 instructions.
Here is the current version:
And here is the version that produces cmov code:
I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.