Suboptimal code-gen in the fundamental branchless-swap building block

Voultapher commented 2 years ago

The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3. This will of course depend on the hardware, but cmov seems to yield better results than the setl/setg-style code that is currently being produced, probably helped by needing 8 instead of 10 instructions.

Here is the current version:

And here is the version that produces cmov code:

C: https://godbolt.org/z/GrTvx1z8x (WIP, good code-gen only with clang/LLVM)
Rust: https://godbolt.org/z/9qnfY6h3v

I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.
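For readers without access to the linked snippets, the two styles under discussion can be sketched roughly as follows. This is an illustrative reconstruction, not the code from either repository: the function names `swap_if_index` and `swap_if_select` are hypothetical, the element type is fixed to `int`, and whether a given compiler actually emits setcc or cmov for either form depends on the compiler, version, and flags.

```c
#include <stdbool.h>

/* Index-based style: the comparison result is used as an array index.
   Compilers may lower this to a setcc-style sequence plus indexed loads. */
static inline void swap_if_index(int *pair)
{
    int x = pair[0] > pair[1];   /* 1 if the pair is out of order, else 0 */
    int y = !x;
    int tmp = pair[y];           /* the element that belongs in slot 1 */

    pair[0] = pair[x];           /* the element that belongs in slot 0 */
    pair[1] = tmp;
}

/* Select-based style: both values are loaded first, then chosen with
   ternaries, which a compiler may lower to cmov (x86-64) or csel (AArch64). */
static inline void swap_if_select(int *a, int *b)
{
    int va = *a;
    int vb = *b;
    bool should_swap = va > vb;

    *a = should_swap ? vb : va;
    *b = should_swap ? va : vb;
}
```

Both functions leave the pair in ascending order and perform the same stores whether or not a swap happens, so neither contains a data-dependent branch at the source level. Checking the generated assembly is the only reliable way to see which instructions were chosen.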

scandum commented 2 years ago

I looked into that in the past, but it doesn't produce good results on my own system. I'm not quite sure whether the code is at fault or the compiler.

Is there any definite consensus on the right way to perform branchless swaps?

Voultapher commented 2 years ago

I'm not sure there is consensus, but I saw a very significant speedup with cmov vs. setl/setge code on Zen3, Broadwell and Skylake; on Firestorm (M1), LLVM was already producing csel code for both versions. How did you test it? I ask because I notice you no_inline the comparison function, and that has disastrous effects on performance. This is where the difference to languages with template instantiation / monomorphization is most acute. If I understand correctly, you pull everything into the header, akin to header-only libraries. But even then, LTO should level the playing field, I guess.

scandum commented 2 years ago

When it comes to performance testing, I always uncomment this line in bench.c:

//#define cmp(a,b) (*(a) > *(b)) // uncomment for fast primitive comparisons

That allows a fair comparison against C++ sorts.
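The effect of such a macro can be sketched as below. This is a minimal illustration under assumptions, not an excerpt from bench.c: the switch `FAST_PRIMITIVE_CMP`, the typedef `CMPFUNC`, and the functions `cmp_int` and `count_descents` are hypothetical names chosen for the example.

```c
#include <stddef.h>

typedef int CMPFUNC(const void *a, const void *b);

/* qsort-style comparator, normally reached through a function pointer. */
static int cmp_int(const void *a, const void *b)
{
    const int x = *(const int *)a;
    const int y = *(const int *)b;

    return (x > y) - (x < y);
}

/* Hypothetical switch standing in for the commented-out #define above.
   A function-like macro only expands where `cmp` is followed by `(`,
   so the parameter declaration below is left untouched. */
#ifdef FAST_PRIMITIVE_CMP
#define cmp(a, b) (*(a) > *(b))
#endif

/* Counts adjacent pairs that are out of order. Without the macro every
   comparison is an indirect call; with it, the comparison is a plain
   integer compare that the compiler can inline and optimize. */
static size_t count_descents(const int *v, size_t n, CMPFUNC *cmp)
{
    size_t count = 0;

    for (size_t i = 1; i < n; i++)
        count += cmp(v + i - 1, v + i) > 0;

    return count;
}
```

Building once with and once without `-DFAST_PRIMITIVE_CMP` gives the same result from `count_descents`, but only the macro build lets the compiler see the comparison at the call site, which is roughly what a C++ template or a Rust generic gets by default.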

scandum commented 1 year ago

I took a closer look at this. As far as I can tell, overall branchless swap performance is worse for gcc and clang on my hardware.

Ideally, you get that cmov without too much hassle. The current branchless compilation situation is a royal mess.

In addition, clang performs horribly on most of my core algorithms, some code running 2x slower. Hopefully it's a simple fix.

scandum commented 1 year ago

@Voultapher

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md

Just saw your benchmark. I've recently released a fluxsort and quadsort update with compile-time optimizations for clang. Overall, quadsort should be the fastest sort for random when compiled with clang -O3 for smaller ranges.

I also added the quadsort_prim() and fluxsort_prim() functions so it's possible to benchmark 32/64 bit primitive integers and C strings with the same binary. The bench.c file contains an example for sorting C strings.

Pretty good overall performance for ipn_stable. Is the performance on rand % 2 purely from an exponential search in a galloping merge?
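For context, the exponential search used by galloping merges can be sketched as follows. This is a generic illustration, not code from ipn_stable or any sort named in this thread, and the function name `gallop_right` is hypothetical. On inputs with few distinct values, such as rand % 2, long stretches of equal keys can be skipped in a logarithmic number of comparisons.

```c
#include <stddef.h>

/* Returns how many leading elements of the sorted array v[0..n) are <= key.
   Phase 1 doubles the step until it overshoots; phase 2 binary-searches
   the remaining window. */
static size_t gallop_right(const int *v, size_t n, int key)
{
    size_t lo = 0;     /* invariant: every element of v[0..lo) is <= key */
    size_t step = 1;

    /* Exponential phase: probe at growing offsets from lo. */
    while (lo + step <= n && v[lo + step - 1] <= key) {
        lo += step;
        step *= 2;
    }

    /* The answer now lies in [lo, hi]. */
    size_t hi = lo + step - 1;
    if (hi > n)
        hi = n;

    /* Binary phase. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (v[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

A merge would call this to find how many elements of one run can be copied in bulk before the next element of the other run has to be taken.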

I ran a benchmark of my own using rhsort's benchmark compiled with clang -O3. This suggests most of the performance gain in Rust is from branchless ternary operations, though timsort does quite well on long variable runs.

data table

| Name | Items | Type | Best | Average | Loops | Samples | Distribution |
| --------- | -------- | ---- | -------- | -------- | --------- | ------- | ---------------- |
| quadsort | 131072 | 32 | 0.002134 | 0.002152 | 0 | 100 | random order |
| fluxsort | 131072 | 32 | 0.002464 | 0.002502 | 0 | 100 | random order |
| glidesort | 131072 | 32 | 0.002999 | 0.003017 | 0 | 100 | random order |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001709 | 0.001733 | 0 | 100 | random % 100 |
| fluxsort | 131072 | 32 | 0.000902 | 0.000908 | 0 | 100 | random % 100 |
| glidesort | 131072 | 32 | 0.001011 | 0.001035 | 0 | 100 | random % 100 |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000061 | 0.000062 | 0 | 100 | ascending order |
| fluxsort | 131072 | 32 | 0.000058 | 0.000059 | 0 | 100 | ascending order |
| glidesort | 131072 | 32 | 0.000091 | 0.000092 | 0 | 100 | ascending order |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000335 | 0.000349 | 0 | 100 | ascending saw |
| fluxsort | 131072 | 32 | 0.000334 | 0.000339 | 0 | 100 | ascending saw |
| glidesort | 131072 | 32 | 0.000346 | 0.000356 | 0 | 100 | ascending saw |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000231 | 0.000242 | 0 | 100 | pipe organ |
| fluxsort | 131072 | 32 | 0.000222 | 0.000229 | 0 | 100 | pipe organ |
| glidesort | 131072 | 32 | 0.000229 | 0.000239 | 0 | 100 | pipe organ |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000073 | 0.000081 | 0 | 100 | descending order |
| fluxsort | 131072 | 32 | 0.000073 | 0.000082 | 0 | 100 | descending order |
| glidesort | 131072 | 32 | 0.000105 | 0.000109 | 0 | 100 | descending order |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000366 | 0.000369 | 0 | 100 | descending saw |
| fluxsort | 131072 | 32 | 0.000348 | 0.000354 | 0 | 100 | descending saw |
| glidesort | 131072 | 32 | 0.000357 | 0.000361 | 0 | 100 | descending saw |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.000687 | 0.000702 | 0 | 100 | random tail |
| fluxsort | 131072 | 32 | 0.000792 | 0.000819 | 0 | 100 | random tail |
| glidesort | 131072 | 32 | 0.000939 | 0.000970 | 0 | 100 | random tail |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001177 | 0.001200 | 0 | 100 | random half |
| fluxsort | 131072 | 32 | 0.001384 | 0.001401 | 0 | 100 | random half |
| glidesort | 131072 | 32 | 0.001625 | 0.001652 | 0 | 100 | random half |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001643 | 0.001686 | 0 | 100 | ascending tiles |
| fluxsort | 131072 | 32 | 0.000579 | 0.000590 | 0 | 100 | ascending tiles |
| glidesort | 131072 | 32 | 0.002516 | 0.002543 | 0 | 100 | ascending tiles |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.002184 | 0.002199 | 0 | 100 | bit reversal |
| fluxsort | 131072 | 32 | 0.002223 | 0.002257 | 0 | 100 | bit reversal |
| glidesort | 131072 | 32 | 0.002735 | 0.002765 | 0 | 100 | bit reversal |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001456 | 0.001474 | 0 | 100 | random % 2 |
| fluxsort | 131072 | 32 | 0.000359 | 0.000364 | 0 | 100 | random % 2 |
| glidesort | 131072 | 32 | 0.000443 | 0.000464 | 0 | 100 | random % 2 |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001332 | 0.001362 | 0 | 100 | signal |
| fluxsort | 131072 | 32 | 0.001587 | 0.001602 | 0 | 100 | signal |
| glidesort | 131072 | 32 | 0.003688 | 0.003711 | 0 | 100 | signal |
| | | | | | | | |
| quadsort | 131072 | 32 | 0.001923 | 0.001947 | 0 | 100 | exponential |
| fluxsort | 131072 | 32 | 0.001281 | 0.001291 | 0 | 100 | exponential |
| glidesort | 131072 | 32 | 0.002313 | 0.002335 | 0 | 100 | exponential |
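The phrase "branchless ternary operations" refers to merge loops where the element to emit is selected with a ternary rather than an if/else. A minimal sketch of the idea is shown below. It is not taken from quadsort, fluxsort, or glidesort; the function name `merge_branchless` is hypothetical and the element type is fixed to `int`.

```c
#include <stddef.h>

/* Merges sorted left[0..nl) and right[0..nr) into dst, which must have room
   for nl + nr elements. The selection inside the main loop has no if/else,
   so a compiler may lower it to conditional moves; the loop condition
   itself is still an ordinary, well-predicted branch. */
static void merge_branchless(const int *left, size_t nl,
                             const int *right, size_t nr, int *dst)
{
    size_t i = 0, j = 0;

    while (i < nl && j < nr) {
        int take_left = left[i] <= right[j];   /* <= keeps the merge stable */

        *dst++ = take_left ? left[i] : right[j];
        i += (size_t)take_left;                /* advance exactly one side */
        j += (size_t)!take_left;
    }

    /* Copy whatever remains of either run. */
    while (i < nl)
        *dst++ = left[i++];
    while (j < nr)
        *dst++ = right[j++];
}
```

On random input the outcome of `left[i] <= right[j]` is close to a coin flip, which is where avoiding a data-dependent branch tends to matter most. On long presorted runs the comparison is predictable, and a branchy or galloping merge can do as well or better.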
