llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Switch to in-place std::stable_sort #61459

Open higher-performance opened 1 year ago

higher-performance commented 1 year ago

Hi, I just wanted to suggest switching to an in-place std::stable_sort implementation - one candidate for this being WikiSort.

These are relatively modern algorithms designed to asymptotically minimize the number of comparisons, swaps, etc. while avoiding dynamic memory allocation, and it would be great to utilize them.

philnik777 commented 1 year ago

If you want, feel free to provide a patch. It's not super likely that any regular contributor will work on this, since there is a lot of other stuff to do in the library and this is a lot of work.

higher-performance commented 1 year ago

I could consider working on it (slowly), but is the implementation the only obstacle here? Are there e.g. licensing or other concerns (readability, template costs, etc.) that might need to be sorted (heh) out first?
Note WikiSort uses the "Unlicense", and I'm not sure if LLVM finds that acceptable or not.

philnik777 commented 1 year ago

I'm not a lawyer, but looking at the license it seems that you are free to re-license it. If you want to be sure you could also ask @BonzaiThePenguin to give you the source under the LLVM license. Although it should also be fine without that if you implement it yourself (which you probably have to do anyways), since the algorithm itself can't be protected under any copyright law that I know of, and I assume that it's not patented.

higher-performance commented 1 year ago

Oh, I would definitely not have the bandwidth to try to implement this myself - if I did work on this I would just use the copy that's already there.

Hmm okay, let me look into it. If it doesn't look too painful I might give it a try, but no guarantees I guess -- if anyone reading this is interested feel free to post here and take a stab as well.

philnik777 commented 1 year ago

Looking through the blame, @Morwenn also worked on it quite a bit.

Morwenn commented 1 year ago

I merely did some mechanical transformations to make it a bit more versatile; I don't actually understand how the algorithm works. I can only testify that the algorithm is rather complicated and that the original implementation mostly consists of one huge function. If you want an O(n log n) stable sort that works with O(1) extra memory and happens to have a C++ implementation, there's also GrailSort, but the original implementation isn't extremely understandable either.

higher-performance commented 1 year ago

Interesting, thanks. One thing I do know is that some of the algorithms in this area don't minimize comparisons or assignments. GrailSort seems to be one of those from a quick glance at the paper (though I'm not certain) whereas the nice thing about the ratio-based sort (as well as some others like this) is that it's asymptotically optimal on both of those fronts. For large/slow-to-compare objects I imagine these can make a difference.

asl commented 1 year ago

I guess the main question here is when / where we'd benefit from different sort implementation. Is there a bottleneck somewhere? Or at least sort appearing somewhere on profile?

philnik777 commented 1 year ago

> I guess the main question here is when / where we'd benefit from different sort implementation. Is there a bottleneck somewhere? Or at least sort appearing somewhere on profile?

I don't think that's the interesting question here. This is the standard library; someone will benefit from improving the performance of an algorithm, even if you yourself don't see a huge improvement. You shouldn't have to reach for another library to get the best possible performance (assuming its API is comparable to the standard library's).

Morwenn commented 1 year ago

I have some very basic benchmarks here, certainly not perfect and extremely incomplete, but they can help to get an idea of how the different stable sorting algorithms I have at hand perform:

[Graph: benchmark of stable sorts over different patterns for std::vector<double>]
[Graph: benchmark of stable sorts over different patterns for std::deque<double>]

A few keys to understanding the graphs better:

I have additional graphs comparing these algorithms when no heap memory is available. I forgot to include std::stable_sort, but my mergesort relies on an inplace_merge implementation taken from libc++, so std::stable_sort should show similar results.

The existing C++ implementations of WikiSort and GrailSort do perform much better than std::stable_sort when no heap memory is available in my very limited benchmarks, but they're far from performing as well when heap memory is available (which I would consider to be the standard scenario), and they're both much more complicated.

All in all I would rather look in the direction of spinsort - which is some kind of optimized mergesort if I'm not mistaken - to check whether it's possible to borrow some of its optimizations and retrofit them into the existing std::stable_sort implementation while gracefully degrading to O(n log² n) when less heap memory is available. The current implementation of spinsort unconditionally allocates memory for n elements, so it can't be used as-is without being modified.
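
For illustration only (this is not libc++'s or spinsort's actual code, and stable_sort_fallback is a made-up name), the overall shape of such a "merge through a buffer if you can, degrade gracefully if you can't" stable sort might look roughly like this; it deliberately ignores the exception-safety and default-constructibility issues discussed further down, and a real implementation would allocate the buffer once rather than per recursion level:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <memory>
    #include <new>

    template <class Iter, class Compare>
    void stable_sort_fallback(Iter first, Iter last, Compare comp) {
        using T = typename std::iterator_traits<Iter>::value_type;
        auto n = static_cast<std::size_t>(std::distance(first, last));
        if (n < 2)
            return;

        Iter mid = std::next(first, n / 2);
        stable_sort_fallback(first, mid, comp);
        stable_sort_fallback(mid, last, comp);

        // Try to get a merge buffer; this simplified sketch requires a
        // default-constructible T (the thread below discusses dropping that).
        std::unique_ptr<T[]> buf(new (std::nothrow) T[n]);
        if (buf) {
            // Buffered path: a single O(n) merge pass through the buffer.
            std::merge(std::make_move_iterator(first), std::make_move_iterator(mid),
                       std::make_move_iterator(mid), std::make_move_iterator(last),
                       buf.get(), comp);
            std::move(buf.get(), buf.get() + n, first);
        } else {
            // Degraded path: without a buffer this merge step is O(n log n),
            // which makes the whole sort O(n log² n).
            std::inplace_merge(first, mid, last, comp);
        }
    }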

Morwenn commented 1 year ago

Additionally I know that people are working on other block sorts that reuse ideas similar to those used in WikiSort and GrailSort, taking them to another level, but I don't think there's any widely available documentation about those, let alone production-ready implementations around. So while they might be promising, they're more in the realm of research projects right now and I've got no idea whether they can even compete with std::stable_sort when additional heap memory is available.

higher-performance commented 1 year ago

double is a nice starting point, but it can be misleading for real-world usage. That's because the selling point of these sorts isn't just the memory usage - it's the minimization of swaps & comparisons as well. That kind of thing won't show up in a benchmark of a lightweight type like double with the default comparison - for those cases, you'd probably want to fall back on a faster implementation whenever the comparator is the built-in one anyway. Instead, you'd want to benchmark larger types with more expensive comparisons that you can't just redirect to a faster implementation under the hood. (Edit: This is SLOW_COMPARISONS in their benchmark.)
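
As a rough illustration (SlowKey is a hypothetical type, not part of any of the benchmarks mentioned in this thread), a payload in that spirit makes comparison and move costs dominate, so an algorithm that performs fewer comparisons and moves can actually win even if it loses on plain double:

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    struct SlowKey {
        std::uint64_t key;
        std::uint64_t padding[15];  // 128-byte elements: moves are non-trivial

        static inline std::uint64_t comparisons = 0;

        friend bool operator<(const SlowKey& a, const SlowKey& b) {
            ++comparisons;
            // Artificially expensive comparison: run both keys through the same
            // bijective multiply-add chain, which still yields a strict total
            // order (equal keys stay equivalent).
            std::uint64_t x = a.key, y = b.key;
            for (int i = 0; i < 64; ++i) {
                x = x * 6364136223846793005u + 1442695040888963407u;
                y = y * 6364136223846793005u + 1442695040888963407u;
            }
            return x < y;
        }
    };

    int main() {
        std::vector<SlowKey> v(100'000);
        std::mt19937_64 rng(42);
        for (auto& e : v) e.key = rng();

        std::stable_sort(v.begin(), v.end());
        // SlowKey::comparisons now holds the comparison count for this run;
        // compare it against another stable sort on the same input.
    }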

higher-performance commented 1 year ago

Looks like we'd need to make some changes to the algorithm before using it - right now it seems to assume copyability + default-constructibility.

Morwenn commented 1 year ago

I've got a fork which solves both of those issues here: https://github.com/Morwenn/cpp-sort/blob/develop/include/cpp-sort/detail/wiki_sort.h

higher-performance commented 1 year ago

How do you avoid the default-constructibility requirement? When I look at the code, it seems to assume default constructibility:

    typename BufferProvider::template buffer<rvalue_type> cache(last - first);

    Wiki::sort(std::move(first), std::move(last),
               cache.begin(), cache.size(),
               std::move(compare), std::move(projection));

frederick-vs-ja commented 1 year ago

I guess we can use something like alignas(T) unsigned char buf[N * sizeof(T)]; and use reinterpret_cast<T(*)[N]>(buf) to access the implicitly created T[N] array in it - this should be valid since array types are implicit-lifetime types, and then default construction is avoided. Then we should use move construction instead of move assignment.
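
A more conventional placement-new sketch of the same idea (raw, correctly aligned storage, elements created by move construction and destroyed explicitly; the class and its members are made up for illustration, not proposed code):

    #include <cstddef>
    #include <new>
    #include <utility>

    template <class T, std::size_t N>
    struct scratch_buffer {
        alignas(T) unsigned char raw[N * sizeof(T)];

        // Pointer to slot i; only valid to use after an element has been
        // constructed there.
        T* ptr(std::size_t i) {
            return std::launder(reinterpret_cast<T*>(raw + i * sizeof(T)));
        }

        // Move-construct into slot i; move *assignment* would be wrong here,
        // because no T object lives in the slot yet.
        void emplace(std::size_t i, T&& value) {
            ::new (static_cast<void*>(raw + i * sizeof(T))) T(std::move(value));
        }

        void destroy(std::size_t i) { ptr(i)->~T(); }
    };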

higher-performance commented 1 year ago

It's certainly solvable, but it might not be as trivial as that since you need exception safety. In the best case scenario you could just use std::uninitialized_move, but I'm not sure if that's adequate given that interleaving is necessary in some places.

Morwenn commented 1 year ago

@higher-performance You're right, I forgot that this might be one of the few places where I don't handle it correctly, and there's probably UB around - my bad. The idea for the next breaking change was to use "relocations" instead of moves, i.e. move-constructing values into an uninitialized buffer when possible and destroying the elements at their original location. I haven't started to implement that strategy yet, so I don't know how feasible it actually is.
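
A minimal sketch of that relocation idea (illustrative helpers only, not cpp-sort code): move-construct the value into uninitialized storage, then destroy the source, so the source slot becomes raw storage that can be reused.

    #include <memory>
    #include <new>
    #include <type_traits>
    #include <utility>

    template <class T>
    void relocate_at(T* dest, T* src)
        noexcept(std::is_nothrow_move_constructible_v<T>) {
        ::new (static_cast<void*>(dest)) T(std::move(*src));
        src->~T();
    }

    // Range version built from the C++17 standard primitives.
    template <class T>
    T* relocate_range(T* first, T* last, T* dest) {
        T* out = std::uninitialized_move(first, last, dest);
        std::destroy(first, last);
        return out;
    }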

higher-performance commented 1 year ago

At first glance it looks feasible to me; the main thing to watch out for seems to be making sure that any additional bookkeeping for exception safety gets optimized out in the is_nothrow_move_constructible case. I'd have to try it and see to be sure, though.
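
A sketch of the kind of bookkeeping in question, around a phase that only move-constructs elements into the scratch buffer (no comparisons); purely illustrative, not a proposed patch. If the move constructor can throw, the already-constructed prefix is rolled back; if it cannot, the whole rollback branch (and the try/catch) disappears at compile time:

    #include <memory>
    #include <new>
    #include <type_traits>
    #include <utility>

    template <class T>
    void move_into_buffer(T* src_first, T* src_last, T* buf) {
        if constexpr (std::is_nothrow_move_constructible_v<T>) {
            // No bookkeeping needed in the common case.
            std::uninitialized_move(src_first, src_last, buf);
        } else {
            T* constructed_end = buf;
            try {
                for (T* p = src_first; p != src_last; ++p, ++constructed_end)
                    ::new (static_cast<void*>(constructed_end)) T(std::move(*p));
            } catch (...) {
                std::destroy(buf, constructed_end);  // undo the constructed prefix
                throw;
            }
        }
    }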