ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Use poolstl to sort randstrobes in parallel (attempt two) #386

Closed (marcelm closed this 4 months ago)

marcelm commented 5 months ago

We tried this before in #375, but one reason for closing the PR was that it was not possible to use a custom sort function (pdqsort_branchless in our case).

@alugowski has now added a pluggable_sort function, so we can try again.

Here are the sorting runtimes and memory usages for CHM13:

| configuration | wall-clock time (sorting only) | max. memory |
|---|---|---|
| pdqsort_branchless (main) | 33 s | 14.3 GB |
| pluggable_sort, 1 thread | 33 s | 14.3 GB |
| pluggable_sort, 2 threads | 24 s | 16.8 GB |
| pluggable_sort, 4 threads | 21 s | 16.8 GB |
| pluggable_sort, 8 threads | 21 s | 16.8 GB |

This is actually the first time I’m able to index CHM13 in under one minute (total indexing time) on my (home) machine, which is quite nice. However, memory usage still increases quite a bit.

@alugowski Is the extra memory usage expected?

marcelm commented 5 months ago

Hm, I guess the answer may be that std::inplace_merge uses extra memory.

marcelm commented 5 months ago

Ah, C++'s std::inplace_merge is not actually "in place" (or at least not in the way I would have understood it).
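For reference, a minimal illustration of what it does (my own sketch):

```c++
#include <algorithm>
#include <vector>

int main() {
    // Two consecutive sorted halves in one vector.
    std::vector<int> v{1, 3, 5, 2, 4, 6};
    // Merges [begin, begin+3) and [begin+3, end) into one sorted range.
    // Despite the name, implementations are allowed to allocate a
    // temporary buffer and typically do; only if that allocation fails
    // do they fall back to a slower, truly in-place algorithm.
    std::inplace_merge(v.begin(), v.begin() + 3, v.end());
    // v is now {1, 2, 3, 4, 5, 6}
}
```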

alugowski commented 5 months ago

You're right, I also didn't realize how much memory inplace_merge might actually allocate. Is that too much for your application?

pluggable_sort also lets you replace inplace_merge with another implementation. I see papers about algorithms that use less temporary space, but no ready-to-try code. Just for fun I tried a no-op merge to see which step dominates, and in my naive benchmark the merges account for over half of the runtime.

The apparent limit around 21 s is likely memory bandwidth. This can be confirmed by testing with an artificially computationally expensive comparator, basically a std::less with a busy loop. The runtimes would obviously be longer, but you'd see scaling, as memory wouldn't be the bottleneck anymore.
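Something along these lines (untested sketch; the loop bound is arbitrary):

```c++
// A drop-in replacement for std::less that burns cycles on every
// comparison, so compute rather than memory bandwidth becomes the
// bottleneck.
template <class T>
struct busy_less {
    bool operator()(const T& a, const T& b) const {
        for (volatile int i = 0; i < 1000; ++i) {
            // busy loop; volatile keeps it from being optimized away
        }
        return a < b;
    }
};
```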

marcelm commented 5 months ago

> You're right, I also didn't realize how much memory inplace_merge might actually allocate. Is that too much for your application?

It probably is. We’ve successfully been reducing the program’s memory usage over the past few releases, and this would be a step backwards. The ~12 seconds we save here are spent just once at program startup, and a typical run then takes at least a couple of minutes, often tens of minutes, so it’s only a small fraction of the total. Increased memory usage, on the other hand, means that some datasets could no longer be analyzed on a machine with a given amount of RAM.

The program creates an index data structure for a genome. The 14.3 GB vs. 16.8 GB I reported above are the memory usage for a particular human genome (CHM13) that we use for testing. We want strobealign to work with much bigger genomes, where the absolute increase would be much higher.

> pluggable_sort also lets you replace inplace_merge with another implementation. I see papers about algorithms that use less temporary space, but no ready-to-try code. Just for fun I tried a no-op merge to see which step dominates, and in my naive benchmark the merges account for over half of the runtime.

> The apparent limit around 21 s is likely memory bandwidth. This can be confirmed by testing with an artificially computationally expensive comparator, basically a std::less with a busy loop. The runtimes would obviously be longer, but you'd see scaling, as memory wouldn't be the bottleneck anymore.

Interesting. I probably don’t have time to do this experiment, but will keep it in mind.

I’ll leave the PR open until the owner of the repo @ksahlin has had time to comment (I know he’s busy at the moment).

alugowski commented 5 months ago

I went down an inplace_merge rabbit hole and found something that seems usable. This SO post references an implementation of the "Practical In-Place Merging" paper. That code only supports merging ranges of equal sizes, but pluggable_sort emits ranges as close to even in size as possible, and a naive adapter seems to work well. On my simple multi-gigabyte sort benchmark, pdqsort plus this adapter is faster than the plain poolSTL sort, and if the dataset is large enough to cause memory swapping, it's the outright fastest method (beating pdqsort + std::inplace_merge).

I'm curious if you'd see the same benefits on your dataset.

```c++
#include "thirdparty/InplaceMerge.hh"

/**
 * Like `std::inplace_merge`, but uses the buffer-free algorithm described by Huang
 * and Langston in "Practical In-Place Merging", Communications of the ACM, 1988, http://dx.doi.org/10.1145/42392.42403
 * Generalizes the implementation by Keith Schwarz: http://keithschwarz.com/interesting/code/?dir=inplace-merge
 *
 * Schwarz's implementation only supports merging two ranges of equal size. This adapter also handles cases where the
 * ranges are slightly different sizes, as happens when used as the merge step for a general-purpose sort.
 * Drastically different sizes effectively fall back to std::inplace_merge, which is not buffer-free.
 */
template<class RandIt, class Compare>
void adapted_pipm_inplace_merge(RandIt first, RandIt mid, RandIt last, Compare comp) {
    auto left_size = std::distance(first, mid);
    auto right_size = std::distance(mid, last);

    if (left_size == right_size) {
        InplaceMerge(first, last, comp);
    } else if (left_size < right_size) {
        // Merge the equal-sized prefix buffer-free, then fold in the few
        // leftover elements with std::inplace_merge.
        auto extra = right_size - left_size;
        InplaceMerge(first, last - extra, comp);
        std::inplace_merge(first, last - extra, last, comp);
    } else {
        // left_size > right_size: mirror image of the case above
        auto extra = left_size - right_size;
        InplaceMerge(first + extra, last, comp);
        std::inplace_merge(first, first + extra, last, comp);
    }
}
```

Again, original InplaceMerge.hh is here: http://keithschwarz.com/interesting/code/?dir=inplace-merge
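To sanity-check the adapter on unequal halves, something like this works (just a sketch, not part of the PR; assumes the adapter above is in scope):

```c++
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

int main() {
    // Two sorted halves of unequal sizes (3 and 4).
    std::vector<int> v{1, 4, 7, 2, 3, 5, 6};
    adapted_pipm_inplace_merge(v.begin(), v.begin() + 3, v.end(), std::less<>{});
    assert(std::is_sorted(v.begin(), v.end()));  // {1, 2, 3, 4, 5, 6, 7}
}
```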

Usage: just add it as the last argument to pluggable_sort:

```c++
poolstl::pluggable_sort(poolstl::par.on(pool), randstrobes.begin(), randstrobes.end(), pdqsort_branchless, adapted_pipm_inplace_merge);
```

I also tried the bufferless in-place merges from several libraries, such as wikisort, libc++, libstdc++, and cppsort's symmerge implementation. All of them are significantly slower than the above; it seems they have the bufferless fallback but don't expect anyone to actually need it.

Also note that std::inplace_merge supposedly uses the additional buffer only if there's memory for it, and falls back to the slower bufferless algorithm otherwise, so you should be able to use it even for larger datasets. I agree with you, though: I'd also prefer the measured memory usage to be what I expect rather than depend on that fallback behavior.

marcelm commented 5 months ago

Cool, thanks for this! I had also found that InplaceMerge.hh file during my searching, but would not have been able to adapt it for use with pluggable_sort.

I tested this in commit 699cbe1. The good news is that memory usage indeed remains unchanged no matter how many threads are used. The bad news is that it seems to become slower with more threads:

| threads | wall-clock time (sorting only) |
|---|---|
| 1 | 30.5 s |
| 2 | 29.7 s |
| 4 | 33.2 s |
| 8 | 39.0 s |

I’m not sure what is going on (given that you see a speedup).

> Also note that std::inplace_merge supposedly uses the additional buffer only if there's memory for it, and falls back to the slower bufferless algorithm otherwise, so you should be able to use it even for larger datasets. I agree with you, though: I'd also prefer the measured memory usage to be what I expect rather than depend on that fallback behavior.

Yes, the "if there’s memory for it" part is way too vague. I believe this cannot be reliably detected.

alugowski commented 5 months ago

Good news! I added a quicksort implementation that appears to work a lot better than the mergesort we've been trying. Unlike merging, partitioning is both fast and does not need extra memory.

In my benchmark (a vector of randomly shuffled integers) this is noticeably faster than the pdqsort-based mergesort we were working on earlier.

I've made it the default algorithm for pluggable_sort, which no longer accepts a merge method argument. That previous version is retained as pluggable_mergesort.
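For intuition, the idea looks roughly like this (a simplified sketch using std::async, not poolSTL's actual implementation):

```c++
#include <algorithm>
#include <future>

// Simplified parallel quicksort: partition in place, then sort the two
// sides as independent tasks. No merge step, no temporary buffer.
template <class RandIt>
void parallel_quicksort(RandIt first, RandIt last, int depth = 3) {
    if (last - first < 2) {
        return;
    }
    if (depth == 0) {
        std::sort(first, last);  // leaf: any sequential sort, e.g. pdqsort
        return;
    }
    auto pivot = *(first + (last - first) / 2);  // copy before elements move
    // Three-way partition: [first, mid1) < pivot, [mid1, mid2) == pivot,
    // [mid2, last) > pivot. The middle block is never empty, so the
    // recursion always makes progress.
    auto mid1 = std::partition(first, last,
                               [&](const auto& x) { return x < pivot; });
    auto mid2 = std::partition(mid1, last,
                               [&](const auto& x) { return !(pivot < x); });
    auto left = std::async(std::launch::async,
                           [=] { parallel_quicksort(first, mid1, depth - 1); });
    parallel_quicksort(mid2, last, depth - 1);
    left.wait();
}
```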

So back to this usage:

```c++
poolstl::pluggable_sort(poolstl::par.on(pool), randstrobes.begin(), randstrobes.end(), pdqsort_branchless);
```

(If you want to be explicit, this is now effectively poolstl::pluggable_quicksort, which also allows overriding the partitioning and pivot-finding methods; the defaults should be fine unless you need a stable sort.)

I hope this gives you a speedup!

marcelm commented 5 months ago

Awesome! I hadn’t expected you to continue working on this, but it indeed gives us an impressive speedup (without increasing memory usage)!

| threads | wall-clock time (sorting only) | wall-clock time (overall) |
|---|---|---|
| 1 | 32.8 s | 151 s |
| 2 | 20.3 s | 88 s |
| 4 | 15.7 s | 57 s |
| 8 | 14.8 s | 47 s |

I’m going to polish the PR a bit so that it can be merged.

Can I ask one more thing?

There seems to be some small overhead now when calling pluggable_sort with a single thread. Runtime when calling pdqsort_branchless directly is ~31.3 s, but going via pluggable_sort it is now ~33 s. I could add an if (threads == 1) branch to avoid the tiny slowdown, but it would be nice if pluggable_sort recognized this case itself.
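The branch would be simple enough, roughly (sketch only, not committed):

```c++
// Skip the thread pool entirely when only one thread is requested.
if (threads == 1) {
    pdqsort_branchless(randstrobes.begin(), randstrobes.end());
} else {
    poolstl::pluggable_sort(poolstl::par.on(pool), randstrobes.begin(),
                            randstrobes.end(), pdqsort_branchless);
}
```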

I’m really impressed, thanks again!

ksahlin commented 5 months ago

Great, since it now offers a speed-up without increasing memory usage, I am all for it.

alugowski commented 5 months ago

Great! I'm glad it's working well for you!

> There seems to be some small overhead now when calling pluggable_sort with a single thread.

Good point, that is an easy fix. I'll try to get it done tomorrow.

alugowski commented 5 months ago

The p=1 slowdown is fixed.

I also saw an opportunity to parallelize the partitioning more. This is showing an extra ~20% speedup in my benchmark.

marcelm commented 5 months ago

Thanks (and sorry for the delay in responding)! The p=1 slowdown is indeed gone. I am not sure I see an additional speedup from the new partitioning, but the variance in my measurements is a bit high anyway.

Annoyingly, I realized there’s one more problem that I hadn’t thought about before: The order of elements in the sorted vector depends on the number of threads, which means that we don’t get reproducible results. This is quite important in our domain.

The easiest solution would be to hard-code the number of threads for the sorting step (perhaps to 8), but overriding the user’s choice of thread count wouldn’t be a good experience. (Or maybe it doesn’t matter?) We would at least need to document it well.

Hm, or perhaps we could use a constant number of threads for sorting, but restrict the number of CPU cores used with sched_setaffinity. So while 8 (or so) threads would always be running, they would only use as many CPU cores as specified by the user.
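Something like this (untested, Linux-only sketch):

```c++
#include <sched.h>

// Pin the calling process to the first n CPUs so that a fixed-size pool
// of sorting threads can occupy only as many cores as the user requested.
// Threads created afterwards inherit this affinity mask.
void restrict_to_cpus(int n) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int i = 0; i < n; ++i) {
        CPU_SET(i, &cpus);
    }
    sched_setaffinity(0, sizeof(cpus), &cpus);  // pid 0 = calling thread
}
```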

alugowski commented 5 months ago

Oooh, that's playing with fire. Using a fixed pool size should do it, but I'm using std::partition, which makes no determinism guarantees (though it is almost certainly deterministic). You could use pluggable_quicksort with std::stable_partition instead. But it's nasty, doubly so if you start messing with thread affinities.

If you're able to identify elements that have changed order, does that mean you could impart an order on equal elements? Like a secondary key?

If my quick glance is correct and this is your comparator, could something like the following branchless comparator, using position as a secondary key, work?

```c++
bool branchless_compare(const RefRandstrobe& lhs, const RefRandstrobe& rhs) {
    // Compare by hash; fall back to position only when the hashes are
    // equal. Written with & and | instead of && and || so the compiler
    // doesn't have to emit branches.
    bool hash_eq = (lhs.hash == rhs.hash);
    return ((lhs.hash < rhs.hash) & !hash_eq) | ((lhs.position < rhs.position) & hash_eq);
}
```

I haven't tested that. If it's too ugly, try std::tie; I would not be surprised if both clang and gcc emit something branchless once compiled with optimizations.
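The std::tie version would look something like this (also untested; same field names as above):

```c++
#include <tuple>

// Lexicographic comparison on (hash, position): position breaks ties
// between equal hashes.
bool tie_compare(const RefRandstrobe& lhs, const RefRandstrobe& rhs) {
    return std::tie(lhs.hash, lhs.position) < std::tie(rhs.hash, rhs.position);
}
```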

marcelm commented 4 months ago

Thanks once again for your advice! We actually had std::tie in the comparison function, but I removed it in 26b1992c0899b367099684a326cf927d8e0f52f2 to speed up sorting. You have a good point that unstable sorting is not necessarily deterministic, so comparing more than just the hash would be good for reproducibility in any case. Even more important, of course, is that it lets us have both parallel sorting and reproducible results.

The advice to use a branchless comparison is great. I tried your branchless_compare and it does better than std::tie, but single-threaded sorting time still increased quite a bit, from 32.4 to 42.4 seconds. I wasn’t quite ready to accept that, so I tried a version using 128-bit integers:

```c++
bool operator< (const RefRandstrobe& other) const {
    // Pack (hash, position) into a single 128-bit key so that one
    // comparison yields the same lexicographic order as comparing the
    // two fields one after the other.
    __uint128_t lhs = (static_cast<__uint128_t>(hash) << 64) | position;
    __uint128_t rhs = (static_cast<__uint128_t>(other.hash) << 64) | other.position;
    return lhs < rhs;
}
```

and this increases sorting time to just 35 seconds, which is quite acceptable given that we’ll be sorting with multiple threads most of the time anyway.

Here’s the disassembly of the three comparison functions: https://godbolt.org/z/TPYvT5ocK. The std::tie version is unfortunately not branchless, which of course explains its slowness.

I’ll read through that article you linked. We may be able to apply the techniques to some other code that we have in inner loops.

Runtimes for sorting with one thread:

| comparison | time |
|---|---|
| by hash only (what we had before) | 32 s |
| by hash and position, using std::tie | 45 s |
| by hash and position, using branchless_compare | 42 s |
| by hash and position, using a __uint128_t | 35 s |

Final runtimes, measured on an AMD CPU with four cores and hyperthreading:

| threads | sorting time | index creation time | previous index creation time |
|---|---|---|---|
| 1 | 35 s | 154 s | 148 s |
| 2 | 25 s | 95 s | 99 s |
| 4 | 16 s | 60 s | 75 s |
| 8 | 15 s | 48 s | 64 s |

Thanks again for reaching out! I’m going to merge this as soon as CI passes.