**Closed** (marcelm closed this 6 months ago)
I see. Good to test alternative approaches nevertheless.
For reference: the code for doing this is not super complicated (although I don’t understand everything yet): https://github.com/alugowski/poolSTL/blob/89fde73f0a8b818df45082ebdfca5abee1ec02c4/include/poolstl/internal/ttp_impl.hpp#L132

So maybe we could adjust it to use pdqsort.
Also, we already create the randstrobes for each chromosome in parallel, so we could also sort each chromosome in parallel. Then the only remaining thing to do serially would be to merge the chromosomes into one sorted vector. I’m not sure how to do this without using extra memory, though.
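One possible building block for that final merge (a sketch, not strobealign code; `merge_adjacent_runs` is a hypothetical helper) is `std::inplace_merge`, which merges two adjacent sorted runs that share a single vector. It does try to allocate a temporary buffer internally, but falls back to a slower, truly in-place algorithm if allocation fails, so the extra memory is not strictly required:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: merge two adjacent sorted runs [0, mid) and [mid, n)
// that live back to back in one vector, e.g. two sorted per-chromosome
// ranges. std::inplace_merge uses a temporary buffer when it can get one
// and falls back to a slower in-place algorithm otherwise.
template <typename T>
void merge_adjacent_runs(std::vector<T>& v, std::size_t mid) {
    std::inplace_merge(v.begin(), v.begin() + mid, v.end());
}
```

Applied pairwise and repeatedly, this would merge all per-chromosome runs into one sorted vector without an explicit second buffer.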
poolSTL author here. Thanks for giving it a try @marcelm!
> For reference: The code for doing this is not super complicated (although I don’t understand everything, yet): https://github.com/alugowski/poolSTL/blob/89fde73f0a8b818df45082ebdfca5abee1ec02c4/include/poolstl/internal/ttp_impl.hpp#L132
The algorithm is simple: for p threads split the input range into p chunks. In parallel, run the sequential sort function on each chunk. Finish with a tree merge of the chunks, in parallel as much as possible.
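As a rough illustration of that scheme (a sketch under the description above, not poolSTL's actual implementation; `chunked_parallel_sort` is a made-up name), using plain `std::thread`, `std::sort`, and `std::inplace_merge`:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: split [first, last) into p chunks, sort each chunk in its own
// thread, then merge adjacent pairs of sorted runs level by level (a tree
// merge) until a single sorted run remains.
template <typename It>
void chunked_parallel_sort(It first, It last, unsigned p) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (p < 2 || n < 2) {
        std::sort(first, last);
        return;
    }
    // Chunk boundaries: {0, chunk, 2*chunk, ..., n}
    const std::size_t chunk = (n + p - 1) / p;
    std::vector<std::size_t> bounds{0};
    while (bounds.back() < n)
        bounds.push_back(std::min(bounds.back() + chunk, n));

    // Phase 1: sort each chunk in parallel.
    {
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i + 1 < bounds.size(); ++i)
            threads.emplace_back([&, i] {
                std::sort(first + bounds[i], first + bounds[i + 1]);
            });
        for (auto& t : threads)
            t.join();
    }

    // Phase 2: tree merge. Each level merges adjacent pairs of runs in
    // parallel, halving the number of runs.
    while (bounds.size() > 2) {
        std::vector<std::size_t> next{0};
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i + 2 < bounds.size(); i += 2) {
            threads.emplace_back([&, i] {
                std::inplace_merge(first + bounds[i], first + bounds[i + 1],
                                   first + bounds[i + 2]);
            });
            next.push_back(bounds[i + 2]);
        }
        if (bounds.size() % 2 == 0)  // odd number of runs: carry the last one
            next.push_back(bounds.back());
        for (auto& t : threads)
            t.join();
        bounds = std::move(next);
    }
}
```

In poolSTL the workers come from a reusable thread pool rather than being spawned per call, and the per-chunk sort function is exactly what `pluggable_sort` lets you swap out (e.g. for pdqsort).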
> So maybe we could adjust it to use pdqsort.
I just added a `pluggable_sort()` that makes this easy:

```cpp
poolstl::pluggable_sort(poolstl::par.on(pool), randstrobes.begin(), randstrobes.end(), pdqsort_branchless);
```
Cool, thanks! I opened #385 to track this. We’ll let you know how it goes.
This documents a change that I would not like to merge.
See https://github.com/alugowski/poolSTL
Sorting the randstrobes is currently a bottleneck in index generation as it does not run in parallel. This is an attempt to parallelize it.
poolSTL’s sort uses regular std::sort under the hood. We currently use pdqsort_branchless, which is about twice as fast as std::sort, so the parallel sort only breaks even with single-threaded pdqsort_branchless at about 2-3 threads. It gets faster with more threads, but not as much as one would perhaps expect. Here are the sorting runtimes for CHM13:
A further downside is that sorting is no longer in place, so memory usage goes up by a couple of gigabytes; this is another reason not to merge this change.