ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Failed experiment: Use poolstl to sort randstrobes in parallel #375

Closed marcelm closed 6 months ago

marcelm commented 6 months ago

This documents a change that I would not like to merge.

See https://github.com/alugowski/poolSTL

Sorting the randstrobes is currently a bottleneck in index generation as it does not run in parallel. This is an attempt to parallelize it.

poolstl’s sort uses regular std::sort under the hood. We currently use pdqsort_branchless, which is about twice as fast as std::sort, so parallel sorting breaks even with pdqsort_branchless at about 2-3 threads. It gets faster with more threads, but not as much as one would perhaps expect. Here are the sorting runtimes for CHM13:

Another issue is that sorting is no longer in place, so memory usage goes up by a couple of gigabytes, which is another reason for not making this change.

ksahlin commented 6 months ago

I see. Good to test alternative approaches nevertheless.

marcelm commented 6 months ago

For reference: The code for doing this is not super complicated (although I don’t understand everything, yet): https://github.com/alugowski/poolSTL/blob/89fde73f0a8b818df45082ebdfca5abee1ec02c4/include/poolstl/internal/ttp_impl.hpp#L132 So maybe we could adjust it to use pdqsort.

Also, we already create the randstrobes for each chromosome in parallel, so we could also sort each chromosome in parallel. Then the only remaining thing to do serially would be to merge the chromosomes into one sorted vector. I’m not sure how to do this without using extra memory, though.

alugowski commented 5 months ago

poolSTL author here. Thanks for giving it a try @marcelm!

For reference: The code for doing this is not super complicated (although I don’t understand everything, yet): https://github.com/alugowski/poolSTL/blob/89fde73f0a8b818df45082ebdfca5abee1ec02c4/include/poolstl/internal/ttp_impl.hpp#L132

The algorithm is simple: for p threads split the input range into p chunks. In parallel, run the sequential sort function on each chunk. Finish with a tree merge of the chunks, in parallel as much as possible.

So maybe we could adjust it to use pdqsort.

I just added a pluggable_sort() that makes this easy:

poolstl::pluggable_sort(poolstl::par.on(pool), randstrobes.begin(), randstrobes.end(), pdqsort_branchless);
marcelm commented 5 months ago

Cool, thanks! I opened #385 to track this. We’ll let you know how it goes.