ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Compute strobemers in parallel #307

Closed · marcelm closed this 1 year ago

marcelm commented 1 year ago

This was surprisingly easy to implement with the new memory layout. However, this may not be the way to go and I’m opening this PR mainly to document the experiment.

The approach here still requires two passes over the references: The first pass counts the randstrobes for each reference and then uses that to compute the total needed length of the randstrobes vector (as before) and also at which offset in the vector the randstrobes for a contig need to be stored. Then the vector is filled in parallel from multiple worker threads. They use the offset to write into the correct position in the vector.
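The two passes can be sketched roughly like this (a minimal illustration, not the actual strobealign code; `contig_offsets` and `fill_parallel` are hypothetical names, and the randstrobe element type is simplified to a plain integer):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Simplified stand-in for RefRandstrobe
using Randstrobe = std::uint64_t;

// Pass 1 result: given per-contig randstrobe counts, compute the write
// offset of each contig via an exclusive prefix sum.
std::vector<std::size_t> contig_offsets(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets(counts.size());
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), std::size_t{0});
    return offsets;
}

// Pass 2: fill a preallocated flat vector in parallel. Each worker handles
// one contig and writes into its own disjoint slice, so no locking is needed.
// (Here the per-contig randstrobes are precomputed; in the real code each
// worker would generate them on the fly.)
std::vector<Randstrobe> fill_parallel(
    const std::vector<std::vector<Randstrobe>>& per_contig,
    const std::vector<std::size_t>& offsets,
    std::size_t total
) {
    std::vector<Randstrobe> flat(total);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < per_contig.size(); ++i) {
        workers.emplace_back([&, i] {
            std::copy(per_contig[i].begin(), per_contig[i].end(),
                      flat.begin() + offsets[i]);
        });
    }
    for (auto& t : workers) {
        t.join();
    }
    return flat;
}
```

Because each worker writes only into its own disjoint slice, no synchronization is needed beyond the final join.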

This works, but the speedup is only ~2x because the other steps of index generation (mainly sorting) are not parallelized. Also, we still need the first randstrobe-counting pass, whereas we could maybe get a similar speedup by getting rid of it.

Here are some options for parallelization:

  1. We could keep this approach and also sort randstrobes for each contig in parallel; then do a $k$-way merge sort afterwards in order to get the final sorted randstrobes vector (needs a heap/std::priority_queue).
  2. We speed up the counting by counting only syncmers, not randstrobes. In theory, the numbers should be identical, and computing syncmers is faster, but there is a slight discrepancy that one would need to investigate. The number of randstrobes is never higher than the number of syncmers, however, so it already works as an upper bound: one could reserve space using the number of syncmers and then move entries around in the vector so that there are no empty slots ("defragment").
  3. We abandon the counting and use an estimate to reserve enough elements (this is #304). Then we let each thread compute randstrobes in chunks and add those chunks to the vector in whichever order they come in. Since the order of randstrobes then is no longer deterministic, we change sorting so that it no longer compares just the hashes, but all members of RefRandstrobe (as suggested by @luispedro).
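The k-way merge from option 1 could be sketched as follows (an illustrative stand-alone version with plain integers; the real code would merge per-contig sorted runs of RefRandstrobes, and `k_way_merge` is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge k individually sorted runs into one sorted vector using a min-heap
// keyed on each run's current head element.
std::vector<std::uint64_t> k_way_merge(const std::vector<std::vector<std::uint64_t>>& runs) {
    using Head = std::pair<std::uint64_t, std::size_t>;  // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);

    // Seed the heap with the first element of every non-empty run
    for (std::size_t i = 0; i < runs.size(); ++i) {
        if (!runs[i].empty()) {
            heap.emplace(runs[i][0], i);
        }
    }

    std::vector<std::uint64_t> merged;
    while (!heap.empty()) {
        auto [value, i] = heap.top();
        heap.pop();
        merged.push_back(value);
        // Refill from the run we just consumed from
        if (++pos[i] < runs[i].size()) {
            heap.emplace(runs[i][pos[i]], i);
        }
    }
    return merged;
}
```

Each pop/push is O(log k), so the merge is O(n log k) overall for n total elements.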
ksahlin commented 1 year ago

I like point 3 the best. But pdqsort(?) only works well with comparisons of single integer types, as you benchmarked before, right? So we would have a slowdown in sorting then.

Points 1 and 2 also sound good, with a slight preference for point 2. Do you know why the difference occurs? I thought the number of syncmers and randstrobes was identical, since we choose only the first strobe (syncmer) if there are no downstream syncmers within the sampling window, but maybe we have a different stop condition. If the stop condition differs at the ends of sequences, we should perhaps change it so that the single syncmer strobe becomes the randstrobe. This is what we do when no syncmers are within the max_dist seed size, for example due to a longer stretch of N's.

marcelm commented 1 year ago

> I like point 3 the best. But pdqsort(?) only works well with comparisons of single integer types, as you benchmarked before, right? So we would have a slowdown in sorting then.

pdqsort would work just as before since it just uses < when comparing two elements. We just need to change our definition of operator<:

https://github.com/ksahlin/strobealign/blob/70d2e7ba288247b835eba18601717ca38e78935a/src/randstrobes.hpp#L32-L34
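A fuller comparison along the lines of option 3 could look like this sketch; the member names are assumptions based on this thread, not the exact strobealign definition:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Sketch: compare the hash first, then the remaining members, so that the
// final sorted order no longer depends on the (non-deterministic) order in
// which parallel workers appended their chunks.
struct RefRandstrobe {
    std::uint64_t hash;
    std::uint32_t position;
    std::uint32_t m_packed;  // hypothetical: reference index and strobe offset

    bool operator<(const RefRandstrobe& other) const {
        return std::tie(hash, position, m_packed)
             < std::tie(other.hash, other.position, other.m_packed);
    }
};
```

std::tie gives a lexicographic comparison over all members, which is correct but branchy; pdqsort would still accept it unchanged.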

> Points 1 and 2 also sound good, with a slight preference for point 2. Do you know why the difference occurs? I thought the number of syncmers and randstrobes was identical, since we choose only the first strobe (syncmer) if there are no downstream syncmers within the sampling window, but maybe we have a different stop condition. If the stop condition differs at the ends of sequences, we should perhaps change it so that the single syncmer strobe becomes the randstrobe. This is what we do when no syncmers are within the max_dist seed size, for example due to a longer stretch of N's.

Good point, we could just ensure that there is no discrepancy anymore. I have spent a couple of minutes trying to figure out where it comes from, but I don’t know, yet. I’ll open an issue and look into this. Would be nice for consistency in any case.

ksahlin commented 1 year ago

Re pdqsort: I did not express myself well. I meant that it enjoys a speedup over std::sort for simple comparisons, but you observed that its sorting speed was similar to std::sort when tuples/structs were compared over multiple values (a benchmark in a pull request you did a while back). Did I misunderstand it?

> I have spent a couple of minutes trying to figure out where it comes from, but I don’t know, yet.

My guess would be in this method: https://github.com/ksahlin/strobealign/blob/main/src/randstrobes.cpp#L180 (perhaps this if-statement https://github.com/ksahlin/strobealign/blob/main/src/randstrobes.cpp#L188 breaks early at the sequence ends without producing a syncmer as a seed, depending on w_min).

Also, we have one RandstrobeIterator and one RandstrobeIterator2. I think RandstrobeIterator2 is used for the index and RandstrobeIterator for the reads? Perhaps we should rename them if so.

marcelm commented 1 year ago

> Did I misunderstand it?

No, I had just forgotten. This is probably because pdqsort_branchless doesn’t give as much of a speedup when the comparison function isn’t branchless. Maybe we can recover some of that speed if we manage to make it branchless.

I’ll open a PR fixing the discrepancy and also look into there being two RandstrobeIterators.

marcelm commented 1 year ago

Before merging #304 (that is, before working towards option 3), I wanted to do some measurements so that we are aware of the consequences.

Indexing runtimes (in seconds) on CHM13 are as follows.

| commit  | what                        | counting | generating | sorting | indexing | total |
|---------|-----------------------------|----------|------------|---------|----------|-------|
| e32475e | main, 1 thread              | 63       | 70         | 32      | 4        | 169   |
| e32475e | main, 8 threads             | 14       | 71         | 32      | 4        | 120   |
| b278ffd | PR #304, any no. of threads | 0        | 74         | 31      | 4        | 110   |
| 168763b | this PR, 8 threads          | 14       | 18         | 31      | 4        | 67    |

I also measured the following:

So if we need to switch to the more expensive sorting and don’t parallelize it, we won’t go faster than 80 seconds, which is slower than the parallelization in this PR.

I’ll investigate whether we can make the sorting faster by avoiding branching in the sorting function.

marcelm commented 1 year ago

Here’s a branchless comparison:

    bool operator<(const RefRandstrobe& other) const {
        // Tie-breaker: position in the high 32 bits, packed fields in the low 32
        uint64_t lower_this  = (uint64_t(this->position) << 32) | this->m_packed;
        uint64_t lower_other = (uint64_t(other.position) << 32) | other.m_packed;

        // Combine hash and tie-breaker into a single 128-bit key so that one
        // integer comparison replaces the branchy lexicographic comparison
        unsigned __int128 this128 = __int128(this->hash) << 64 | lower_this;
        unsigned __int128 other128 = __int128(other.hash) << 64 | lower_other;
        return this128 < other128;
    }

With this, sorting time goes up only a little, from 32 to 38 seconds. I’m not sure how portable the __int128 type is.
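For compilers without __int128 (e.g. MSVC), a portable equivalent would compare two 64-bit keys instead; whether this compiles down to branchless code is not guaranteed. A sketch, with the member layout assumed from this thread:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout following this thread, not the exact strobealign code.
struct RefRandstrobe {
    std::uint64_t hash;
    std::uint32_t position;
    std::uint32_t m_packed;

    bool operator<(const RefRandstrobe& other) const {
        // Pack the tie-breaker into one 64-bit key, as in the 128-bit version
        std::uint64_t lo  = (std::uint64_t(position) << 32) | m_packed;
        std::uint64_t olo = (std::uint64_t(other.position) << 32) | other.m_packed;
        // Same ordering as the single 128-bit comparison, using only
        // standard 64-bit integers
        return hash < other.hash || (hash == other.hash && lo < olo);
    }
};
```

This produces exactly the same ordering as the __int128 version; the trade-off is that the compiler may emit a branch for the short-circuit.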

ksahlin commented 1 year ago

Great comparison table, but I must admit I am not following what we are working towards in this commit :)

  1. What is implemented in "this PR, 8 threads"? (commit link broken)
  2. Assuming commit 168763b is option 3, don't we need to sort on multiple values in that case to guarantee identical output (i.e., it would not be 67 s)?

What is your opinion from this benchmark?

Edit: Also, based on yesterday’s discussion and your comparison table, option 2 may be a decent solution.

ksahlin commented 1 year ago

Does 168763b estimate the number of seeds per contig and then allocate one large vector and write the seeds to certain predefined intervals in the flat vector (ordered by the reference contigs appearing in the input file)? It is an impressive speedup.

marcelm commented 1 year ago

Sorry about my ramblings, I had to leave before really finishing the post.

I fixed the commit link. Commit 168763b2c34b4592c3be5ad12a35560df9e033b9 is just the two-pass method that has been implemented as stated in the original description of this PR:

> The first pass counts the randstrobes for each reference and then uses that to compute the total needed length of the randstrobes vector (as before) and also at which offset in the vector the randstrobes for a contig need to be stored. Then the vector is filled in parallel from multiple worker threads. They use the offset to write into the correct position in the vector.

Ignoring options 1 to 3 from above for the moment, we have two immediate alternatives:

a) Merge PR #304, which speeds up index generation by ~10-15% (120 s → 110 s).
b) Merge this PR, which speeds up index generation by ~80% (120 s → 67 s). By counting only syncmers, this can easily be reduced to ~60 s.

Option a) removes the separate counting step that b) relies on, which means that these alternatives are mutually exclusive as-is. The question is whether it is possible to have the cake and eat it, too. That is, how to parallelize while still removing the counting step.

> Does 168763b estimate the number of seeds per contig and then allocate one large vector and write the seeds to certain predefined intervals in the flat vector (ordered by the reference contigs appearing in the input file)?

Yes, except that it does not estimate but actually counts the number of seeds.

ksahlin commented 1 year ago

Ah okay, now I get it! Looks like this PR (option b) is favourable then. Or what do you think?

marcelm commented 1 year ago

> Looks like this PR (option b) is favourable then. Or what do you think?

I kind of think so, yes. Shall I merge this and prepare the release?

ksahlin commented 1 year ago

> I kind of think so, yes. Shall I merge this and prepare the release?

Yes, sounds good!