Closed. marcelm closed this 1 year ago.
I like point 3 the best. But pdqsort(?) only works with comparisons of single int types, as you benchmarked before, right? So we would have a slowdown in sorting then.

Points 1 and 2 also sound good, with a slight preference for point 2. Do you know why the difference occurs? I thought it was an identical number of syncmers and randstrobes since we choose only the first strobe (syncmer) if there are no downstream syncmers within the sampling window, but maybe we have a different stop condition. If the stop condition is different at the ends of sequences, we should perhaps change it so that the single syncmer strobe becomes the randstrobe. This is what we do if no syncmers are within the `max_dist` seed size, for example due to a longer stretch of N's.
> I like point 3 the best. But pdqsort(?) only works with comparisons of single int types, as you benchmarked before, right? So we would have a slowdown in sorting then.

pdqsort would work just as before since it just uses `<` when comparing two elements. We just need to change our definition of `operator<`.
> Points 1 and 2 also sound good, with a slight preference for point 2. Do you know why the difference occurs? I thought it was an identical number of syncmers and randstrobes since we choose only the first strobe (syncmer) if there are no downstream syncmers within the sampling window, but maybe we have a different stop condition. If the stop condition is different at the ends of sequences, we should perhaps change it so that the single syncmer strobe becomes the randstrobe. This is what we do if no syncmers are within the `max_dist` seed size, for example due to a longer stretch of N's.

Good point, we could just ensure that there is no discrepancy anymore. I have spent a couple of minutes trying to figure out where it comes from, but I don't know yet. I'll open an issue and look into this. It would be nice for consistency in any case.

Re pdqsort: I did not formulate myself well. I meant that it enjoys a speedup over `std::sort` for simple comparisons, but you observed it had similar sorting speed to `std::sort` when tuples/structs were compared over multiple values (a benchmark in a pull request you did a while back). Did I misunderstand it?
> I have spent a couple of minutes trying to figure out where it comes from, but I don't know yet.

My guess would be in this method: https://github.com/ksahlin/strobealign/blob/main/src/randstrobes.cpp#L180 (perhaps this if-statement https://github.com/ksahlin/strobealign/blob/main/src/randstrobes.cpp#L188 breaks early without producing a syncmer as seed at the ends, depending on w_min).

Also, we have one `RandstrobeIterator` and one `RandstrobeIterator2`. I think `RandstrobeIterator2` is used for the index and `RandstrobeIterator` for the reads? Perhaps we should rename them if so?
> Did I misunderstand it?

No, I had just forgotten. This is probably because `pdqsort_branchless` doesn't give as much of a speedup when the comparison function isn't branchless. Maybe we can recover some of that speed if we manage to make it branchless.

I'll open a PR fixing the discrepancy and also look into there being two `RandstrobeIterator`s.
Before merging #304, that is, working towards option 3, I wanted to do some measurements so that we are aware of the consequences.
Indexing runtimes (in seconds) on CHM13 are as follows.
| commit | what | counting | generating | sorting | indexing | total |
|---|---|---|---|---|---|---|
| e32475e | `main`, 1 thread | 63 | 70 | 32 | 4 | 169 |
| e32475e | `main`, 8 threads | 14 | 71 | 32 | 4 | 120 |
| b278ffd | PR #304, any no. of threads | 0 | 74 | 31 | 4 | 110 |
| 168763b | this PR, 8 threads | 14 | 18 | 31 | 4 | 67 |
I also measured the following:
So if we need to switch to the more expensive sorting and don’t parallelize it, we won’t go faster than 80 seconds, which is slower than the parallelization in this PR.
I’ll investigate whether we can make the sorting faster by avoiding branching in the sorting function.
Here’s a branchless comparison:

```cpp
bool operator<(const RefRandstrobe& other) const {
    uint64_t lower_this = (uint64_t(this->position) << 32) | this->m_packed;
    uint64_t lower_other = (uint64_t(other.position) << 32) | other.m_packed;
    unsigned __int128 this128 = __int128(this->hash) << 64 | lower_this;
    unsigned __int128 other128 = __int128(other.hash) << 64 | lower_other;
    return this128 < other128;
}
```
With this, sorting time goes up only a little, from 32 to 38 seconds. I’m not sure how portable the `__int128` type is.
Great comparison table, but I must admit I am not following what we are working towards with this commit :)
What is your opinion based on this benchmark?
Edit: Also, based on yesterday’s discussion and your comparison table, option 2 may be a decent solution.
Does 168763b estimate the number of seeds per contig and then allocate one large vector and write the seeds to certain predefined intervals in the flat vector (ordered by the reference contigs appearing in the input file)? It is an impressive speedup.
Sorry about my ramblings, I had to leave before really finishing the post.
I fixed the commit link. Commit 168763b2c34b4592c3be5ad12a35560df9e033b9 is just the two-pass method that has been implemented as stated in the original description of this PR:
The first pass counts the randstrobes for each reference and then uses that to compute the total needed length of the randstrobes vector (as before) and also at which offset in the vector the randstrobes for a contig need to be stored. Then the vector is filled in parallel from multiple worker threads. They use the offset to write into the correct position in the vector.
Ignoring options 1 to 3 from above for the moment, we have two immediate alternatives: a) Merge PR #304, which speeds up index generation by ~10-15% (120s → 110s) b) Merge this PR, which speeds up index generation by ~80% (120s → 67s). By counting only syncmers, this can be easily reduced to ~60s.
Option a) removes the separate counting step that b) relies on, which means that these alternatives are mutually exclusive as-is. The question is whether it is possible to have the cake and eat it, too. That is, how to parallelize while still removing the counting step.
> Does 168763b estimate the number of seeds per contig and then allocate one large vector and write the seeds to certain predefined intervals in the flat vector (ordered by the reference contigs appearing in the input file)?

Yes, except that it does not estimate but counts the number of seeds.
Ah okay, now I get it! Looks like this PR (option b) is favourable then. Or what do you think?
> Looks like this PR (option b) is favourable then. Or what do you think?

I kind of think so, yes. Shall I merge this and prepare the release?
> I kind of think so, yes. Shall I merge this and prepare the release?

Yes, sounds good!
This was surprisingly easy to implement with the new memory layout. However, this may not be the way to go and I’m opening this PR mainly to document the experiment.
The approach here still requires two passes over the references: The first pass counts the randstrobes for each reference and then uses that to compute the total needed length of the randstrobes vector (as before) and also at which offset in the vector the randstrobes for a contig need to be stored. Then the vector is filled in parallel from multiple worker threads. They use the offset to write into the correct position in the vector.
This works, but the speedup is only ~2X because the other steps of index generation (mainly sorting) are not parallel. Also, we still need the first randstrobe counting pass, whereas we could maybe get a similar speedup by getting rid of it.
Here are some options for parallelization:

- … (using a `std::priority_queue`).
- … `RefRandstrobe` (as suggested by @luispedro).