ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Avoid counting the seeds #340

Open marcelm opened 9 months ago

marcelm commented 9 months ago

Strobealign currently does two passes over the reference to generate the randstrobes vector: First it counts how many randstrobes there are for each scaffold (parallelized by scaffold) in order to allocate a randstrobes vector of the exact size and also to know at which offset the randstrobes for a scaffold should start in it. Then it fills each of these slices of the vector in the second pass, also parallelized by scaffold.

The idea of #314 was to not count the randstrobes, but to estimate their number. That PR no longer works because the scheme above requires the exact offsets.

Here is a suggestion for how to avoid the counting step and still be able to generate seeds in parallel. (I think this has lower priority, but wanted to note down the idea somewhere before closing #314.)

The idea would be to estimate the number of randstrobes for each individual scaffold and fill the vector based on that. We need to over-estimate the number (as in #314) so that we can be quite sure that the randstrobes fit. However, since the number of scaffolds can be large, the probability that at least one of the estimates is too low increases. So we need to ensure that each thread doesn’t overfill the slice of the vector it is supposed to fill, and then have a strategy for what to do with the extra randstrobes that don’t fit. Initially, we could just omit them. But since this should happen very rarely, we can also afford something expensive, such as writing them to a mutex-protected "overflow" vector.

Most of the time, fewer randstrobes would be generated than estimated. The leftover space in the vector could be filled with dummy randstrobes with a hash value of uint64_t(-1), which then get moved to the end of the vector during sorting, after which the vector can be shortened appropriately.

As a point of reference, here are current timings for generating the index for CHM13 using 8 threads:

  Time counting seeds: 12.41 s
  Time generating seeds: 20.07 s
  Time sorting seeds: 26.05 s
  Time generating hash table index: 4.06 s
Total time indexing: 62.60 s

So we could reduce indexing time from 63 to 50 seconds.

As I said, I don’t think trying this out should have a high priority since indexing is so fast anyway and the actual bottleneck at the moment is sorting (which is single-threaded).

ksahlin commented 9 months ago

Nice idea! I think your solution makes sense. Furthermore, I think it should not obstruct any current or future plans for the indexing.

However, I also agree that it has lower priority than some of the other stuff we have discussed.