ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Make strobealign work with large references #306

Closed by marcelm 1 year ago

marcelm commented 1 year ago

(Marking as draft because this is untested.)

This unconditionally switches the bucket start indices from 32-bit to 64-bit integers. This makes strobealign work with more than $2^{32}$ strobemers, but also doubles the memory usage of the index vector.

Closes #277 Closes #285

ksahlin commented 1 year ago

As I just commented in https://github.com/ksahlin/strobealign/issues/305, I think it is a great idea until templates are implemented.

marcelm commented 1 year ago

During testing, I found a couple of other types that needed fixing, but it works now!

I used the 100 bp single-end rye dataset for testing.

With the default settings, the number of strobemers stays below 2 billion, and we get a mapping rate of 99.2582% and an accuracy of 69.2576%. To test this PR, I used the same dataset with options -k 22 -s 22, which results in over 7 billion strobemers and needs 140 GB of RAM. The mapping rate was 99.2639% and accuracy 70.1665%.

I think this is a good indication that there’s nothing fundamentally wrong with the PR, so I’d say we should merge it and make a release (next week).

ksahlin commented 1 year ago

Sounds great. I will also run the standard benchmark that I typically do before a release; just let me know a commit, and I can start it whenever we have something we believe is release-ready.

marcelm commented 1 year ago

Do you want to run the test before or after merging?

ksahlin commented 1 year ago

I could test after merging.

ksahlin commented 1 year ago

I can also compare to a version using XXHASH for s-mers if you want. But I don’t see it in the commit network; maybe you didn’t push it yet?

marcelm commented 1 year ago

Ok, I’ll merge this now and then push a branch with the XXH changes.

marcelm commented 1 year ago

The syncmer XXH change is in branch syncmerhash (commit d8dc2562efae30d622dc8e83140a20b9cb14a5fd).

ksahlin commented 1 year ago

I have benchmarked the current develop (commit e0764b6, name v0.11.0) and the XXH change (commit https://github.com/ksahlin/strobealign/commit/d8dc2562efae30d622dc8e83140a20b9cb14a5fd, name XXH). See attached plots.

Main points:

  1. Memory consumption goes down considerably in v0.11.0 (but we knew that).
  2. The new index also seems to be slightly, but consistently, faster for short read lengths (~50-125 nt) on the three larger references.
  3. The XXH branch does not introduce a big computational overhead; it is nearly as fast as v0.11.0 (we should consider including it in the release).
  4. I don't observe benefits from XXH as large as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard-based evaluation script? I get a 0.15% increase in mapping accuracy for CHM, 0.17% for Maize, and 0.28% for rye for 50 nt paired-end reads. The mapping-only (no extension) accuracy increases considerably, though.
  5. Mapping rate also goes up significantly for 50 nt reads with XXH compared to the other versions.

memory_plot.pdf time_plot.pdf accuracy_plot.pdf percentage_aligned_plot.pdf

ksahlin commented 1 year ago

I forgot to mention the indexing speed of the two latest versions (see below, for 150 nt reads).

In summary, both offer significant speedup over previous versions. XXH is 7-15% slower.

Total indexing time, 150 bp reads:

| Reference | v0.11.0  | XXH      |
|-----------|----------|----------|
| Maize     | 92.44 s  | 106.39 s |
| CHM       | 137.94 s | 153.05 s |
| Rye       | 343.51 s | 368.90 s |

marcelm commented 1 year ago

> 4. I don't observe benefits from XXH as large as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard-based evaluation script? I get a 0.15% increase in mapping accuracy for CHM, 0.17% for Maize, and 0.28% for rye for 50 nt paired-end reads. The mapping-only (no extension) accuracy increases considerably, though.

I usually don’t report the Jaccard-based numbers at the moment because I want the numbers to be comparable to what your evaluation script computes. I think it was important for the soft-clipping evaluation to use them, but for many other algorithm changes, it doesn’t matter that much.

The difference is probably because I reported numbers for single-end read mapping. I usually test on single reads because it is faster and because I want to see the effect without mate-based rescue. That exaggerates differences a bit.

ksahlin commented 1 year ago

I see, makes sense, thanks! Still worth changing to XXH-produced s-mers, in my opinion.