sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

EXP: skipmers #3395

Open bluegenes opened 1 week ago

bluegenes commented 1 week ago

try out skipmer generation

codspeed-hq[bot] commented 1 week ago

CodSpeed Performance Report

Merging #3395 will not alter performance

Comparing try-skipmers (d7f59cf) with latest (94d37f9)

Summary

✅ 21 untouched benchmarks

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 80.95238% with 8 lines in your changes missing coverage. Please review.

Project coverage is 86.45%. Comparing base (94d37f9) to head (d7f59cf).

Files with missing lines Patch % Lines
src/core/src/signature.rs 83.33% 5 Missing :warning:
src/core/src/encodings.rs 33.33% 2 Missing :warning:
src/core/src/sketch/minhash.rs 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## latest #3395 +/- ## ========================================== - Coverage 86.45% 86.45% -0.01% ========================================== Files 137 137 Lines 16100 16141 +41 Branches 2219 2219 ========================================== + Hits 13920 13955 +35 - Misses 1873 1879 +6 Partials 307 307 ``` | [Flag](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3395/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | Coverage Δ | | |---|---|---| | [hypothesis-py](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3395/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `25.43% <ø> (ø)` | | | [python](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3395/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `92.40% <ø> (ø)` | | | [rust](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3395/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `62.56% <80.95%> (+0.29%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.


🚨 Try these New Features:

mr-eyes commented 2 days ago

Hi Tessa, Will you be allowing user-defined n, m,k here? And will you decide to construct the Skipmer after accepting the hash value range or you will construct all Skipmers then hash them and either accept or skip?

bluegenes commented 2 days ago

Hi Tessa, Will you be allowing user-defined n, m,k here?

Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

And will you decide to construct the skipmer after accepting the hash value range or you will construct all Skipmers then hash them and either accept or skip?

By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

mr-eyes commented 2 days ago

Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

bluegenes commented 1 day ago

Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

Good point. I haven't done any thinking about optimization yet.

Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?

mr-eyes commented 1 day ago

Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

Parameterizing it should be an ideal solution, yes! Thank you!

After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?

I haven't documented any benchmark here, but I believe I did it that way for performance.

bluegenes commented 11 hours ago

Parameterizing it should be an ideal solution, yes! Thank you!

Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?

I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.

Open to other ideas, though!

mr-eyes commented 10 hours ago

Parameterizing it should be an ideal solution, yes! Thank you!

Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?

I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.

Open to other ideas, though!

I don't really have a use case in mind for different configurations. So this is good enough for the implementation, and as you said, we could add more later if needed.