Open bluegenes opened 1 week ago
Comparing try-skipmers (d7f59cf) with latest (94d37f9): ✅ 21 untouched benchmarks
Attention: Patch coverage is 80.95238% with 8 lines in your changes missing coverage. Please review. Project coverage is 86.45%. Comparing base (94d37f9) to head (d7f59cf).
Hi Tessa, will you be allowing user-defined n, m, and k here? And will you construct each skipmer only after accepting the hash value range, or will you construct all skipmers, hash them, and then either accept or skip each one?
> Hi Tessa, Will you be allowing user-defined n, m,k here?
Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
> And will you decide to construct the skipmer after accepting the hash value range or you will construct all Skipmers then hash them and either accept or skip?
By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
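A minimal sketch of the "construct all, then add if it meets the threshold" flow described above. The names `hash64` and `fracminhash` are hypothetical, and BLAKE2 is only a standard-library stand-in for the MurmurHash3 that sourmash uses; the exact rounding and comparison in sourmash may also differ.

```python
import hashlib


def hash64(kmer: str) -> int:
    """Stand-in 64-bit hash (assumption: NOT the MurmurHash3 used by sourmash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash(kmers, scaled: int) -> set:
    """Hash every k-mer (or skipmer), then keep hashes at or below the threshold."""
    # Roughly 1/scaled of the 64-bit hash space passes the threshold.
    max_hash = 2**64 // scaled
    return {h for h in (hash64(kmer) for kmer in kmers) if h <= max_hash}
```

Every candidate is built and hashed before the threshold check, which is why more skipmers per sequence means more work regardless of how many hashes are kept.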
> Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.
> By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.
> Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb
> > Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
> Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.
Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m, n variables in to make future changes easier.
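As a rough illustration of what m= and n= in a param string could look like, here is a small parser sketch. The function name `parse_skipmer_params`, the default values, and the choice to ignore unrelated keys are all assumptions for illustration; this is not the sourmash param-string parser.

```python
def parse_skipmer_params(param_string: str) -> dict:
    """Parse a comma-separated param string such as 'k=21,m=1,n=3'."""
    # Hypothetical defaults, chosen only for this sketch.
    params = {"k": 31, "m": 2, "n": 3}
    for item in param_string.split(","):
        key, sep, value = item.strip().partition("=")
        # Keys this sketch does not know about (e.g. scaled) are skipped.
        if sep and key in params:
            params[key] = int(value)
    # A skipmer keeps m bases out of every cycle of n bases.
    if not 1 <= params["m"] <= params["n"]:
        raise ValueError("expected 1 <= m <= n")
    return params
```

For example, `parse_skipmer_params("k=21,m=1,n=3")` returns `{"k": 21, "m": 1, "n": 3}`.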
> > By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
> Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.
Good point. I haven't done any thinking about optimization yet.
> > Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
> Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb
After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?
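The two strategies being compared can be sketched side by side. This is an illustration under stated assumptions, not a transcription of KD_skipmers.cpp or of the code in this PR: the function names are hypothetical, reverse complements and hashing are left out, and the "reduced" variant steps through the reduced sequence m bases at a time so that every window starts at the beginning of a cycle.

```python
def skipmer_offsets(m: int, n: int, k: int) -> list:
    """Offsets of the k kept bases: the first m bases of each cycle of n."""
    return [(i // m) * n + (i % m) for i in range(k)]


def skipmers_direct(seq: str, m: int, n: int, k: int) -> list:
    """Generate skipmers as you go, one per start position."""
    offsets = skipmer_offsets(m, n, k)
    span = offsets[-1] + 1  # sequence length covered by one skipmer
    return [
        "".join(seq[start + off] for off in offsets)
        for start in range(len(seq) - span + 1)
    ]


def skipmers_reduced(seq: str, m: int, n: int, k: int) -> list:
    """For each of the n frames, drop skipped bases, then take k-mers."""
    by_start = {}
    for frame in range(n):
        kept = [p for p in range(frame, len(seq)) if (p - frame) % n < m]
        reduced = "".join(seq[p] for p in kept)
        # Step by m so each window begins at the start of a cycle.
        for j in range(0, len(reduced) - k + 1, m):
            by_start[kept[j]] = reduced[j:j + k]
    return [by_start[start] for start in sorted(by_start)]
```

With m=2, n=3, k=4 the kept offsets are 0, 1, 3, 4, so `skipmers_direct("ACGTACGTAC", 2, 3, 4)` starts with "ACTA", and both functions return the same list. Whether the reduced approach is actually faster would need a benchmark.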
> Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m, n variables in to make future changes easier.
Parameterizing it should be an ideal solution, yes! Thank you!
> After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?
I haven't documented any benchmark here, but I believe I did it that way for performance.
> Parameterizing it should be an ideal solution, yes! Thank you!
Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?
I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.
Open to other ideas, though!
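The idea of one enum variant per skipmer shape can be sketched as follows. The class, member, and function names here are hypothetical and only mirror the names floated above; the real enum lives in the sourmash core and may look different.

```python
from enum import Enum


class HashFunction(Enum):
    """Hypothetical stand-in for the hash function (moltype) enum."""
    MURMUR64_DNA = "dna"
    MURMUR64_SKIPM1N3 = "skipm1n3"
    MURMUR64_SKIPM2N3 = "skipm2n3"


# (m, n) shape implied by each skipmer variant.
SKIPMER_SHAPES = {
    HashFunction.MURMUR64_SKIPM1N3: (1, 3),
    HashFunction.MURMUR64_SKIPM2N3: (2, 3),
}


def check_compatible(a: HashFunction, b: HashFunction) -> None:
    """Refuse to compare sketches built with different hash functions."""
    if a is not b:
        raise ValueError(f"incompatible sketches: {a.name} vs {b.name}")
```

Because m and n are fixed by the variant, two sketches with the same variant are guaranteed to use the same skipmer shape, and adding another shape later means adding one more member.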
> > Parameterizing it should be an ideal solution, yes! Thank you!
> Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?
> I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.
> Open to other ideas, though!
I don't really have a use case in mind for different configurations. So this is good enough for the implementation, and as you said, we could add more later if needed.
try out skipmer generation