Open ctb opened 1 year ago
I think it's doable to convert this tool's output to a sourmash signature. It would only fail on 'scaled'-related operations, since the hashing is different.
I would love to work on, or help create, a plugin that converts different input files (KMC, kProcessor, etc.) to a sourmash sketch.
(as long as it's a FracMinHash bottom sketch, the scaled stuff should work fine! it's all based on numbers not the specific hashing approach 🤷 )
cc @St4NNi :)
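To illustrate why the scaled operations depend only on the hash values and not on which hash function produced them, here is a minimal sketch. It assumes 64-bit hashes and a retention threshold of roughly u64::MAX / scaled; toy_hash, max_hash, sketch, and downsample are hypothetical names for this example and are not sourmash or jam APIs.

```rust
/// Stand-in 64-bit hash (the splitmix64 finalizer). Assumption: any
/// reasonably uniform 64-bit hash could be substituted here.
fn toy_hash(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Largest hash value retained for a given `scaled` (assumed threshold).
fn max_hash(scaled: u64) -> u64 {
    assert!(scaled >= 1, "scaled must be at least 1");
    u64::MAX / scaled
}

/// Build a FracMinHash-style sketch: keep every hash at or below the threshold.
fn sketch(items: impl IntoIterator<Item = u64>, scaled: u64) -> Vec<u64> {
    let limit = max_hash(scaled);
    let mut kept: Vec<u64> = items
        .into_iter()
        .map(toy_hash)
        .filter(|&h| h <= limit)
        .collect();
    kept.sort_unstable();
    kept.dedup();
    kept
}

/// Downsample an existing sketch to a larger `scaled`. Only the stored
/// numbers are inspected; the hash function is never called again.
fn downsample(sketch: &[u64], scaled: u64) -> Vec<u64> {
    let limit = max_hash(scaled);
    sketch.iter().copied().filter(|&h| h <= limit).collect()
}
```

Under these assumptions, downsampling a scaled=10 sketch to scaled=100 gives the same hashes as sketching at scaled=100 directly, whatever hash function was used. Comparing two sketches is still only meaningful when both were built with the same hash function.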
I see what you are saying, and yes! Thanks.
Hi everyone, and thanks for the shout-out, @ctb. jam was more or less born out of the same curiosity, but after taking a closer look, a lot of other ideas for tailoring minhash to some of our specific problems popped up.
Currently, jam is in a fairly early stage, and the output format has not yet settled on anything stable, but I am also curious about how different hashing algorithms will perform in sourmash, so I was thinking of adding an output parameter that creates sourmash-compatible sketches directly.
The only thing that would be a little odd is that sourmash::encodings::HashFunctions has no real custom option or similar and is not even #[non_exhaustive], so for now the algorithm would need to pretend to be murmur64_DNA.
That's a great point! I'm not saying you should lie, but... you can lie at the Signature level, but not at the Sketch level, so supporting a Custom(String) variant in HashFunctions seems the way to go!
Sounds reasonable to me. In a first iteration I could lie at the Sketch level (and tell the truth at the Signature level); as long as both sketches use the same alg it should be fine.
Regarding the custom string on the HashFunctions enum, I don't know if it would be that easy, since it is #[repr(u32)] and I'm afraid that changing this would cause all sorts of side effects. Leaving this as #[repr(u32)] but adding a field would effectively make this enum non-primitive and result in something similar to #[repr(C)], preventing any type casts with as.
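A small sketch of the concern above, using two hypothetical enums (PlainHashFunctions and ExtendedHashFunctions). They are not the actual sourmash definitions; the variant names and numeric codes are illustrative assumptions. A fieldless #[repr(u32)] enum can be cast with as, but once one variant carries a String the cast stops compiling, so any numeric code would have to come from an explicit method such as the assumed code() below.

```rust
/// Fieldless enum: `as` casts to the integer representation are allowed.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlainHashFunctions {
    Murmur64Dna = 1,
    Murmur64Protein = 2,
}

/// Same repr attribute, but one variant carries data. The enum is no longer
/// fieldless, so `ExtendedHashFunctions::Murmur64Dna as u32` does not compile.
#[repr(u32)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum ExtendedHashFunctions {
    Murmur64Dna,
    Murmur64Protein,
    Custom(String),
}

impl ExtendedHashFunctions {
    /// Hypothetical replacement for the `as u32` cast. The value 100 for
    /// custom algorithms is an arbitrary choice for this example.
    fn code(&self) -> u32 {
        match self {
            Self::Murmur64Dna => 1,
            Self::Murmur64Protein => 2,
            Self::Custom(_) => 100,
        }
    }
}
```

With these definitions, PlainHashFunctions::Murmur64Dna as u32 evaluates to 1, while every existing as cast on the extended enum would need to be rewritten to call code() or something similar, which is the kind of side effect described above.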
curious how well our underlying infrastructure can work for handling other kinds of fracminhash sequences! maybe could be explored using plugins.
https://github.com/St4NNi/jam-rs