sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

import other kinds of fracminhash etc? #2710

Open ctb opened 1 year ago

ctb commented 1 year ago

curious how well our underlying infrastructure can work for handling other kinds of fracminhash sequences! maybe could be explored using plugins.

https://github.com/St4NNi/jam-rs

mr-eyes commented 1 year ago

I think it's doable to convert this tool's output to a sourmash signature. It will only fail on 'scale'-related operations, since the hashing is different.

mr-eyes commented 1 year ago

I would love to work on, or help create, a plugin that converts different input files (KMC, kProcessor, etc.) to a sourmash sketch.

ctb commented 1 year ago

(as long as it's a FracMinHash bottom sketch, the scaled stuff should work fine! it's all based on numbers not the specific hashing approach 🤷 )
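To illustrate why the scaled operations depend only on the numeric hash values, here is a minimal Rust sketch. It assumes the common FracMinHash convention of keeping every hash at or below `u64::MAX / scaled`; the function names are hypothetical and this is not sourmash's actual code. Nothing in it cares which hash function produced the numbers, as long as both sketches used the same one.

```rust
use std::collections::BTreeSet;

/// Largest hash value retained for a given `scaled` factor.
/// Assumption: threshold = u64::MAX / scaled (rounding details may differ in real tools).
fn max_hash_for_scaled(scaled: u64) -> u64 {
    u64::MAX / scaled
}

/// Build a FracMinHash-style sketch: keep only hashes at or below the threshold.
fn frac_minhash(hashes: &[u64], scaled: u64) -> BTreeSet<u64> {
    let max_hash = max_hash_for_scaled(scaled);
    hashes.iter().copied().filter(|&h| h <= max_hash).collect()
}

/// Downsample an existing sketch to a larger `scaled` value.
/// Only the numeric values are inspected, never the hash function itself.
fn downsample(sketch: &BTreeSet<u64>, new_scaled: u64) -> BTreeSet<u64> {
    let max_hash = max_hash_for_scaled(new_scaled);
    sketch.iter().copied().filter(|&h| h <= max_hash).collect()
}

/// Jaccard similarity between two sketches built with the same hash function and `scaled`.
fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

fn main() {
    // Made-up hash values standing in for the output of any 64-bit hash function.
    let a = frac_minhash(&[10, 20, u64::MAX], 2);
    let b = frac_minhash(&[20, 30, u64::MAX - 1], 2);
    println!("sketch sizes: {} and {}", a.len(), b.len());
    println!("jaccard: {:.3}", jaccard(&a, &b));
    println!("downsampled size: {}", downsample(&a, u64::MAX).len());
}
```

The one thing that cannot work is comparing sketches produced by two different hash functions, since the same k-mer would then map to unrelated numbers.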

ctb commented 1 year ago

cc @St4NNi :)

mr-eyes commented 1 year ago

> (as long as it's a FracMinHash bottom sketch, the scaled stuff should work fine! it's all based on numbers not the specific hashing approach 🤷 )

I see what you are saying, and yes! Thanks.

St4NNi commented 1 year ago

Hi everyone, and thanks for the shout out @ctb . jam was more or less born out of the same curiosity, but after taking a closer look at it, a lot of other ideas for tailoring minhash to some of our specific problems popped up.

Currently, jam is in a fairly early stage, and the output format has not yet settled on anything stable, but I am also curious about how different hashing algorithms will perform in sourmash, so I was thinking of adding an output parameter that creates sourmash-compatible sketches directly.

The only thing that would be a little odd is that sourmash::encodings::HashFunctions has no real custom option or similar, and is not even #[non_exhaustive], so for now the algorithm would need to pretend to be murmur64_DNA.

luizirber commented 1 year ago

> The only thing that would be a little odd is that sourmash::encodings::HashFunctions has no real custom option or similar, and is not even #[non_exhaustive], so for now the algorithm would need to pretend to be murmur64_DNA.

That's a great point! I'm not saying you should lie, but... you can lie at the Signature level, but not at the Sketch level, so supporting a Custom(String) variant in HashFunctions seems the way to go!
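As a rough illustration of what a `Custom(String)` variant could look like, here is a minimal sketch. The type `HashFn`, its variant names, and the custom name `"xxh3_DNA"` are all hypothetical stand-ins; this is not the actual definition of `sourmash::encodings::HashFunctions`.

```rust
use std::fmt;

/// Hypothetical stand-in for a hash-function tag with an open-ended variant.
#[derive(Debug, Clone, PartialEq, Eq)]
enum HashFn {
    Murmur64Dna,
    /// Any hash function the library does not know by name.
    Custom(String),
}

impl fmt::Display for HashFn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HashFn::Murmur64Dna => write!(f, "murmur64_DNA"),
            HashFn::Custom(name) => write!(f, "{}", name),
        }
    }
}

impl From<&str> for HashFn {
    /// Known names map to dedicated variants; everything else is preserved as Custom.
    fn from(s: &str) -> Self {
        match s {
            "murmur64_DNA" => HashFn::Murmur64Dna,
            other => HashFn::Custom(other.to_string()),
        }
    }
}

/// Two sketches are only comparable when they were built with the same hash function.
fn comparable(a: &HashFn, b: &HashFn) -> bool {
    a == b
}

fn main() {
    let builtin = HashFn::from("murmur64_DNA");
    let custom = HashFn::from("xxh3_DNA"); // hypothetical custom hash name
    println!("{} vs {}: comparable = {}", builtin, custom, comparable(&builtin, &custom));
}
```

With a variant like this, a third-party sketch could state its real hash function, and comparisons across different hash functions could be rejected rather than silently producing meaningless results.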

St4NNi commented 1 year ago

> That's a great point! I'm not saying you should lie, but... you can lie at the Signature level, but not at the Sketch level, so supporting a Custom(String) variant in HashFunctions seems the way to go!

Sounds reasonable to me. In a first iteration I could lie on the Sketch level (and tell the truth on the Signature level); as long as both sketches use the same algorithm, it should be fine.

Regarding the custom string on the HashFunctions enum, I don't know if it would be that easy, since it is #[repr(u32)] and I'm afraid that changing this would cause all sorts of side effects.

Leaving this as #[repr(u32)] but adding a field would effectively make the enum non-primitive, resulting in something similar to #[repr(C)] and preventing any type casts with as.
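The casting concern can be reproduced in a small standalone example. Both enums below are hypothetical and only mimic the shape under discussion; the `code()` method and its sentinel value `0` are assumptions made for illustration, not anything sourmash defines.

```rust
/// Fieldless enum: `as u32` casts work directly.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Fieldless {
    Murmur64Dna = 1,
    Other = 2, // hypothetical second variant
}

/// Same idea with a data-carrying variant: still `#[repr(u32)]`,
/// but `as` casts are no longer permitted on this type.
#[repr(u32)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum WithCustom {
    Murmur64Dna,
    Custom(String),
}

impl WithCustom {
    /// Explicit replacement for the `as u32` cast.
    /// Assumption: 0 is used here as a made-up sentinel for custom hash functions.
    fn code(&self) -> u32 {
        match self {
            WithCustom::Murmur64Dna => 1,
            WithCustom::Custom(_) => 0,
        }
    }
}

fn main() {
    // Fine: primitive cast on a fieldless enum.
    let plain = Fieldless::Murmur64Dna as u32;
    let other = Fieldless::Other as u32;

    // Would not compile: the compiler rejects this as a non-primitive cast.
    // let bad = WithCustom::Murmur64Dna as u32;

    let builtin = WithCustom::Murmur64Dna.code();
    let custom = WithCustom::Custom("my_hash".to_string()).code();
    println!("{} {} {} {}", plain, other, builtin, custom);
}
```

Any existing code that relies on `as u32` (for example, across an FFI boundary) would need to move to an explicit conversion like this, which is the kind of side effect the comment above is worried about.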