marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Exporting mash hashes for interoperability #27

Open ctb opened 8 years ago

ctb commented 8 years ago

Hi all,

First, some background:

It seems like sourmash is going to be a thing; I'm building it into a metagenomics data exploration tool, and it's already integrated into https://github.com/dib-lab/khmer/ in some interesting and useful ways. Before it becomes too much of a thing, I'm interested in harmonizing with what you've done with mash, both out of gratitude and because it'd be kind of stupid to have multiple incompatible MinHash implementations out there. Interoperability would be really handy!

So, on the topic of interop, I poked around under the hood of mash, and am happy to report that I can swizzle sourmash over to use your exact hash function and seed; I will do so forthwith.

It seems like it would be relatively simple for me to write a parser for your .msh files, but that would depend on capnproto, I think, so it seems better for this to live in mash itself. So, what do you think about a 'dump' command for sketches? This would be an explicit "data transfer" format that we could use to move sketches between MinHash software implementations. I'd guess that something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.
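To make the idea concrete, here is a rough Python sketch of what such a minimal interchange file could look like. Everything in it is an assumption for illustration: the column names, the space-separated hash list, the checksum recipe (md5 over the sorted hashes as decimal text), and the helper names `dump_sketches` / `load_sketches` are hypothetical and are not an existing mash or sourmash format.

```python
import csv
import hashlib
import io

# Hypothetical column layout: one CSV row per sketch.
FIELDS = ["hash_function", "seed", "ksize", "name", "md5", "hashes"]


def hashes_md5(hashes):
    """md5 over the sorted hashes as decimal text (an assumed recipe,
    not necessarily how sourmash computes its signature checksum)."""
    text = " ".join(str(h) for h in sorted(hashes))
    return hashlib.md5(text.encode("ascii")).hexdigest()


def dump_sketches(out, sketches):
    """Write sketches (dicts with the non-md5 FIELDS as keys) as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for s in sketches:
        ordered = sorted(s["hashes"])
        writer.writerow({
            "hash_function": s["hash_function"],
            "seed": s["seed"],
            "ksize": s["ksize"],
            "name": s["name"],
            "md5": hashes_md5(ordered),
            "hashes": " ".join(str(h) for h in ordered),
        })


def load_sketches(inp):
    """Read sketches back, rejecting rows whose checksum does not match."""
    sketches = []
    for row in csv.DictReader(inp):
        hashes = [int(h) for h in row["hashes"].split()]
        if hashes_md5(hashes) != row["md5"]:
            raise ValueError(f"checksum mismatch for {row['name']!r}")
        sketches.append({
            "hash_function": row["hash_function"],
            "seed": int(row["seed"]),
            "ksize": int(row["ksize"]),
            "name": row["name"],
            "hashes": hashes,
        })
    return sketches


# Round trip with made-up values; the hash function label and seed
# here are placeholders, not values taken from mash.
buf = io.StringIO()
dump_sketches(buf, [{
    "hash_function": "example-hash",
    "seed": 1,
    "ksize": 21,
    "name": "example genome",
    "hashes": [30, 10, 20],
}])
buf.seek(0)
restored = load_sketches(buf)
```

Recording the hash function and seed in every row is the point of the exercise: two tools can only compare sketches if those match, so a reader can refuse a file up front rather than silently compute meaningless distances.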

If this is not antithetical to the very principles on which mash was founded, then great! Let me know! And I'm happy to whip up a prototype and submit a pull request - I was thinking of adding a new command, 'mash dump'. Alternate ideas very welcome.

cc @luizirber

kescobo commented 7 years ago

Thanks for the rapid response, everyone. Happily, there appears to be a Julia implementation of murmur3, so I'll give that a shot and compare it to @ctb's Python implementation.

@ondovb - thank you also for specifically mentioning the fact that it's an ASCII string. Julia uses UTF-8 strings by default, and I likely would have spent a long time banging my head against that difference if you hadn't brought it up.

cc @edawson