richarddurbin / modimizer

a toolset for fast DNA read set matching and assembly using a new type of reduced kmer
37 stars 2 forks source link

some thoughts on modulo hash & modimizer, and connections with sourmash 'scaled' signatures #1

Open ctb opened 5 years ago

ctb commented 5 years ago

hi @richarddurbin, @luizirber pointed me at this repo and I wanted to drop you a note --

we have been using an analogous concept in sourmash, that is identical in concept to modulo hash but uses a slightly different technical approach. (modulo hash was coined by Broder in his 1997 paper on MinHash; see also the mash screen blog post from Adam Phillippy.)

briefly, in sourmash we choose a max_hash value below which all hash values are kept; this max_hash is the inverse of the density, which we refer to as the scaled value. (max_hash = 2**64/scaled) This gives us slightly more options for resolution than modulo hash, and also has the advantage of being interconvertible with mash-style MinHash approaches.

we have this set of notes on the advantage of modulo hash, as well as some API documentation that goes further into depth on interconvertibility.

our code is not so pretty but here is the max_hash implementation.

we have a draft f1000research software paper nearly written that I'd be happy to send you if you want to see more polished prose :)

happy to discuss more!

other than saying hi, the main purpose of this issue is to recommend that you investigate (or at least support) mash compatibility by using murmurhash hashing with a seed of 42 and reverse complement handling. see our example Python code here. murmurhash is less efficient than rolling hash but on the other hand it's nice to be able to interconvert between these efforts!

richarddurbin commented 5 years ago

Hello Titus,

Good to hear from you. Yes, I am not surprised this is a standard idea. Thanks for the notes. Sergey Koren also pointed me to some references. I can see that it is effectively equivalent to a fixed hash value cutoff, rather than selecting a fraction of the hash values.

I also make my hash reverse-complement independent, in the same way as others by taking the min of the hash of the kmer and its reverse-complement, but I also return whether it was the min or not. This allows me to keep track of whether chains of hash hits are all in the right direction as well as co-linear.

Richard

On 3 Feb 2019, at 18:08, C. Titus Brown notifications@github.com wrote:

hi @richarddurbin https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_richarddurbin&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=Z6OVI8KGuTC9E4qeDESUTSHXfZan3nykS3PMlXD2dyk&e=, @luizirber https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_luizirber&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=cnSNI-l5TOFgXmpBEPE6PdCkvsd0_eMyRTbhUu35FT8&e= pointed me at this repo and I wanted to drop you a note --

we have been using an analogous concept in sourmash https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dib-2Dlab_sourmash_&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=5TKVlqNijRl1lh0dHnswqHQ6PxgySUt-Z_35R0ujJiA&e=, that is identical in concept to modulo hash but uses a slightly different technical approach. (modulo hash was coined by Broder in his 1997 paper on MinHash; see also the mash screen blog post https://urldefense.proofpoint.com/v2/url?u=https-3A__genomeinformatics.github.io_mash-2Dscreen_&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=I6PwSlEIyjNybjKE9iH3P3AnO3RwOjFB7qCaMUap3Fo&e= from Adam Phillippy.)

briefly, in sourmash we choose a max_hash value below which all hash values are kept; this max_hash is the inverse of the density, which we refer to as the scaled value. (max_hash = 2**64/scaled) This gives us slightly more options for resolution than modulo hash, and also has the advantage of being interconvertible with mash-style MinHash approaches.

we have this set of notes https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dib-2Dlab_sourmash_issues_606&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=u8u9BevRLZiNsi-AMUrOVPuaqIPto03vc_tHlc3X6n8&e= on the advantage of modulo hash, as well as some API documentation https://urldefense.proofpoint.com/v2/url?u=https-3A__sourmash.readthedocs.io_en_latest_api-2Dexample.html-23sourmash-2Dminhash-2Dobjects-2Dand-2Dmanipulations&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=Wj90WBSvvL1814oI5BKEbXwTJF3UpqjPyocGdQgeS3w&e= that goes further into depth on interconvertibility.

our code is not so pretty but here is the max_hash implementation https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dib-2Dlab_sourmash_blob_master_sourmash_kmer-5Fmin-5Fhash.hh-23L99&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=_RgC_cwBVxMmako_sy2kj_Oc9VB0UezaquDH3JhMkeY&e=.

we have a draft f1000research software paper nearly written that I'd be happy to send you if you want to see more polished prose :)

happy to discuss more!

other than saying hi, the main purpose of this issue is to recommend that you investigate (or at least support) mash compatibility by using murmurhash hashing with a seed of 42 and reverse complement handling. see our example Python code here https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dib-2Dlab_sourmash_blob_master_utils_compute-2Ddna-2Dmh-2Danother-2Dway.py-23L59&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=Fh0UkYPS_sA22kZnVMZP35RiUBo0xGtcroG9S20GRoE&e=. murmurhash is less efficient than rolling hash but on the other hand it's nice to be able to interconvert between these efforts!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_richarddurbin_modimizer_issues_1&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=1OyHbV0aP-n1G-u-Z95ri3XY-0CMPtcZOfDLZ8b4jsw&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADRb5oPIrZBUsueEQhMXRTDJgshWtW6mks5vJyW4gaJpZM4agO3E&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=dWoaB_dLvbwIyJ3o4QDQ0ITGIFbpsLNoOfBtI0wHCSw&s=tnFFFV2q9j5iH8Wg38SRu8SWX19mlaohP8ue7qDIAXw&e=.

ctb commented 5 years ago

nice! looking forward to seeing where you go with this project ;)

ctb commented 3 years ago

hi Richard, @luizirber (now Dr. @luizirber :) wrote this up, together with many associated benchmarks, for his thesis - if you're interested in seeing what we've done with it, the link is here:

https://github.com/luizirber/phd/releases

We're also working on a paper. Please let us know how to best link to your work - I was thinking of just citing this repository, but happy to do something else!

ctb commented 2 years ago

you might be interested in this conversation - https://twitter.com/krsahlin/status/1463169988689285125 - where a few related references are mentioned, and we converge on the name FracMinHash.