serega / gaoya

Locality Sensitive Hashing
MIT License
67 stars 6 forks source link

Optimization Ideas For MinHash #17

Open keithing opened 1 year ago

keithing commented 1 year ago

Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code the share), but looked pretty deeply into optimizing minhash. Not sure if it's helpful or if you're already aware of these tricks, but I found them to significantly speed up my minhash algorithm (and they are fun to code :P).

1) One Permutation Hashing. Basically use a single hash function in clever way instead of the typical 128. Made a pretty big different in speed during my testing. 2) Densified One Permutation Hashing. A small tweak for handling the "empty bin" problem of one permutation hashing. 3) Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.

Caveat: Seems like you're working on Super Min Hash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.

serega commented 1 year ago

Thanks for your suggestions. I may have encountered some of these ideas in papers, but never took a deep look. Currently, I use MinHash for relatively short text documents (usually 50 - 200 characters), and generating min-hash signatures takes a small fraction of the whole pipeline. For bigger documents hashing can become a bottleneck. Memory usage is definitely an issue for me, which I mitigate by using smaller hashes (thanks to Rust generics this is easy). I should definitely look at Reservoir Sampling and One Permutation Hashing, (especially if they are fun to code :) )

jianshu93 commented 2 months ago

Hello All,

Just a followup, the most recently densification is called optimal densification: two ways to perform the densification are optimal (RMSE is the same with the original, much slower, many hash function implementation): http://proceedings.mlr.press/v70/shrivastava17a.html or http://proceedings.mlr.press/v115/mai20a.html. Implementation here: https://github.com/jean-pierreBoth/probminhash/blob/master/src/densminhash.rs

@serega I have some interesting idea to combine minhash with graphs to do very large scale search. Let me know if you are interested.

Jianshu