Open keithing opened 1 year ago
Thanks for your suggestions. I may have encountered some of these ideas in papers, but never took a deep look. Currently, I use MinHash for relatively short text documents (usually 50 - 200 characters), and generating min-hash signatures takes a small fraction of the whole pipeline. For bigger documents hashing can become a bottleneck. Memory usage is definitely an issue for me, which I mitigate by using smaller hashes (thanks to Rust generics this is easy). I should definitely look at Reservoir Sampling and One Permutation Hashing, (especially if they are fun to code :) )
Hello All,
Just a followup, the most recently densification is called optimal densification: two ways to perform the densification are optimal (RMSE is the same with the original, much slower, many hash function implementation): http://proceedings.mlr.press/v70/shrivastava17a.html or http://proceedings.mlr.press/v115/mai20a.html. Implementation here: https://github.com/jean-pierreBoth/probminhash/blob/master/src/densminhash.rs
@serega I have some interesting idea to combine minhash with graphs to do very large scale search. Let me know if you are interested.
Jianshu
Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code the share), but looked pretty deeply into optimizing minhash. Not sure if it's helpful or if you're already aware of these tricks, but I found them to significantly speed up my minhash algorithm (and they are fun to code :P).
1) One Permutation Hashing. Basically use a single hash function in clever way instead of the typical 128. Made a pretty big different in speed during my testing. 2) Densified One Permutation Hashing. A small tweak for handling the "empty bin" problem of one permutation hashing. 3) Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.
Caveat: Seems like you're working on Super Min Hash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.