onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License

bottom-k implementation #68

Closed · jianshu93 closed this issue 7 months ago

jianshu93 commented 8 months ago

Dear Finch team,

Just to confirm that Finch uses exactly the bottom-k implementation of MinHash (only one hash function), not the expensive original MinHash (which uses many hash functions).

Thanks, Jianshu

boydgreenfield commented 8 months ago

@jianshu93 – The implementation follows the Mash paper, with the caveat that Finch does not use 42 as the murmurhash3 seed by default, while Mash does (see this issue and our Mash-compatible code here). In both cases, the hash function is called only once per k-mer.
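For illustration only (not Finch's actual code), here is a minimal bottom-k sketcher in Rust. It assumes std's `DefaultHasher` as a stand-in for murmurhash3 and ignores canonical k-mers and deduplication of repeated k-mers:

```rust
use std::collections::BinaryHeap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a k-mer once; DefaultHasher stands in for murmurhash3 here.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// Bottom-k sketch: a single hash per k-mer, keeping only the
/// `sketch_size` smallest hash values. A max-heap lets us evict the
/// current largest value whenever a smaller one arrives.
fn bottom_k_sketch(seq: &[u8], k: usize, sketch_size: usize) -> Vec<u64> {
    let mut heap: BinaryHeap<u64> = BinaryHeap::with_capacity(sketch_size + 1);
    for kmer in seq.windows(k) {
        let h = hash_kmer(kmer);
        if heap.len() < sketch_size {
            heap.push(h);
        } else if h < *heap.peek().unwrap() {
            heap.pop();
            heap.push(h);
        }
    }
    heap.into_sorted_vec() // the k smallest hashes, ascending
}
```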

Does that answer your question?

jianshu93 commented 8 months ago

Hello @boydgreenfield,

Thanks for confirming; that is very clear. I have been implementing some other interesting hash functions designed specifically for MinHash, namely mixed tabulation hashing (https://ieeexplore.ieee.org/abstract/document/7354457).

The reason is that the bottom-k implementation of MinHash (one hash function per element, keeping only the smallest hash values via sorting for the later Jaccard computation) is not a locality-sensitive hashing scheme: the hashes in sketches A and B are not aligned, meaning hash values of similar elements/k-mers do not end up at similar positions in the sketch vector after sorting (see also the paper above). This can lead to large bias in nearest-neighbor search (e.g., finding the closest genome in a database of genomes based on Jaccard/ANI via the Mash equation); see http://proceedings.mlr.press/v70/shrivastava17a.html. The one-permutation hashing discussed in that paper is essentially the k-partitioning idea from the mixed tabulation paper, but using a random permutation or a generic random hash function (e.g., murmurhash3) rather than a specially designed one.

Since the sketch vector may contain empty bins, a densification step after one-permutation hashing / k-partitioning is needed to improve the RMSE, which is proven to be smaller than that of classic MinHash or bottom-k MinHash. It is also much faster than bottom-k: O(n + m) for k-partitioning / one-permutation hashing plus densification versus O(n * log(m)) for bottom-k. With n around 10^6 (bacterial genomes) and m around 10^3 to 10^4, the former is at least 3 times faster than the latter, and it is locality-sensitive, so it is theoretically guaranteed to find nearest neighbors in a database. If mixed tabulation hashing is used, the hash values concentrate even better in the sketch vector than with plain random hashing, and densification can then further improve the RMSE of the Jaccard estimate (compared to the true Jaccard). Both k-partitioning and one-permutation hashing require only one hash function and no sorting (e.g., via a priority queue), and the densification step depends only on the sketch size m, not on n (the number of elements in the original dataset). The original Broder 1998 scheme, which uses many hash functions, is locality-sensitive but too slow.

I just wanted to hear the Finch team's thoughts on these recent MinHash breakthroughs; a rough sketch of the idea is below.
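To make the k-partitioning / one-permutation idea concrete, here is a rough Rust sketch. It assumes std's `DefaultHasher` as a stand-in for murmurhash3 or mixed tabulation, binning by `h mod m`, and a simplified rotation-style densification rather than the optimal densification from the paper:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One hash call per k-mer; DefaultHasher stands in for murmurhash3
/// or mixed tabulation hashing.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// One-permutation / k-partitioning sketch: route each hash into one of
/// `m` bins, keep the minimum per bin, then densify by filling empty
/// bins from the nearest non-empty bin to the right (wrapping around).
/// O(n + m) overall, with no sorting or priority queue.
fn oph_sketch(seq: &[u8], k: usize, m: usize) -> Vec<u64> {
    let mut bins: Vec<Option<u64>> = vec![None; m];
    for kmer in seq.windows(k) {
        let h = hash_kmer(kmer);
        let b = (h % m as u64) as usize; // which partition/bin
        let v = h / m as u64;            // value compared within the bin
        if bins[b].map_or(true, |cur| v < cur) {
            bins[b] = Some(v);
        }
    }
    // Simplified rotation-style densification.
    let mut sketch = vec![0u64; m];
    for i in 0..m {
        let mut j = i;
        loop {
            if let Some(v) = bins[j] {
                sketch[i] = v;
                break;
            }
            j = (j + 1) % m;
            if j == i {
                break; // all bins empty (degenerate input)
            }
        }
    }
    sketch
}

/// Because the bins are aligned across sketches, the Jaccard index is
/// estimated as the fraction of positions where the two sketches agree.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```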

Thanks,

Jianshu