partition MinHash (or One Permutation) and bottom-s MinHash #4

Open jianshu93 opened 8 months ago

jianshu93 commented 8 months ago

Dear Camille,

Many thanks for making the "Sketching in sequence bioinformatics: methods and applications" slides open source. I have several questions and comments related to bottom-s MinHash, One Permutation MinHash with optimal densification, and scaled MinHash (or containment MinHash):

  1. The partition MinHash you mention (called One Permutation MinHash with optimal densification in the literature) should have running time O(n+s) instead of O(n log(s)) (see Shrivastava 2017); the latter is much slower because n is always much larger than s (by at least several orders of magnitude). See the sketch after this list.
  2. LSH property. Both the original MinHash and One Permutation MinHash with optimal densification have been proven to have the LSH property, or alignment property, theoretically (Shrivastava 2017). However, this has not been proved for the bottom-s implementation; think about it: it is equivalent to sampling without replacement and thus cannot retain the LSH property, despite having the same variance as classic MinHash. Details about the alignment property can be found in the 2017 FOCS paper (https://ieeexplore.ieee.org/abstract/document/8104099) and this paper (http://proceedings.mlr.press/v115/mai20a.html). Therefore, bottom-s MinHash cannot be used in a nearest-neighbor-search context, because it is the LSH property that theoretically guarantees the correct nearest neighbors can be found.
  3. Containment MinHash. The disadvantage of the bottom-s implementation in Mash (sampling without replacement) is the reason for the set-size bias; classic MinHash and One Permutation & densified MinHash do not have this problem. For containment MinHash, if we want to estimate the Jaccard index from the containment Jaccard of (A,B) or (B,A), we need the cardinality of set A or B (the number of unique k-mers); see the conversion sketch after this list. That is another challenge for large sets (e.g., genomes) if a fast method like HyperLogLog is used (see the discussion here: https://github.com/KoslickiLab/mutation-rate-ci-calculator/issues/13), because it can have much larger variance than MinHash (see this PODS paper: https://dl.acm.org/doi/abs/10.1145/3584372.3588680, Table 1). The commonly used variant is the harmonic mean of LogLog, called HyperLogLog, which has an RMSE of sqrt(1.07944/m), where m is the sketch size; this is much less accurate for small Jaccard, and it is not LSH. The scaled MinHash paper uses true cardinalities from exact methods for its variance analysis, which is not practical for genomes with ~10^7 k-mers (huge memory and slow); we have to use HyperLogLog-like estimators for much smaller memory and also speed. See a number of newly invented cardinality estimators with even better space requirements than HyperLogLog (https://arxiv.org/abs/2308.16862), though they only get closer to the Cramér-Rao lower bound and never reach it.
  4. CMash & scaled MinHash, similar to that of bottom-s MinHash, lose the LSH property because (see discussion above) J (A,B) is not equal to J (B, A) and no alignment property.
  5. Our implementation of One Permutation MinHash with optimal densification (Shrivastava 2017) and faster densification (http://proceedings.mlr.press/v115/mai20a.html) (will be available soon), compared against classic MinHash, shows that the variance can be larger depending on the gap between the set size n and the sketch size s (or m). See the attached plot (n is 300,000; we use RMSE instead of error, so classic MinHash is sqrt(J(1-J)/m)). The implementation is identical to the BinDash implementation, which follows Shrivastava 2017.
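For reference on points 1 and 5, here is a minimal Python toy of One Permutation MinHash with a densification pass (my own illustrative assumptions about the hash function and the probing scheme, not the BinDash or GSearch code): each item is hashed exactly once into one of s bins, and empty bins then borrow a value from randomly probed non-empty bins, which is what gives the O(n+s) behaviour.

```python
import hashlib

def hash64(x, seed=0):
    # Stable 64-bit hash of an arbitrary item (assumption: any good 64-bit hash works here).
    h = hashlib.blake2b(repr((seed, x)).encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little")

def one_perm_minhash_densified(items, s=128):
    """One Permutation MinHash: hash each item once, split the hash range into s bins,
    keep the minimum per bin, then fill empty bins by probing random non-empty bins
    (densification in the spirit of Shrivastava 2017). Illustrative sketch only."""
    sketch = [None] * s
    full = 1 << 64
    for x in items:                          # O(n): one hash per item
        h = hash64(x)
        b = h * s // full                    # which of the s equal-width bins h falls into
        if sketch[b] is None or h < sketch[b]:
            sketch[b] = h
    if all(v is None for v in sketch):
        raise ValueError("empty input set")
    base = list(sketch)                      # snapshot of the originally non-empty bins
    for j in range(s):                       # densification: roughly O(s) expected extra work
        if base[j] is None:
            attempt = 0
            while sketch[j] is None:
                k = hash64((j, attempt), seed=1) % s   # universal-hash probe of another bin
                if base[k] is not None:
                    sketch[j] = base[k]      # borrow the minimum of a non-empty bin
                attempt += 1
    return sketch

def jaccard_estimate(sk_a, sk_b):
    # Fraction of matching bins estimates the Jaccard index.
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

# Toy usage with k-mers as strings; true J(A, B) = 150/450 ≈ 0.33.
A = {f"kmer{i}" for i in range(0, 300)}
B = {f"kmer{i}" for i in range(150, 450)}
print(jaccard_estimate(one_perm_minhash_densified(A), one_perm_minhash_densified(B)))
```

Note the single pass over the n items plus the probing pass over the s bins, which is where the O(n+s) comes from.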
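And the conversion that point 3 relies on, spelled out (cardinalities are exact here for illustration; in practice they would come from an estimator such as HyperLogLog, whose own error then propagates into the Jaccard estimate):

```python
def jaccard_from_containment(c_ab, card_a, card_b):
    """Convert the containment C(A,B) = |A∩B| / |A| into a Jaccard index,
    given the cardinalities |A| and |B|:
    |A∩B| = C(A,B)·|A| and |A∪B| = |A| + |B| - |A∩B|."""
    intersection = c_ab * card_a
    union = card_a + card_b - intersection
    return intersection / union

# Toy check: A = {1..80}, B = {41..140}, so |A∩B| = 40 and |A∪B| = 140.
print(jaccard_from_containment(40 / 80, 80, 100))  # 40/140 ≈ 0.2857
```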

GSearch_Densified_MinHash_RMSE.pdf

Looking forward to your reply.

Thanks,

Jianshu

jianshu93 commented 8 months ago

And I think all errors should be expressed as the RMSE sqrt(J(1-J)/s), not sqrt(1/s) (a convenient simplification but not the true RMSE), because for a small Jaccard like 0.1 the two can differ by a factor of ~sqrt(10) in terms of RMSE.
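To make the factor concrete, a quick check (plain arithmetic, any sketch size s):

```python
import math

def rmse_true(j, s):
    # RMSE of the MinHash Jaccard estimator for sketch size s: sqrt(J(1-J)/s).
    return math.sqrt(j * (1 - j) / s)

def rmse_simplified(s):
    # The convenient but loose sqrt(1/s).
    return math.sqrt(1 / s)

j, s = 0.1, 1024
print(rmse_true(j, s))                       # ≈ 0.0094
print(rmse_simplified(s))                    # ≈ 0.0313
print(rmse_simplified(s) / rmse_true(j, s))  # ≈ 3.33, close to sqrt(10) ≈ 3.16
```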

Jianshu

kamimrcht commented 7 months ago

Hi @jianshu93, thanks for the corrections and pointers. I've uploaded a new version of the slides with small corrections, followed by notes highlighting your points. Tell me what you think! Best!

jianshu93 commented 7 months ago

Hello @kamimrcht,

Thanks for the update; it looks nice this time. Just a pointer to another new paper related to MinHash: C-MinHash (https://proceedings.mlr.press/v162/li22m.html), a successor of densified MinHash, is proven to have smaller variance than the best densified MinHash. However, the second permutation used to generate hash values for empty bins cannot be replaced by random hashing, so a permutation vector of size D/K (I follow the notation of the C-MinHash paper, where D is the dimension of the original data and K is the number of bins in the sketch vector) has to be kept in memory and run K times. It is computationally equivalent to the optimal densification paper (e.g., the outer loop is over the number of empty bins) but with smaller variance, even smaller than classic MinHash, which is a very surprising theoretical result. There might be memory problems when the dataset dimension (number of k-mers) is very large, e.g. 2^40 with sketch size 2^15: we would have to keep a large permutation vector of size 2^25, which can take 10 GB or more of memory. But for microbial genomes this seems to be fine, since the dimension (number of k-mers) is never larger than 10^8. I am in the process of implementing C-MinHash, so it is interesting to see that there are still improvements even today.
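In case it helps, a brute-force toy of the circulant idea as I read the paper (an assumption-laden illustration with a tiny D, nothing like the genome-scale setting above, and not the implementation I am working on): a single permutation pi is stored and re-used with a circular shift per sketch slot, after an initial permutation sigma that breaks structure in the data; a densified variant would only redo this work for the empty bins.

```python
import random

def c_minhash(items, D, K, seed=0):
    """Toy C-MinHash-style sketch: one stored permutation pi over the D dimensions,
    re-used with a circular shift of k for the k-th of the K sketch values, plus an
    initial permutation sigma. Brute-force illustration only (one pass per slot)."""
    rng = random.Random(seed)
    sigma = list(range(D)); rng.shuffle(sigma)   # initial permutation of the dimensions
    pi = list(range(D)); rng.shuffle(pi)         # the single permutation that is re-used
    sketch = []
    for k in range(K):
        # k-th value: minimum of the k-shifted permutation over the set
        sketch.append(min(pi[(sigma[i] + k) % D] for i in items))
    return sketch

def jaccard_estimate(sk_a, sk_b):
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

# Toy usage with items encoded as integers in [0, D); true J(A, B) = 300/900 ≈ 0.33.
D, K = 1024, 256
A = set(range(0, 600))
B = set(range(300, 900))
print(jaccard_estimate(c_minhash(A, D, K), c_minhash(B, D, K)))
```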

Thanks,

Jianshu

kamimrcht commented 7 months ago

Thanks again. I'll keep an eye on this one. Looking forward to your improvements! Best