Open jianshu93 opened 1 year ago
I also think all errors should be expressed as the RMSE sqrt(J(1-J)/s) rather than sqrt(1/s) (a convenient simplification, but not the true RMSE), because for a small Jaccard index such as 0.1 the two can differ by a factor of ~sqrt(10).
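A minimal sketch of that comparison, assuming the estimate is the mean of s independent collision indicators, each equal to 1 with probability J (the classic s-hash MinHash setting; bottom-s and densified variants will deviate somewhat). The function names `rmse_exact` and `rmse_rough` are only illustrative:

```python
import math


def rmse_exact(j: float, s: int) -> float:
    """RMSE of the Jaccard estimate under the binomial model: sqrt(J(1-J)/s)."""
    return math.sqrt(j * (1.0 - j) / s)


def rmse_rough(s: int) -> float:
    """The convenient simplification sqrt(1/s), which does not depend on J."""
    return math.sqrt(1.0 / s)


if __name__ == "__main__":
    s = 1000  # sketch size, chosen arbitrarily for illustration
    for j in (0.1, 0.5, 0.9):
        exact = rmse_exact(j, s)
        rough = rmse_rough(s)
        print(f"J={j:.1f}  sqrt(J(1-J)/s)={exact:.5f}  "
              f"sqrt(1/s)={rough:.5f}  ratio={rough / exact:.2f}")
```

The ratio between the two is 1/sqrt(J(1-J)) and does not depend on s. At J = 0.1 it is 1/0.3, about 3.3, which is close to sqrt(10); at J = 0.5 it drops to its minimum of 2.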
Jianshu
Hi @jianshu93, thanks for the corrections and pointers. I've uploaded a new version of the slides with small corrections, followed by notes highlighting your points. Tell me what you think! Best!
Hello @kamimrcht,
Thanks for the update, it looks nice this time. There are also several newer papers related to MinHash. C-MinHash (https://proceedings.mlr.press/v162/li22m.html), a successor of densified MinHash, is proved to have smaller variance than the best densified MinHash. However, the second permutation, which generates hash values for empty bins, cannot be replaced by random hashing. A permutation vector of size D/K therefore has to be kept in memory and applied K times (I follow the notation of the C-MinHash paper, where D is the dimension of the original data and K is the number of bins in the sketch vector). It is computationally equivalent to the optimal densification paper (e.g., the outer loop runs over the number of empty bins) but has smaller variance, even smaller than classic MinHash, which is a very surprising theoretical result. Memory may become a problem when the dataset dimension (number of k-mers) is very large: with a dimension of 2^40 and a sketch size of 2^15, we have to keep a large permutation vector of size 2^25, which can take 10 GB or more of memory. For microbial genomes this seems to be fine, since the dimension (number of k-mers) is never larger than 10^8. I am in the process of implementing C-MinHash, and it is interesting to see that there are still improvements even today.
Thanks,
Jianshu
Thanks again. I'll keep an eye on this one. Looking forward to your improvements! Best
Dear Camille,
Many thanks for making the "Sketching in sequence bioinformatics: methods and applications" slides open source. I have several questions related to bottom-s MinHash, One Permutation MinHash with optimal densification, and scaled MinHash (or containment MinHash):
GSearch_Densified_MinHash_RMSE.pdf
Looking forward to your reply.
Thanks,
Jianshu