mikeheddes / dothash

Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
https://arxiv.org/abs/2305.17310
GNU General Public License v3.0

benchmark with nodesketch? #2

Open · jianshu93 opened this issue 1 month ago

jianshu93 commented 1 month ago

Dear DotHash team,

Thanks for a nice paper. I have two questions, related to MinHash and to link prediction:

  1. Which MinHash was used: one permutation hashing with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html) or the original MinHash from Broder (1997)? The former is at least 10 times faster than the original, with exactly the same RMSE, which converges to 0 as the sketch size increases. DotHash can estimate the set intersection size without bias, but what about the cardinality of each individual set? We need those for Jaccard via the inclusion-exclusion rule (see the sketch after this list). Cardinality estimation based on HyperLogLog and its relatives has a Cramér-Rao lower bound that is much worse than MinHash's for the final Jaccard similarity estimate. Do you assume the cardinalities of sets A and B are known?

  2. NodeSketch (https://dl.acm.org/doi/abs/10.1145/3292500.3330951) uses ProbMinHash (https://ieeexplore.ieee.org/abstract/document/9185081; they called it HistoSketch, but did not realize that the MinHash variant they use approximates J_p rather than the weighted Jaccard) to compare node similarity while taking weights into account. This is also very fast and efficient for link prediction tasks. We have a Rust implementation here: https://github.com/jean-pierreBoth/graphembed
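
To make the inclusion-exclusion point in (1) concrete, here is a minimal sketch (the function name is mine, just for illustration): given an unbiased estimate of |A ∩ B| and the two cardinalities, Jaccard follows directly.

```python
def jaccard_from_intersection(inter_est: float, card_a: int, card_b: int) -> float:
    """Jaccard via inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union_est = card_a + card_b - inter_est
    return inter_est / union_est if union_est > 0 else 0.0

# Example: |A| = 100, |B| = 80, estimated |A ∩ B| = 40
print(jaccard_from_intersection(40.0, 100, 80))  # 40 / 140 ≈ 0.2857
```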

Thanks,

Jianshu

mikeheddes commented 1 month ago

Hi Jianshu,

Thank you for reaching out.

  1. For the MinHash implementation we used the original one by Broder. We later learned about the one permutation hashing method, which indeed gives a significant speed-up over the original MinHash. However, neither method is able to estimate metrics like Adamic-Adar and Resource Allocation, which can be estimated with our method (see the sketch after this list). And yes, we assume that the cardinality of both sets is known. This is because our motivation is to reduce the cost of the many pair-wise comparisons; any one-time cost per set is considered negligible.

  2. Thank you for bringing this work to our attention, it looks very interesting.
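
To illustrate the point about weighted metrics in (1), here is a minimal sketch of the idea (not our actual implementation; the hash-seeded scheme below is just for illustration): each element maps to a pseudo-random ±1/√d vector, a set's sketch is the sum of its elements' vectors, and the dot product of two sketches is an unbiased estimate of the intersection size. Weighting the element vectors lets the same dot product estimate weighted metrics.

```python
import numpy as np

def element_vector(x, d):
    """Pseudo-random +/-1/sqrt(d) vector for element x (toy hash-seeded scheme)."""
    rng = np.random.default_rng(hash(("demo", x)) & 0xFFFFFFFF)
    return rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)

def sketch(elements, d, weight=lambda x: 1.0):
    """Sum of weighted element vectors; the dot product of two sketches
    estimates sum over x in A∩B of weight_A(x) * weight_B(x)."""
    return sum(weight(x) * element_vector(x, d) for x in elements)

A, B = set(range(100)), set(range(60, 160))  # |A ∩ B| = 40
d = 4096
print(sketch(A, d) @ sketch(B, d))  # unbiased estimate of 40
# For Adamic-Adar, set weight(x) = 1/sqrt(log(deg(x))) in both sketches so the
# dot product estimates the sum of 1/log(deg(x)) over the intersection.
```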

Let me know if you have any other questions.

jianshu93 commented 1 month ago

Hi @mikeheddes,

The beauty of one permutation hashing (with appropriate densification) is that colliding hashes end up at the same position in the sketch vectors of the two sets, so a Hamming similarity can be computed extremely fast via bitwise XOR and population count; see our BinDash 2 paper (it was for comparing genomes, but the idea is the same). Overall we saw a 10x to 50x speedup compared to the original approach, where a priority queue is needed to sort the hashes and two for loops to find the common hashes between sketches.
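
To make this concrete, a toy illustration of the 1-bit version of the trick (BinDash uses b-bit fingerprints and a packed layout; this is much simplified): keep one bit per sketch position, pack the bits into a word, and count agreeing positions with XOR plus popcount.

```python
def pack_low_bits(sketch_vals):
    """Keep the lowest bit of each MinHash value, packed into one Python int."""
    fp = 0
    for i, v in enumerate(sketch_vals):
        fp |= (v & 1) << i
    return fp

def matching_positions(fp_a, fp_b, k):
    """Positions where the 1-bit fingerprints agree: XOR, then population count."""
    return k - (fp_a ^ fp_b).bit_count()  # int.bit_count() needs Python >= 3.10

# Two k = 8 sketches whose low bits agree at 6 of the 8 positions
a = pack_low_bits([3, 7, 2, 9, 4, 4, 1, 8])
b = pack_low_bits([3, 6, 2, 9, 5, 4, 1, 8])
print(matching_positions(a, b, 8))  # 6
```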

Jianshu