oertl / probminhash

ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

question on J_N and J_p #2

Closed. jianshu93 closed this issue 7 months ago.

jianshu93 commented 1 year ago

Hello Otmar,

I have a question about J_N and J_p. Since both are scale-invariant, and J_N is the normalized weighted Jaccard similarity, is J_p (assuming P-MinHash is used) closer to J (unweighted, no normalization) or to J_N for a given set in which each element has a weight? I imagine J and J_N are not the same for such a set. If my goal is to use J_N to compare 2 sets, will J_p give a similar answer?

Thanks,

Jianshu

oertl commented 1 year ago

Hi Jianshu,

J_N and J_p are different similarity measures for weighted sets. This is similar to comparing cosine similarity or Jaccard similarity for unweighted sets. Both behave similarly in some ways. For disjoint sets the similarity is 0 and for identical sets it is 1. The work of Moulton that introduced J_p (Maximally Consistent Sampling and the Jaccard Index of Probability Distributions) compared J_N (called J_W in the paper) and J_p. Theorem IV.5 describes an inequality relating the two similarity measures. The paper also shows that J_p is Pareto-optimal, which is a property that J_N does not have.
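Roughly, writing w_A(d) and w_B(d) for the nonnegative weights of element d in A and B (the notation in the two papers differs slightly), the two measures look like this:

```latex
% Weighted Jaccard similarity J_W; J_N is J_W applied to the
% weights after normalizing them to sum to 1 for each set.
J_W(A,B) = \frac{\sum_d \min\bigl(w_A(d), w_B(d)\bigr)}
                {\sum_d \max\bigl(w_A(d), w_B(d)\bigr)}

% Probability Jaccard similarity J_p (Moulton's paper), summing
% over elements that occur in both sets.
J_p(A,B) = \sum_{d:\, w_A(d)>0,\; w_B(d)>0}
           \Biggl(\sum_{d'} \max\Bigl(\tfrac{w_A(d')}{w_A(d)},
                                       \tfrac{w_B(d')}{w_B(d)}\Bigr)\Biggr)^{-1}
```

J_N is simply J_W applied to the weights after normalizing them to sum to 1 for each set, whereas J_p is scale-invariant by construction, since only ratios of weights within the same set appear.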

So if you need a scale-invariant similarity measure for weighted sets, I would rather use J_p, since hashing for J_p with ProbMinHash is easier and faster than hashing for J_N (or the weighted Jaccard similarity) with modern algorithms like BagMinHash, TreeMinHash, or DartMinHash.
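Independent of the concrete implementation, comparing two signatures works the same way for all of these algorithms: the similarity is estimated as the fraction of matching signature components. A minimal sketch (the function name is only illustrative, not an API of this repository):

```cpp
#include <cstdint>
#include <vector>

// Estimates the similarity (J_p, J_N, or J_W, depending on which algorithm
// produced the signatures) as the fraction of equal components.
// Assumes both signatures have the same nonzero length m.
double estimateSimilarity(const std::vector<uint64_t>& sigA,
                          const std::vector<uint64_t>& sigB) {
    if (sigA.empty() || sigA.size() != sigB.size()) return 0.0;
    std::size_t equal = 0;
    for (std::size_t i = 0; i < sigA.size(); ++i) {
        if (sigA[i] == sigB[i]) ++equal;
    }
    return static_cast<double>(equal) / sigA.size();
}
```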

Best regards,

Otmar

jianshu93 commented 1 year ago

Hello Otmar,

Thanks for the quick response and the papers. I am very interested in the behavior of J_N and J_p; as you said, they behave similarly. Suppose there are 3 sets A, B, and C with J_N(A,B)=0.2, J_N(B,C)=0.3, and J_N(A,C)=0.4. If we correlate these 3 J_N values with the corresponding 3 J_p values for A, B, and C, will the Spearman rank correlation rho be 1? That is, do J_N and J_p, within the range [0,1], have similar rates of change, so that the ranks do not change? Let me know if I am not clear. This is quite important for nearest neighbor search: if the rank property holds, we can simply replace J_N with J_p; if it does not, we might have problems.

Thanks,

Jianshu

oertl commented 1 year ago

Hello Jianshu,

are you asking whether (1 - J_p) is a metric satisfying the triangle inequality? If that was your question, then yes, see Theorem IV.6 in Moulton's paper. I'm not sure what other property you need for nearest neighbor search.
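In other words, as I read Theorem IV.6, the distance d_p(A,B) = 1 - J_p(A,B) satisfies:

```latex
d_p(A,B) = 1 - J_p(A,B), \qquad
d_p(A,C) \le d_p(A,B) + d_p(B,C)
```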

Best

Otmar

jianshu93 commented 1 year ago

Hello Otmar,

Yes! Thanks for the info. J_N and J_p are very clear to me now. I saw a reimplementation of ProbMinHash and SuperMinHash in Rust: https://github.com/jean-pierreBoth/probminhash. It seems that if I want to use J_W, DartMinHash is the fastest right now. Since DartMinHash targets J_W (not normalized), it should be easy to adapt it to J_N by normalizing by the total weight of each set when the total weights differ, right?

Thanks,

Jianshu

oertl commented 1 year ago

Hi Jianshu,

yes, you can easily apply any algorithm for J_W to J_N by simply normalizing the weights first. I compared the performance of TreeMinHash vs. DartMinHash for various cases here. The case where all weights are exponentially distributed with rate n (w(d) ~ Exp(n)) is approximately the same as the case with normalized weights, since the expected value of w(d_1) + w(d_2) + ... + w(d_n) is equal to 1. DartMinHash is significantly faster for this particular case and when the set size exceeds the signature size m. Maybe TreeMinHash could be further optimized to come closer to DartMinHash's performance in this case as well.
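A minimal sketch of that normalization step, assuming the weighted set is represented as (id, weight) pairs; after this, any sketcher for J_W effectively estimates J_N:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Normalizes the weights of a weighted set in place so that they sum to 1.
// Afterwards, any sketching algorithm for J_W (BagMinHash, TreeMinHash,
// DartMinHash, ...) estimates J_N.
void normalizeWeights(std::vector<std::pair<uint64_t, double>>& weightedSet) {
    double total = 0.0;
    for (const auto& entry : weightedSet) total += entry.second;
    if (total <= 0.0) return;  // nothing to normalize
    for (auto& entry : weightedSet) entry.second /= total;
}
```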

Otmar

jianshu93 commented 1 year ago

Hello Otmar,

Many thanks! This is very helpful. Another question, or rather a real-world problem we have with weighted MinHash, is memory: in the weighted case every element of the set has to be counted to obtain its weight (or relative frequency), which in our experience with real-world datasets is significantly more memory-expensive than the unweighted case. Specifically for genomes, there can be billions of genomes and thus billions of k-mers (substrings of a genome, with k typically 16-32), each with a weight, and we want to hash all of those weighted k-mers at once. The streaming idea from HistoSketch and D2HistoSketch seems to provide a solution, but I am not sure what it looks like in real implementations. I am just wondering how DartMinHash, TreeMinHash, or ProbMinHash deal with the memory problem when there are so many set elements with weights.

Thanks,

Jianshu

dnbaker commented 1 year ago

In our Dashing 2 work, we used a count sketch or count-min sketch with a single table to feature-hash for approximate weighted set sketching (i.e., to get weighted set sketching without building the full k-mer frequency table).

With a large enough table size, you can get very little loss in accuracy while bounding your memory requirements.

That's the best solution of which I'm aware to this problem.

https://github.com/dnbaker/dashing2/blob/main/src/counter.h

This solution does exact counting by default but falls back to feature hashing if given a memory limit.
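A minimal sketch of the idea (not the actual Dashing 2 code, just an illustration of a single-table counter that maps k-mer hashes into a fixed number of buckets so memory stays bounded regardless of how many distinct k-mers occur):

```cpp
#include <cstdint>
#include <vector>

// Illustrative single-table count-min-style counter: approximate counts for
// arbitrarily many distinct k-mers with a fixed memory footprint.
// With a single table, collisions can only over-estimate a k-mer's count.
class ApproxCounter {
public:
    explicit ApproxCounter(std::size_t tableSize) : table_(tableSize, 0) {}

    void add(uint64_t kmerHash, uint32_t count = 1) {
        table_[kmerHash % table_.size()] += count;
    }

    uint32_t estimate(uint64_t kmerHash) const {
        return table_[kmerHash % table_.size()];
    }

private:
    std::vector<uint32_t> table_;
};
```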

oertl commented 1 year ago

@jianshu93

> The streaming idea from HistoSketch and D2HistoSketch seems to provide a solution, but I am not sure what it looks like in real implementations.

Please be aware that these algorithms can be used for J_p, but not for J_N as their paper incorrectly claims.

If the data is free of duplicates, ProbMinHash for J_P and BagMinHash for J_N can be used in a streaming fashion. However, if there are duplicates, you must first deduplicate and aggregate the corresponding weights, which requires at least a few additional passes over the data. (For example, using Merge Sort if the data does not fit completely in memory.) If you apply these algorithms without deduplication, the effective weight of a multi-occurring element is the maximum of all the individual weights, which is most likely not what you want. (Typically, the sum of the weights of equal elements is more appropriate.)
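A minimal sketch of that aggregation step, assuming the data fits in memory and that summing the weights of equal elements is the desired semantics:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Deduplicates a stream of (element, weight) pairs by summing the weights of
// equal elements before the weighted set is passed to ProbMinHash/BagMinHash.
std::vector<std::pair<uint64_t, double>>
aggregateWeights(const std::vector<std::pair<uint64_t, double>>& stream) {
    std::unordered_map<uint64_t, double> totals;
    for (const auto& [element, weight] : stream) totals[element] += weight;
    return {totals.begin(), totals.end()};
}
```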

jianshu93 commented 1 year ago

@oertl This is clearly stated in your ProbMinHash paper, and I am well aware that HistoSketch targets J_p rather than J_N. The problem of deduplication is also clear to me. Thanks for the detailed responses; I think I will try to implement the HistoSketch idea and see what happens.

@dnbaker Hello Daniel, I was studying how Dashing 2 achieves this goal for ProbMinHash, and here you are. Without reading the code I was not sure whether the count sketch / count-min sketch option was used for this, so this is very helpful, especially for metagenomes: if we want to use weighted MinHash, memory grows very fast even for 20 or so metagenomes.