What steps will reproduce the problem?
1. Take two sparse vectors, A and B, with 100k dimensions each and the
following nonzero values:
A: 1 -> 10, 100000 -> 5
B: 2 -> 10, 100000 -> 5
2. Run Similarity.jaccardIndex(A, B)
3. Wait very patiently
4. See that the two vectors are treated as equivalent and score 1.0.
What is the expected output? What do you see instead?
The score should be 1/3: feature 100000 appears in both vectors, while feature 1
appears only in A and feature 2 only in B, so one of three distinct features is
shared. The Jaccard Index is an evaluation of feature sets, not an evaluation of
feature occurrences, so the frequency of an observation shouldn't matter, just
the existence of some observation.
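
For reference, here is a minimal sketch of the set-based behaviour I would expect. It is not the library's implementation: the class name JaccardSketch and the use of a Map<Integer, Double> as a stand-in for a sparse vector are my own assumptions for illustration. It only looks at which indices hold a nonzero value, so the work scales with the number of nonzero entries rather than with the 100k dimensions. Under the standard definition |A ∩ B| / |A ∪ B|, the example vectors share one feature (100000) out of three distinct features (1, 2, 100000).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a sparse vector is modelled as index -> value,
// which is an assumption and not the library's own vector type.
public class JaccardSketch {

    /** Collects the indices whose stored value is nonzero. */
    public static Set<Integer> nonZeroIndices(Map<Integer, Double> vector) {
        Set<Integer> indices = new HashSet<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            if (entry.getValue() != 0.0) {
                indices.add(entry.getKey());
            }
        }
        return indices;
    }

    /** Set-based Jaccard index: |A intersect B| / |A union B|, ignoring magnitudes. */
    public static double jaccardIndex(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> featuresA = nonZeroIndices(a);
        Set<Integer> featuresB = nonZeroIndices(b);

        Set<Integer> union = new HashSet<>(featuresA);
        union.addAll(featuresB);
        if (union.isEmpty()) {
            return 0.0; // assumed convention for two all-zero vectors
        }

        Set<Integer> intersection = new HashSet<>(featuresA);
        intersection.retainAll(featuresB);

        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // The two vectors from the reproduction steps.
        Map<Integer, Double> a = new HashMap<>();
        a.put(1, 10.0);
        a.put(100000, 5.0);

        Map<Integer, Double> b = new HashMap<>();
        b.put(2, 10.0);
        b.put(100000, 5.0);

        // One shared feature out of three distinct features.
        System.out.println(jaccardIndex(a, b));
    }
}
```

Changing the stored values (for example 10 to 3) leaves the result unchanged, which is the point of the report: only the presence of a feature should count.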
Original issue reported on code.google.com by FozzietheBeat@gmail.com on 20 Sep 2011 at 5:33