Jaccard Index is both slow and incorrect

What steps will reproduce the problem?
1. Take two sparse vectors, A and B, with 100k dimensions each and the 
following nonzero values:
A: 1 -> 10, 100000 -> 5
B: 2 -> 10, 100000 -> 5
2. Run Similarity.jaccardIndex(A, B)
3. Wait very patiently
4. See that they are reported as equivalent, with a score of 1.0.

What is the expected output? What do you see instead?
The score should be 1/3: the vectors share one feature (100000) out of three 
distinct features (1, 2, and 100000), since feature 1 appeared only in A and 
feature 2 appeared only in B.  The Jaccard Index is an evaluation of feature 
sets, not an evaluation of feature occurrences, so the frequency of an 
observation shouldn't matter, just the existence of some observation.
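
To make the expected behaviour concrete, here is a minimal sketch of a set-based Jaccard index, |A ∩ B| / |A ∪ B|, computed over the non-zero indices only. It is not the library's implementation: the class name SparseJaccard, the int[] representation of non-zero indices, and the choice to return 0.0 when both vectors are empty are all assumptions made for illustration. For the vectors A and B in this report, that is one shared feature out of three distinct ones.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the library's API.
final class SparseJaccard {

    private SparseJaccard() { }

    /**
     * Set-based Jaccard index over the non-zero indices of two sparse
     * vectors. The stored values are ignored; only which dimensions are
     * non-zero matters.
     */
    static double jaccardIndex(int[] nonZeroA, int[] nonZeroB) {
        Set<Integer> featuresA = new HashSet<>();
        for (int index : nonZeroA) {
            featuresA.add(index);
        }
        Set<Integer> featuresB = new HashSet<>();
        for (int index : nonZeroB) {
            featuresB.add(index);
        }

        // Assumed convention: two empty feature sets score 0.0.
        if (featuresA.isEmpty() && featuresB.isEmpty()) {
            return 0.0;
        }

        // Count features present in both sets.
        int intersection = 0;
        for (Integer index : featuresB) {
            if (featuresA.contains(index)) {
                intersection++;
            }
        }

        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int union = featuresA.size() + featuresB.size() - intersection;
        return (double) intersection / union;
    }
}
```

Calling SparseJaccard.jaccardIndex(new int[] {1, 100000}, new int[] {2, 100000}) yields 1/3. Because the work is proportional to the number of non-zero entries rather than to the 100k dimensions, a sketch like this should also avoid the long wait described in the steps.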

Original issue reported on code.google.com by FozzietheBeat@gmail.com on 20 Sep 2011 at 5:33
