seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
406 stars 115 forks source link

find_all() returns wrong results #22

Closed ghost closed 7 years ago

ghost commented 8 years ago
import simhash

corpus = simhash.Corpus(6, 3)
corpus.insert(1)
corpus.insert(2)
corpus.find_all(2)

Running this code returns: [2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L]

dlecocq commented 8 years ago

This is actually the current intended behavior. What happens in find_all is that it walks through all the tables for the block/bit-difference configuration (there are blocks C bit-difference tables), and for each table finds all matches within the bit-difference.

For the configuration (6, 3), there are 6 C 3 = 20 tables, which means that we actually expect to find identical matches exactly 20 times:

corpus = simhash.Corpus(6, 3)
corpus.insert(2)
# This comes out to be 20
len(corpus.find_all(2))

In the original example you provided, we find exactly 20 entries for 2, and another 10 for 1. Since they differ by exactly one bit, only one of the six blocks will contain the differing bit, so most tables will turn up a map for that query.

Does that help to make sense of it?

ghost commented 8 years ago

Thank you for the explanation. I understand why it happens, but why is this the intended behavior?

dlecocq commented 8 years ago

I guess when I say it's the current intended behavior, I mean that it is working as expected :-)

I would welcome a PR to make it return a set

b4hand commented 7 years ago

It appears that simhash.Corpus has been removed, so I'm guessing this issue is a "Won't Fix".