seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
408 stars 115 forks source link

Lots of differing bits if doc1 = prefix + doc2 #49

Closed lukasgebhard closed 4 years ago

lukasgebhard commented 4 years ago

I guess this is not desired:

a = 'a b c d e f g h i j'
b = 'x ' + a
c = a + ' x'

# Copied from tests.py
def compute(text):
        tokens = re.split(r'\W+', text.lower(), flags=re.UNICODE)
        shingles = [''.join(shingle) for shingle in
                    simhash.shingle(''.join(tokens), 4)]
        hashes = [simhash.unsigned_hash(s.encode('utf8')) for s in shingles]
        return simhash.compute(hashes)

print(simhash.num_differing_bits(compute(a), compute(b)))

outputs 27 although

print(simhash.num_differing_bits(compute(a), compute(c))

outputs 0.

lukasgebhard commented 4 years ago

I figured it out. To ignore the ordering of tokens, directly hash the tokens instead of their shingles:

def compute(text):
        tokens = re.split(r'\W+', text.lower(), flags=re.UNICODE)
        hashes = [simhash.unsigned_hash(t.encode('utf8')) for t in tokens]
        return simhash.compute(hashes)