Lots of differing bits if doc1 = prefix + doc2

I guess this is not desired:

a = 'a b c d e f g h i j'
b = 'x ' + a
c = a + ' x'

# Copied from tests.py
def compute(text):
        tokens = re.split(r'\W+', text.lower(), flags=re.UNICODE)
        shingles = [''.join(shingle) for shingle in
                    simhash.shingle(''.join(tokens), 4)]
        hashes = [simhash.unsigned_hash(s.encode('utf8')) for s in shingles]
        return simhash.compute(hashes)

print(simhash.num_differing_bits(compute(a), compute(b)))

outputs 27 although

print(simhash.num_differing_bits(compute(a), compute(c))

outputs 0.

seomoz / simhash-py

Lots of differing bits if doc1 = prefix + doc2 #49