lukasgebhard closed this issue 4 years ago
I figured it out. To ignore the ordering of tokens, directly hash the tokens instead of their shingles:
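import re
import simhash

def compute(text):
    # Hash each token directly (no shingles), so token order does not matter.
    tokens = [t for t in re.split(r'\W+', text.lower(), flags=re.UNICODE) if t]
    hashes = [simhash.unsigned_hash(t.encode('utf8')) for t in tokens]
    return simhash.compute(hashes)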
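To see why hashing the tokens directly ignores their order, here is a minimal pure-Python sketch of the idea. It does not use the library's code: `hash64`, `simhash64`, `tokenize`, `token_fingerprint`, `shingle_fingerprint` and `differing_bits` are hypothetical helpers made up for illustration, and the MD5-based hash is only a stand-in for whatever hash the library uses. The fingerprint is a per-bit majority vote over the input hashes, so it depends only on which hashes go in, not on their order.

```python
import hashlib
import re


def hash64(token):
    """64-bit hash of a string (illustrative stand-in, not the library's hash)."""
    digest = hashlib.md5(token.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(hashes):
    """Textbook simhash: per-bit majority vote over the input hashes."""
    counts = [0] * 64
    for h in hashes:
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def tokenize(text):
    """Lowercase and split on non-word characters, dropping empty strings."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def token_fingerprint(text):
    """Order-insensitive: hash each token on its own."""
    return simhash64(hash64(t) for t in tokenize(text))


def shingle_fingerprint(text, width=2):
    """Order-sensitive: hash overlapping runs of `width` tokens."""
    tokens = tokenize(text)
    shingles = [" ".join(tokens[i:i + width])
                for i in range(len(tokens) - width + 1)]
    return simhash64(hash64(s) for s in shingles)


def differing_bits(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

With these helpers, `differing_bits(token_fingerprint("the quick brown fox"), token_fingerprint("fox brown quick the"))` is 0, because both texts contribute the same set of token hashes to the vote. The same comparison with `shingle_fingerprint` will generally be nonzero, since reordering the tokens changes the shingles themselves.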
I guess this is not desired:
… outputs
27
although … outputs
0.