seomoz / simhash-py

Simhash and near-duplicate detection
MIT License

output of simhash.compute method #46

Closed by bikashg 6 years ago

bikashg commented 6 years ago

I printed the output of the simhash.compute() method, both its type and its value. I noticed that the type is integer and the value is a 19-digit number (e.g. 8550830854347186281). Shouldn't it be a 64-digit fingerprint consisting of only 0s and 1s?

dlecocq commented 6 years ago

Yep. It's just the integer representation of the fingerprint:

>>> bin(8550830854347186281)
'0b111011010101010101001110101101110010111110100000110000001101001'

bikashg commented 6 years ago

Thanks for the reply. So, the program internally uses the binary stream (for matching) but displays the integer for printing purposes? Also, please help me understand the association between the 64-bit binary and the 19-digit integer.

dlecocq commented 6 years ago

Internally, the fingerprints are stored as a uint64_t - an unsigned 64-bit integer. These integers are compared to one another when identifying near-duplicates (by comparing the number of bits by which they differ). The ~19-digit integer is just the base-10 representation of the fingerprint.
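
To make the relationship between the 64-bit fingerprint and its decimal form concrete, here is a minimal sketch in plain Python. It does not call the simhash library at all. The helper name `differing_bits` and the value `other` are made up for illustration, and the fingerprint is the example value from this thread. The sketch assumes that "the number of bits by which they differ" means the Hamming distance, computed here as the popcount of the XOR.

```python
fingerprint = 8550830854347186281  # example value quoted earlier in this thread

# bin() drops leading zeros, so pad explicitly to see all 64 bit positions.
bits = format(fingerprint, "064b")
print(bits)
print(len(bits))  # 64

# The largest unsigned 64-bit value has 20 decimal digits, so a fingerprint
# prints as at most 20 digits in base 10 (often 19, as in this case).
print(len(str(2**64 - 1)))  # 20


def differing_bits(a, b):
    """Hypothetical helper: count the bit positions where a and b differ."""
    # XOR leaves a 1 exactly where the two integers disagree.
    return bin(a ^ b).count("1")


# Flip three low-order bits to simulate a near-duplicate fingerprint.
other = fingerprint ^ 0b1011
print(differing_bits(fingerprint, other))  # 3
```

The integer and the bit string are the same value written in two bases, so `int(bits, 2)` gives back the original integer. Nothing is converted for display purposes.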