seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
406 stars 115 forks source link

Adding Python 3 compatibility #27

Closed rth closed 7 years ago

rth commented 7 years ago

This PR adds Python 3 compatibility to simhash-py,

rth commented 7 years ago

@b4hand Thanks for the feedback. I added an import of range/xrange from six as you requested.

b4hand commented 7 years ago

We still need to figure out what to do regarding the test failures before we can merge. I suspect that will get discussed further in #28.

rth commented 7 years ago

I added a commit to increase the mach_threshold from 3 to 5 bits on Python 3.4-3.5 to pass the failing test. Would this be an acceptable solution? I mean, from my limited understanding of this, it does not look like the implementation is not working for those Python versions, but just that particular example produces 5 different bits instead of 3 with the updated hash function. #28 can probably be addressed in a separate PR ?

b4hand commented 7 years ago

My recollection of that test is that it proves that a small change in the source text contributes to a small change in the hash. Doubling the bit difference may indicate an issue. I want to take a bit more time to look at the underlying values and make sure this is reasonable before merging it as is.

rth commented 7 years ago

@b4hand I was wondering if you had time to look into this Pull Request so it could be merged? I am aware that you had some concerns about the validity of the results with PY3; please let me know if I can do anything to address those (more tests etc).

From a user perspective, having a PY3 compatible version for simhash-py would be much appreciated. For instance, @kfrancoi recently did some work on porting to Python 3 in https://github.com/Sagacify/simhash-py/pull/2 fork. So I imagine it would better to have an official PY3 port (with if necessary some warning about the domain of validity) that would take into account your feedback and summarize work from different contributors, rather then multiple implementations in forks, each evolving independently..

As a side note, as far as I understood most of the calculations are done in your C++ library, so at least on Linux where both PY2 & PY3 would use gcc to compile this extension, the version of Python, I imagine, shouldn't matter so much beyond a few adjustments needed by this wrapper. On Windows, if the Microsoft Visual Studio compiler is used, with a version that depends on the Python version, that may be more problematic (cf. issue #29).

Thanks again!

b4hand commented 7 years ago

It appears we don't actually use simhash.unsigned_hash, so this really only impacts the test. Given that, I'm willing to merge this as is with the expectation that simhash.unsigned_hash will probably be removed at some point in the future since I don't think it's a good idea to rely on the built-in Python hash implementation unless we're going to use one of the "standard" hashing functions that can be language agnostic.

b4hand commented 7 years ago

BTW, sorry for the delay on getting this merged, but I've been pulled in a bunch of different directions over the last few months.

rth commented 7 years ago

Thanks a lot, @b4hand !