seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
406 stars 115 forks source link

test_get_all fails #14

Closed eyvindn closed 7 years ago

eyvindn commented 9 years ago

For some reason, find/find_all works well works, but calling hashes() on a corpus results in completely wrong hashes being returned.

Running OSX 10.10.3 with Python 2.7.6, just cloned from the repo and followed the instructions.

Could it have something to do with 64 vs 32bit when compiling, which I am getting errors about?

During compile I get a lot of these errors:

warning: implicit conversion loses integer precision: 'uint64_t' (aka 'unsigned long long') to 'Simhash::hash_t' (aka 'unsigned long') [-Wshorten-64-to-32]

b4hand commented 9 years ago

simhash-py has only been tested in 64-bit environments. If you're compiling it for 32-bit, you're on your own for help.

eyvindn commented 9 years ago

I'm not compiling in a 32-bit environment, I investigated the issue further, and setup.py simply compiles a universal binary by default, causing some warnings during compilation.

However, forcing it to only compile for x86_64 doesn't help (obviously, as it was using the 64-bit implementation in the universal binary by default when I was running simhash).

Essentially the issue I'm having is that hese things don't work:

corpus = simhash.Corpus(6,3) corpus.insert(1) corpus.hashes()

Will return [0L] on every table in the corpus, regardless of what you first insert.

Also when you insert more into the corpus, more things definitely get added, but calling corpus.hashes()

Always returns the wrong hashes (besides the inital 0L, I believe they are the permuted versions of the hashes). Initially I thought this made sense, but upon further investigation it's pulling these from table 0, which should be the initial permutation of the hash (aka the untouched hash), so it doesn't really make sense.

eyvindn commented 9 years ago

The issue actually lies in simhash-cpp, namely the const_iterator is wrongly implemented and does not properly loop through the Judy array. I'll submit a fix.

b4hand commented 9 years ago

Could you please confirm that this has been fixed by pulling in recent updates from simhash-cpp in #19?

b4hand commented 7 years ago

I'm going to assume this was resolved with the prior PR since there's been no update on this ticket in a long while.