seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
406 stars 115 forks source link

Results of simhash.find_all() #44

Open bikashg opened 6 years ago

bikashg commented 6 years ago

I have the following code where 'a' and 'b' have the same value. However, they are not considered as simhash pairs by the simhash.find_all() method.

a = 8550830854347186281 b = 8550830854347186281

print ("Inputs differ in "+str(simhash.num_differing_bits(a, b))+" bits.")

all_simhash_pairs = simhash.find_all([a,b], 2, 1) print ("Simhash Pair counts = "+str(len(all_simhash_pairs)))

Shouldn't they be included in the results? Or, maybe I didn't understand simhashing properly.

dlecocq commented 6 years ago

They should, but the current implementation in simhash-cpp is the trouble. I believe @neilmb recognized this fact a while back.

I seem to recall the use of std::unordered_map as being a deliberate choice (https://github.com/seomoz/simhash-cpp/blob/master/src/simhash.cpp#L57), but at the moment I don't see why it couldn't be a std::vector. Certainly in the body of the code a std::vector is used.

b4hand commented 6 years ago

Oddly, this line seems to indicate that the input is a vector. However, at least as far as the repo history is concerned the argument has always been an unordered_map.

b4hand commented 6 years ago

It also seems odd to me that the input hashes are passed as a non-const reference, especially since the code doesn't seem to modify that argument directly, but rather only copies from it.

MatFluor commented 5 years ago

Any news on that?

I came across the same behavior - two identical hashes are not seen as duplicates by find_all

dlecocq commented 5 years ago

It seems to me that the easiest fix is to change it from unordered_set to vector in simhash-cpp's find_all. I have the bandwidth for review, but not the PR itself, and it doesn't look like I have merge powers on either repo, either :-/

nslallu commented 4 years ago

Any update on this issue? Would be great if it's fixed.

sushilr007 commented 3 years ago

@dlecocq I changed unordered_set to vector but getting an error while installaing

simhash.cpp

Simhash::matches_t Simhash::find_all(
    std::vector<Simhash::hash_t>& hashes,
    size_t number_of_blocks,
    size_t different_bits)
{

Getting error:

simhash/simhash-cpp/src/simhash.cpp:59:26: error: ‘Simhash::matches_t Simhash::find_all(std::vector<long unsigned int>&, size_t, size_t)’ should have been declared inside ‘Simhash’
   59 |     size_t different_bits)
      |                          ^

can you show me how you made it?