Open bikashg opened 6 years ago
They should, but the current implementation in simhash-cpp
is the trouble. I believe @neilmb recognized this fact a while back.
I seem to recall the use of std::unordered_map
as being a deliberate choice (https://github.com/seomoz/simhash-cpp/blob/master/src/simhash.cpp#L57), but at the moment I don't see why it couldn't be a std::vector
. Certainly in the body of the code a std::vector
is used.
Oddly, this line seems to indicate that the input is a vector
. However, at least as far as the repo history is concerned the argument has always been an unordered_map
.
It also seems odd to me that the input hashes are passed as a non-const reference, especially since the code doesn't seem to modify that argument directly, but rather only copies from it.
Any news on that?
I came across the same behavior - two identical hashes are not seen as duplicates by find_all
It seems to me that the easiest fix is to change it from unordered_set
to vector
in simhash-cpp
's find_all
. I have the bandwidth for review, but not the PR itself, and it doesn't look like I have merge powers on either repo, either :-/
Any update on this issue? Would be great if it's fixed.
@dlecocq I changed unordered_set
to vector
but getting an error while installaing
simhash.cpp
Simhash::matches_t Simhash::find_all(
std::vector<Simhash::hash_t>& hashes,
size_t number_of_blocks,
size_t different_bits)
{
Getting error:
simhash/simhash-cpp/src/simhash.cpp:59:26: error: ‘Simhash::matches_t Simhash::find_all(std::vector<long unsigned int>&, size_t, size_t)’ should have been declared inside ‘Simhash’
59 | size_t different_bits)
| ^
can you show me how you made it?
I have the following code where 'a' and 'b' have the same value. However, they are not considered as simhash pairs by the simhash.find_all() method.
a = 8550830854347186281 b = 8550830854347186281
print ("Inputs differ in "+str(simhash.num_differing_bits(a, b))+" bits.")
all_simhash_pairs = simhash.find_all([a,b], 2, 1) print ("Simhash Pair counts = "+str(len(all_simhash_pairs)))
Shouldn't they be included in the results? Or, maybe I didn't understand simhashing properly.