seomoz / simhash-py

Simhash and near-duplicate detection
MIT License
408 stars, 115 forks

Is simhash-cpp 100x faster than simhash-py? #21

Closed. ErnstHowie closed this issue 6 years ago

ErnstHowie commented 8 years ago

Hello:

Thanks very much for sharing this great code! Your work is wonderful. I have a question about the relative efficiency of simhash-cpp and simhash-py.

I installed simhash-cpp (https://github.com/seomoz/simhash-cpp) and simhash-py (https://github.com/seomoz/simhash-py) and ran the benchmarks. I got the following results:

(1) simhash-cpp:

    ../simhash-cpp/src$ ./bench 1000000 blocks=6, bits=3
    Inserting 1000000 hashes...
    Running 4000000 queries...
    Queries complete with 0 errors
    Running time: total=0.705171s, avg=0.17629275us
    There are 9999999 items in the table

(2) simhash-py:

    ../simhash-py/bench.py --random 1000000 --blocks 6 --bits 3
    Generating 1000000 hashes
    Generating 1000000 queries
    Starting Bulk Insertion
    Ran Bulk Insertion in 7.518402s, avg: 7.518402us
    Starting Bulk Find First
    Ran Bulk Find First in 13.021438s, avg: 13.021438us
    Starting Bulk Find All
    Ran Bulk Find All in 14.687295s, avg: 14.687295us
    Starting Bulk Removal
    Ran Bulk Removal in 8.982185s, avg: 8.982185us

Based on the above results, the average time per query is 0.17629275us for simhash-cpp (over 4000000 queries) and 13.021438us for simhash-py (Bulk Find First, over 1000000 queries). That is a ratio of about 74, so simhash-cpp is close to 100x faster than simhash-py. However, I checked the simhash-py code and found that it is actually built on simhash-cpp. In my view, simhash-py is just a Python wrapper around simhash-cpp. So I would expect simhash-py to be slower than simhash-cpp, but the difference should not be anywhere near 100x. My question is: why is simhash-cpp about 100x faster than simhash-py? I don't know whether my understanding is right or whether I missed something. If I did something wrong, please correct me!
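For reference, the comparison can be recomputed from the totals printed by the two benchmarks. This is only a minimal sketch of the arithmetic: the variable names and the `avg_us` helper are made up for illustration, the numbers are copied from the runs quoted in this comment, and it assumes Bulk Find First is the simhash-py operation to compare against the simhash-cpp query time.

```python
# Totals and counts copied from the benchmark output quoted in this comment.
cpp_total_s = 0.705171        # simhash-cpp: total running time
cpp_queries = 4_000_000       # simhash-cpp: "Running 4000000 queries..."
py_find_first_s = 13.021438   # simhash-py: "Ran Bulk Find First in 13.021438s"
py_queries = 1_000_000        # simhash-py: "Generating 1000000 queries"


def avg_us(total_seconds, count):
    """Average time per operation in microseconds (hypothetical helper)."""
    return total_seconds / count * 1e6


cpp_avg = avg_us(cpp_total_s, cpp_queries)
py_avg = avg_us(py_find_first_s, py_queries)
ratio = py_avg / cpp_avg

print(f"simhash-cpp: {cpp_avg:.8f} us per query")
print(f"simhash-py:  {py_avg:.6f} us per query")
print(f"simhash-py is about {ratio:.0f}x slower per query")
```

The two runs do not issue the same number of queries (4000000 versus 1000000), so the per-query averages are compared here rather than the totals.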

Thanks!

dlecocq commented 8 years ago

You're right -- this seems odd. I would not have expected the difference to be nearly that substantial. I could understand a difference of 2 or 3x, but not 100x. I'll see if I can reproduce it.

ErnstHowie commented 8 years ago

Thanks, @dlecocq! Looking forward to your feedback!

pombredanne commented 6 years ago

?