seomoz / simhash-py

Simhash and near-duplicate detection
MIT License

Explanation in README.md is not comprehensive. #24

Open · Fazel94 opened this issue 8 years ago

Fazel94 commented 8 years ago

How can I actually use this library to detect duplicate documents? The `insert` function only accepts integers, so I suppose I should implement minhash myself (is that true??) and then use your simhash library to detect near-duplicates. How can I connect this to the simhash-db you provided? Is that something users have to implement themselves?? Sorry for being a noob in the near-duplicate detection business :smile: and thank you for sharing your work with us :+1:

darentanDZ commented 6 years ago

Bumping this; same question!

rth commented 6 years ago

See this blog post, https://moz.com/devblog/near-duplicate-detection/, for a more general introduction, and feel free to make a PR to improve the README.

Here is an example of an estimator that uses simhash-py for near-duplicate detection; it takes as input a sparse CSR matrix created by scikit-learn's CountVectorizer or HashingVectorizer. I'm not saying there aren't better ways to use it, though.
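For the curious, the shape of that approach might look something like this (my own rough reconstruction, not the linked estimator's actual code; in particular, the multiplier trick for spreading small feature indices across 64 bits is an assumption, not part of either library):

from sklearn.feature_extraction.text import HashingVectorizer
from simhash import compute

docs = ['the cat sat on the mat', 'the cat sat on a mat', 'something else']
X = HashingVectorizer(n_features=2 ** 20).fit_transform(docs)  # sparse CSR

simhashes = []
for i in range(X.shape[0]):
    # The nonzero feature indices of document i stand in for token hashes
    cols = X.indices[X.indptr[i]:X.indptr[i + 1]]
    # Spread the small indices over 64 bits so the simhash bits are balanced
    simhashes.append(compute([(int(c) * 0x9E3779B97F4A7C15) & (2 ** 64 - 1) for c in cols]))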

dlecocq commented 6 years ago

I think the most confusing aspect of how simhashing is talked about is the fact that there are two aspects to it, but the term 'simhash' is often used to refer to both: 1) a function to generate a simhash from input, and 2) methods for identifying near-duplicates from a set of simhashes.

To the first aspect, the input to the simhash function is a vector of hashes representative of the entity. There aren't really constraints on the size of this vector, but how it is generated has the greatest impact on the quality of the near-duplicates found. The Moz blog post that @rth referenced describes a number of ways in which that vector might be generated, but hashing a rolling window of words is a pretty reasonable approach.
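For example, here is one hypothetical way to build such a vector from a rolling window of three words (a generic illustration using md5, not this library's helpers):

import hashlib

def window_hashes(words, width=3):
    """One 64-bit hash per rolling window of `width` consecutive words."""
    return [
        int.from_bytes(
            hashlib.md5(' '.join(words[i:i + width]).encode('utf-8')).digest()[:8],
            'big')
        for i in range(len(words) - width + 1)
    ]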

To the second aspect, simhash provides a way to identify all the pairs of simhashes that are 'close enough' to be considered near-duplicates. This was the most fun and interesting part to code up, but it is agnostic to how the simhashes were generated, so it doesn't have any impact on the quality of the matches found.

It is also important to note that when running the near-duplicate check, all the input simhashes must have been generated with the same method for the results to be meaningful. For example, if you have two simhashes, one generated from a bag of words and the other from shingles, they are not comparable.
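As a concrete sketch of that second aspect (the helper names find_all and num_differing_bits are taken from this repo's Python API as I remember it; double-check the exact signatures against the source):

from simhash import find_all, num_differing_bits

# hashes: a list of 64-bit simhashes, all generated with the same method.
# Find every pair differing in at most 3 bits, using 6 permutation blocks
# (the number of blocks must exceed the allowed bit difference).
for a, b in find_all(hashes, 6, 3):
    print('%016x and %016x differ by %d bits' % (a, b, num_differing_bits(a, b)))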

I will try very hard to find some time to expand the README for this repo, since I, too, find it to be less than useful.

bikashg commented 6 years ago

@dlecocq : I think the first aspect you described is about generating a minhash of the document (not a simhash, as you mentioned). The second aspect is about taking the minhashes of the documents and predicting whether they are similar or not. "Simhash" is the total process, which should cover both the first and the second aspects you talked about.

Given that this library is named "simhash", I expected to feed it a list of documents and obtain their pairwise similarity as the output. But it seems that this library does only the second aspect and leaves the user to do the first part on their own.

BTW, the blog post link mentioned by @rth is dead. Does anyone have an updated link?

rth commented 6 years ago

Yes, the original link to the seomoz blog now returns ERR_SPDY_PROTOCOL_ERROR for me as well.

Here is the cached URL.

dlecocq commented 6 years ago

@bikashg - making an effective simhash for a document depends greatly on the types of documents used and on what aspect of them is meant to be matched, so a single method that works well in all situations is not practical. That said, there is a top-level shingle helper that can help quite a bit:

import re
from simhash import compute, shingle, unsigned_hash

# If `document` is a string of words, this provides a VERY basic
# tokenization and subsequent simhash. Each shingle (a rolling window
# of consecutive tokens) is joined back into a single string before
# hashing; the encode() is for Python 3, where bytes are expected.
tokens = re.split(r'\W+', document)
hashes = [unsigned_hash(' '.join(s).encode('utf-8')) for s in shingle(tokens)]
compute(hashes)

With regard to 'only' performing the second aspect: this library also provides the ability to find all the matches once the simhashes have been generated.
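For @bikashg's pairwise-similarity use case, a brute-force version over a small corpus could look like this (a minimal sketch assuming simhashes built as above; for large corpora, find_all is the scalable route):

from itertools import combinations
from simhash import num_differing_bits

# simhashes: {doc_id: 64-bit simhash}, all built with the same method
for (id_a, h_a), (id_b, h_b) in combinations(simhashes.items(), 2):
    if num_differing_bits(h_a, h_b) <= 3:  # a common threshold for 64 bits
        print(id_a, 'and', id_b, 'look like near-duplicates')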

bikashg commented 6 years ago

@dlecocq : Thanks for the explanation. I also went through the test cases in test.py, which further helped me understand the whole pipeline. My use of the word 'only' was probably harsh; sorry for that, I didn't mean it that way. Cheers for putting in the effort to build this library.

b4hand commented 6 years ago

The link to https://moz.com/devblog/near-duplicate-detection/ still works for me as of this morning. Maybe it was a transient failure?

rth commented 6 years ago

@b4hand For me it also works with wget and Firefox 57, but not Chrome 59 (see the error above); probably some SPDY-related server misconfiguration.

b4hand commented 6 years ago

As far as I know, we're not using SPDY. I've reported the issue to the team in charge of that page, so it should get fixed. Thanks for letting us know about the issue.

dlecocq commented 6 years ago

@bikashg - no worries. Maybe we could add a 'quickstart' section to the README? I'd welcome a PR, as it's sometimes hard to write good onboarding documentation after having worked on all the internals of a library.

@b4hand - https://moz.com/devblog/near-duplicate-detection/ does not work for me either, using Chrome 66.0.3359.117 on macOS 10.13.4.

b4hand commented 5 years ago

FYI, the devblog link should be working again for Chrome.