seomoz / simhash-py

Simhash and near-duplicate detection
MIT License

Explanation in README.md should be more descriptive #47

Open sany2k8 opened 5 years ago

sany2k8 commented 5 years ago

This project is awesome, but when I try to use it for my own purpose, detecting near-duplicate documents (e.g. JSON), I can't find enough information on how to do that. The README only shows how to compute:

import simhash

a = simhash.compute(...) 
b = simhash.compute(...)
simhash.num_differing_bits(a, b)

or how to find matches using:

import simhash

hashes = []
blocks = 4
distance = 3
matches = simhash.find_all(hashes, blocks, distance)

but before that, how can I make hashes of my documents? Can anyone update the README.md or post a full step-by-step example/tutorial on how to use simhash from Python?

wmelton commented 5 years ago

@sany2k8 - You will likely want to strip out any syntax that is for transport or formatting before trying to compute hashes.

Additionally, it should be fairly obvious from the example.

a = simhash.compute("I've been working on the railroad")
b = simhash.compute("I've been working on the railroad, all the live long day.")

print(simhash.num_differing_bits(a, b))

a and b are the hashes of your 'documents' in this case.

If you have JSON documents, you will want to pass in the object property that represents the textual content you are wanting to compare against.
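For instance, a rough sketch of that idea (the "body" field name here is just a placeholder for whatever property actually holds your text):

import json

raw = '{"id": 17, "title": "Railroad", "body": "I\'ve been working on the railroad"}'
doc = json.loads(raw)

# Compare only the textual content, not the surrounding JSON syntax
text = doc["body"]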

dlecocq commented 5 years ago

Unfortunately, it's a bit more complicated than that. Getting good results with simhash is highly dependent on the input to simhash.compute, and that input is highly nuanced.

It's an old post, but our original post on the subject will help to provide a lot of context for what I'll talk about here (https://moz.com/devblog/near-duplicate-detection/).

The input to compute is a list of 64-bit hashes that are representative of a document, and the output is a simhash of that input vector. So this leaves us with the task of converting a document into its representative hashes. For the sake of this discussion, let's suppose we're talking about text, and let's adopt "I've been working on the railroad, all the live long day" from above.
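Before getting into that, here's a bare sketch of the interface itself (the hash values below are made up, just to show the shape of the call):

import simhash

# compute takes a list of unsigned 64-bit hashes and returns a single 64-bit simhash
features = [0x1a2b3c4d5e6f7081, 0x0123456789abcdef, 0xfedcba9876543210]
fingerprint = simhash.compute(features)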

Tokenizing the text is a great place to start. This probably involves removing cruft, perhaps punctuation, maybe doing some canonicalization like downcasing where appropriate, and perhaps unicode normalization. This is far from a complete tokenization method, but it's good enough for us to use to talk about:

def tokenize(doc):
    import re
    # Strip punctuation, downcase, and split on whitespace
    return re.split(r'\s+', re.sub(r'[^\w\s]', '', doc.lower()))

When we tokenize our string, we get:

['ive', 'been', 'working', 'on', 'the', 'railroad', 'all', 'the', 'live', 'long', 'day']

We could just take the hash of each of these words and use those hashes as the input. In some ways, that is representative of our document - if we were to change the words, it would change the vector of hashes. But simhash doesn't care about the order of the input hashes, so it doesn't capture the order of the words:

similar = "I've been working, all the live long day, on the railroad!"
different = "And now for something completely different"
document = "I've been working on the railroad, all the live long day"

def makeSimhash(doc):
    import ctypes
    # They need to be unsigned 64-bit ints
    return simhash.compute([ctypes.c_ulong(hash(token)).value for token in tokenize(doc)])

>>> makeSimhash(similar)
10533350132719349034L
>>> makeSimhash(document)
10533350132719349034L
>>> makeSimhash(different)
9341659116513922L

# These first two documents would be considered identical with this hashing scheme
>>> simhash.num_differing_bits(makeSimhash(similar), makeSimhash(document))
0
>>> simhash.num_differing_bits(makeSimhash(different), makeSimhash(document))
28

So, while we do get different outputs for very different inputs, there's still the problem of word order. The easiest way to solve this is to use overlapping windows of tokens. Let's take three consecutive tokens at a time and get a hash of those. For our example, that would give us:

[['ive', 'been', 'working'], ['been', 'working', 'on'], ['working', 'on', 'the'], ...]

These are often referred to as shingles because of the way they span multiple tokens and overlap. To this end, there's a simhash.shingle helper. We give it our tokens and then we give it the number of tokens we want each to span. The more tokens each shingle spans, the more context about word order we have, but it also runs the risk of being overly specific in its matching. At Moz, we found 3 or 4 to be a good window choice, if I recall correctly, and that's what's often used in the literature.

>>> list(simhash.shingle(tokenize(document), 4))
[['ive', 'been', 'working', 'on'], ['been', 'working', 'on', 'the'], ['working', 'on', 'the', 'railroad'], ['on', 'the', 'railroad', 'all'], ['the', 'railroad', 'all', 'the'], ['railroad', 'all', 'the', 'live'], ['all', 'the', 'live', 'long'], ['the', 'live', 'long', 'day']]

Let's adjust our makeSimhash to use this:

def makeSimhash(doc):
    import ctypes
    # A generator for ' '-joined strings of consecutive tokens
    shingles = (' '.join(tokens) for tokens in simhash.shingle(tokenize(doc), 4))
    # They need to be unsigned 64-bit ints
    return simhash.compute([ctypes.c_ulong(hash(shingle)).value for shingle in shingles])

>>> makeSimhash(similar)
14078824512481347173L
>>> makeSimhash(document)
15278483820769201676L
>>> makeSimhash(different)
5232801312017969235L

And now we see that these are indeed different (fewer differing bits means more similar):

>>> simhash.num_differing_bits(makeSimhash(similar), makeSimhash(document))
26
>>> simhash.num_differing_bits(makeSimhash(different), makeSimhash(document))
34

The threshold for similarity is also something that you need to figure out, and it is also contextually dependent. Ideally, you'd take a corpus of documents of the kind you'll be comparing, categorize matching pairs within that, and then do some analysis of what window sizes work best for detection. You also need to consider the precision and recall (https://en.wikipedia.org/wiki/Precision_and_recall) of your window choice. To determine the matching pairs of a corpus, you would probably want to use minhash - it produces very high-quality results, but it takes O(n^2) time to find all matches in a corpus, whereas simhash takes O(n log(n)) time. Simhash is also less memory-intensive and more compact.
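To tie this back to the find_all snippet from the original question, here's a rough sketch of how the simhashes produced by makeSimhash above could be fed in; the return format (pairs of hashes within the given number of differing bits) is my reading of the README, so treat it as an assumption:

# Using makeSimhash and the three example strings defined above
hashes = [makeSimhash(doc) for doc in (similar, different, document)]

blocks = 4     # number of blocks to split each 64-bit hash into; must be greater than distance
distance = 3   # maximum number of differing bits to still count as a near-duplicate

# Assumed to return pairs of input hashes that differ by at most distance bits.
# With only these three example documents nothing will match at distance 3;
# on a real corpus you would tune blocks and distance against labeled pairs.
matches = simhash.find_all(hashes, blocks, distance)
print(matches)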

Hopefully this provides y'all with some idea of how to at least get going with simhash and illustrates why unfortunately it's not just ✨ magic ✨ .

Let me know if you've got any other questions!

wmelton commented 5 years ago

Apologies @dlecocq - I was under the (obviously incorrect) assumption that tokenization was happening automagically under the hood when calling the compute method. I thought I had seen a sliding-window tokenizer in the source code at some point in the past.

Great write-up though - thanks for the contribution!

dlecocq commented 5 years ago

@wmelton - I believe there indeed was a time when that was included. As I recall it, we removed it in part for simplicity and in part because we didn't want to take too many liberties / make too many assumptions about the inputs. This way it seems less like magic, but admittedly it makes it less clear.

I think the right balance is probably one in which a helper utility or two just works in a basic way on text.
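Something along those lines, purely as a hypothetical sketch of what such a helper might look like (it isn't part of the library), reusing the tokenize / shingle / hash approach from above:

import ctypes
import re

import simhash

def simhash_text(doc, window=4):
    # Hypothetical convenience helper: tokenize, shingle into overlapping
    # windows, hash each shingle, and compute the simhash
    tokens = re.split(r'\s+', re.sub(r'[^\w\s]', '', doc.lower()))
    shingles = (' '.join(chunk) for chunk in simhash.shingle(tokens, window))
    return simhash.compute([ctypes.c_ulong(hash(s)).value for s in shingles])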

wmelton commented 5 years ago

Yep - that makes perfect sense.

sany2k8 commented 5 years ago

@dlecocq You are a Rock Star :) & also thanks to @wmelton Sir... :)