mattilyra / LSH

Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
MIT License

Tests and ease of use #2

Closed (mbatchkarov closed 7 years ago)

mbatchkarov commented 8 years ago

Hi,

This is a preliminary PR that adds a few utility functions and a bunch of unit tests. I do not expect it to be merged yet, at least not until I've updated the documentation. At this point I am interested in whether these changes match your vision of the project and make it easier to use. If they do, I'll spend some more time on the documentation. Here's an example of how one would use the library now (excerpt from test_cache.py):

lsh = Cache(MinHasher(seeds=200), num_bands=20)

a_doc = 'This is a simple document'
another_doc = 'Some text about animals.'

lsh.add_doc(a_doc, doc_id=0)
lsh.add_doc(another_doc, doc_id=1)
lsh.add_doc(a_doc, doc_id=2)

assert lsh.is_duplicate(another_doc)
assert lsh.get_duplicates_of(a_doc) == {0, 2}
assert lsh.get_all_duplicates() == {(0, 2)}
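
For anyone who wants to see roughly what the calls above do, here is a minimal, self-contained sketch of MinHash with banding. It is not the library's implementation: ToyCache, fingerprint and shingles are hypothetical names made up for illustration, and the 5-character shingles and the salted hashlib.blake2b hash are assumptions chosen only to keep the example dependency-free. Only the method names mirror the excerpt.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, size=5):
    """Return the set of overlapping character n-grams in `text`."""
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}


def fingerprint(text, num_seeds=200):
    """MinHash signature: for each seed, the smallest hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_seeds):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


class ToyCache:
    """Hypothetical stand-in for the Cache in the excerpt (illustration only)."""

    def __init__(self, num_seeds=200, num_bands=20):
        if num_seeds % num_bands:
            raise ValueError("num_seeds must be divisible by num_bands")
        self.num_seeds = num_seeds
        self.rows = num_seeds // num_bands  # signature values per band
        self.buckets = [defaultdict(set) for _ in range(num_bands)]

    def _band_keys(self, text):
        # Split the signature into consecutive bands; each band is one bucket key.
        sig = fingerprint(text, self.num_seeds)
        return [tuple(sig[i:i + self.rows]) for i in range(0, len(sig), self.rows)]

    def add_doc(self, text, doc_id):
        for bucket, key in zip(self.buckets, self._band_keys(text)):
            bucket[key].add(doc_id)

    def get_duplicates_of(self, text):
        # Any document sharing at least one band bucket is a candidate duplicate.
        found = set()
        for bucket, key in zip(self.buckets, self._band_keys(text)):
            found |= bucket.get(key, set())
        return found

    def is_duplicate(self, text):
        return bool(self.get_duplicates_of(text))

    def get_all_duplicates(self):
        pairs = set()
        for bucket in self.buckets:
            for ids in bucket.values():
                pairs.update(combinations(sorted(ids), 2))
        return pairs


toy = ToyCache(num_seeds=200, num_bands=20)
toy.add_doc('This is a simple document', doc_id=0)
toy.add_doc('Some text about animals.', doc_id=1)
toy.add_doc('This is a simple document', doc_id=2)
```

With 200 hash values split into 20 bands, two documents become candidates as soon as all 10 values in any single band agree, so more bands (fewer rows per band) makes the check more permissive and fewer bands makes it stricter.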

Tests are

mattilyra commented 8 years ago

This needs to be chopped up into smaller PRs - reviewing and commenting on all the changes under one PR is very impractical.

mbatchkarov commented 8 years ago

I agree, but it would take ages to break down all the changes into multiple PRs because they are quite intertwined. It would be quite hard to go back and undo changes one by one. I am happy to talk you through the PR if you want me to.

mbatchkarov commented 7 years ago

As discussed, here is a list of changes: