mbatchkarov closed this 7 years ago
This needs to be chopped up into smaller PRs - reviewing and commenting on all the changes under one PR is very impractical.
I agree, but it would take ages to break down all the changes into multiple PRs because they are quite intertwined. It would be quite hard to go back and undo changes one by one. I am happy to talk you through the PR if you want me to.
As discussed, here is a list of changes:
- `lsh` renamed to `cache` to avoid an import error.
- `is_duplicate` method. It was used in doctests but was not actually present in the code.
- `get_duplicates_of` method. Based on code from the example notebook. All candidates are also now subjected to a minimum Jaccard similarity test.
- `get_all_duplicates` method. This is also based on code from the notebook and runs in less than O(N^2).
- `remove` (by id or by content) and `clear` methods. These allow a document to be removed from the cache. The use case is something like "this doc was last seen a month ago, so maybe we should not report it as a duplicate".
- `meta` keyword, replaced by `doc_id`. Users can keep track of their own meta-information.
- `bands` / `bins` confusion. Only one of these is necessary.
- Serialisation to and from JSON + tests.
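To make the list above concrete, here is a minimal, self-contained sketch of how a banded min-hash cache with a Jaccard filter, removal, and clearing could fit together. It is not the code in this PR: `ToyCache`, `add_doc`, `remove_id`, the `min_jaccard` parameter, and the MD5-based hashing are all assumptions made for illustration. Only the names `is_duplicate`, `get_duplicates_of`, and `clear` come from the description above.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash(text, num_hashes=20):
    """One min-hash value per seed, taken over the character shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]


class ToyCache:
    """Hypothetical stand-in for the PR's cache, not its real code."""

    def __init__(self, num_hashes=20, num_bands=10):
        assert num_hashes % num_bands == 0
        self.num_hashes = num_hashes
        self.rows = num_hashes // num_bands
        self.buckets = defaultdict(set)  # (band offset, band values) -> doc ids
        self.docs = {}                   # doc id -> original text

    def _keys(self, text):
        # Split the signature into equal-sized bands; each band is a bucket key.
        sig = minhash(text, self.num_hashes)
        return [(i, tuple(sig[i:i + self.rows]))
                for i in range(0, self.num_hashes, self.rows)]

    def add_doc(self, text, doc_id):
        self.docs[doc_id] = text
        for key in self._keys(text):
            self.buckets[key].add(doc_id)

    def get_duplicates_of(self, text, min_jaccard=0.0):
        # Candidates share at least one band with the query...
        candidates = set()
        for key in self._keys(text):
            candidates |= self.buckets.get(key, set())
        # ...and must then also pass the exact Jaccard similarity test.
        query = shingles(text)
        return {d for d in candidates
                if jaccard(query, shingles(self.docs[d])) >= min_jaccard}

    def is_duplicate(self, text, min_jaccard=0.0):
        return bool(self.get_duplicates_of(text, min_jaccard))

    def remove_id(self, doc_id):
        # Forget a document so it is no longer reported as a duplicate.
        text = self.docs.pop(doc_id)
        for key in self._keys(text):
            self.buckets[key].discard(doc_id)

    def clear(self):
        self.buckets.clear()
        self.docs.clear()
```

The Jaccard filter is what turns the probabilistic candidate set into a precise answer: banding only proposes candidates cheaply, and the exact set comparison rejects false positives.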
Tests (100% coverage):
- `test_cache`: cache configured to err on the side of high recall. Checks whether replacing the last word of a document still makes it a duplicate. May fail every now and then because the cache is probabilistic.
- `test_num_bands`: adds near-duplicates to caches with different numbers of bands in a loop. Checks that the number of times a doc is picked up as a duplicate is a non-decreasing function of the number of bands. Averaging over a few iterations helps get over the issue of the previous test.
- `test_real_world_usage`: an excuse to insert a StarCraft reference into an open-source project.
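The monotonicity that `test_num_bands` checks empirically can be motivated with the textbook banding formula. The sketch below assumes the standard scheme, in which a fixed number of min-hash values is split into equal bands and two documents become candidates if any band matches exactly; whether the library uses exactly this scheme is an assumption, and `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(similarity, num_hashes, num_bands):
    """Chance that two documents share at least one band.

    Assumes `num_hashes` min-hash values split into `num_bands` bands of
    equal size, each hash agreeing with probability `similarity`.
    """
    rows = num_hashes // num_bands
    return 1.0 - (1.0 - similarity ** rows) ** num_bands


# With the signature length fixed at 100, more bands means fewer rows per
# band, so a full-band match becomes easier and recall cannot go down.
probs = [candidate_probability(0.8, 100, b) for b in (1, 2, 5, 10, 20, 50, 100)]
is_non_decreasing = all(a <= b for a, b in zip(probs, probs[1:]))
```

Because each individual run is random, the test averages over several iterations before comparing counts; the formula only describes the expected behaviour.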
Hi,
This is a preliminary PR that adds a few utility functions and a bunch of unit tests. I do not expect it to be merged yet, certainly not until I've updated the documentation. At this point I am interested in whether these changes match your vision of the project and make it easier to use. If they do, I'll spend some more time on the documentation. Here's an example of how one would use the library now (excerpt from `test_cache.py`):

Tests are