Closed zzsi closed 5 years ago
Added. Some points that came out of this:
Closing the issue now, but reopen if it doesn't get at what you're after.
We're also going to want to avoid using this method if a document label is detected, since documents all tend to look perceptually similar to one another.
Similar to `find_duplicates`, which uses an md5 hash, sometimes we also want to find near duplicates generated by:

This can be done with a perceptual hash (e.g. https://github.com/jgraving/imagehash), or with nearest-neighbor search over pretrained embeddings.
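To make the perceptual hash idea concrete, here is a minimal sketch of an average hash in plain Python. It is not the implementation from the linked library or from this project: the function names `average_hash`, `hamming_distance`, and `is_near_duplicate`, the input format (a grid of grayscale values), and the `max_distance=5` cutoff are all assumptions chosen for illustration.

```python
from typing import List, Sequence

# Assumed input format: rows of grayscale pixel values (e.g. 0-255).
Gray = Sequence[Sequence[float]]


def average_hash(image: Gray, hash_size: int = 8) -> int:
    """Block-average the image down to hash_size x hash_size cells,
    then set one bit per cell: 1 if the cell is brighter than the mean."""
    height, width = len(image), len(image[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    cells: List[float] = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))

    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Illustrative cutoff only; a real threshold needs tuning on your data."""
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# A horizontal gradient, a uniformly brightened copy, and a vertical gradient.
horizontal = [[c * 16 for c in range(16)] for _ in range(16)]
brighter = [[v + 10 for v in row] for row in horizontal]
vertical = [[r * 16 for _ in range(16)] for r in range(16)]

print(is_near_duplicate(horizontal, brighter))  # brightness shift only
print(is_near_duplicate(horizontal, vertical))  # different content
```

Unlike an md5 hash, which changes completely when a single byte changes, this hash compares each cell to the image's own mean, so a uniform brightness shift leaves it unchanged.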
The goal is to reduce train-test data leakage: duplicate or near-duplicate examples that appear in both the training and test sets.
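A leakage check along these lines could be sketched with the nearest-neighbor approach: for each test example, find its closest training example in embedding space and flag the pair if they are too similar. The sketch below assumes the embeddings have already been computed by some pretrained model (none is specified in this thread); `find_leaks`, the dict-of-vectors input, and the `0.95` threshold are hypothetical, and the brute-force search is only suitable for small datasets.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_leaks(
    train: Dict[str, Sequence[float]],
    test: Dict[str, Sequence[float]],
    threshold: float = 0.95,  # illustrative value, not a recommendation
) -> List[Tuple[str, str, float]]:
    """Return (test_id, nearest_train_id, similarity) for each test example
    whose nearest training neighbor is at least `threshold` similar."""
    leaks = []
    for test_id, test_vec in test.items():
        best_id, best_sim = None, -1.0
        for train_id, train_vec in train.items():
            sim = cosine_similarity(test_vec, train_vec)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_id is not None and best_sim >= threshold:
            leaks.append((test_id, best_id, best_sim))
    return leaks


# Toy embeddings standing in for the output of a pretrained model.
train_embeddings = {"train_a": [1.0, 0.0, 0.0], "train_b": [0.0, 1.0, 0.0]}
test_embeddings = {"test_x": [0.99, 0.05, 0.0], "test_y": [0.0, 0.0, 1.0]}

leaks = find_leaks(train_embeddings, test_embeddings)
for test_id, train_id, sim in leaks:
    print(f"{test_id} is close to {train_id} (cosine similarity {sim:.3f})")
```

Flagged test examples can then be reviewed and either dropped from the test set or moved so that both copies sit on the same side of the split.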