kungfuai / kaishi

Tool kit to accelerate exploratory data analysis and data cleaning
https://kaishi.readthedocs.io/en/latest/
MIT License
nice to have `find_near_duplicates()` in image utilities #1

Closed zzsi closed 5 years ago

zzsi commented 5 years ago

Similar to `find_duplicates`, which uses an MD5 hash, it would sometimes be useful to also find near duplicates generated by:

This can be done with a perceptual hash (e.g. https://github.com/jgraving/imagehash), or with a nearest-neighbor search over pretrained embeddings.
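To make the perceptual-hash option concrete, here is a minimal sketch of an average hash in pure Python. It is not kaishi's implementation and does not use the `imagehash` API; the names `average_hash`, `hamming_distance`, `find_near_duplicates` and the `max_distance` threshold are assumptions for illustration, and images are assumed to already be decoded into rows of grayscale values.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Gray = Sequence[Sequence[float]]  # rows of grayscale pixel values


def average_hash(pixels: Gray, hash_size: int = 8) -> int:
    """Block-mean downsample to hash_size x hash_size, then threshold at the mean."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    cells: List[float] = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the average cell.
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def find_near_duplicates(images: Dict[str, Gray], max_distance: int = 5) -> List[Tuple[str, str]]:
    """Hypothetical helper: all pairs whose hashes differ by at most max_distance bits."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    return [
        (a, b)
        for (a, ha), (b, hb) in combinations(hashes.items(), 2)
        if hamming_distance(ha, hb) <= max_distance
    ]


# Synthetic 16x16 images: a gradient, a slightly brightened copy, and its transpose.
base = [[r * 15 + c for c in range(16)] for r in range(16)]
brighter = [[base[r][c] + 3 + (r + c) % 2 for c in range(16)] for r in range(16)]
transposed = [[base[c][r] for c in range(16)] for r in range(16)]

pairs = find_near_duplicates({"a": base, "a_bright": brighter, "b": transposed})
print(pairs)
```

The brightened copy would get a different MD5 digest than the original, but its average hash is unchanged, so only that pair is reported. The pairwise comparison is quadratic in the number of images, which is fine for a sketch but would need an index for large datasets.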

The goal is to reduce train-test data leakage, where duplicate or near-duplicate examples sit in both the training and test sets.
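A leakage check along these lines could compare hashes across the two splits rather than within one dataset. The sketch below assumes perceptual hashes have already been computed and stored as integers; the function name `find_cross_split_leaks`, the file names, the hash values, and the `max_distance` threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_cross_split_leaks(
    train_hashes: Dict[str, int],
    test_hashes: Dict[str, int],
    max_distance: int = 5,
) -> List[Tuple[str, str, int]]:
    """Return (train_name, test_name, distance) for near-duplicate pairs across splits."""
    leaks = []
    for test_name, test_hash in test_hashes.items():
        for train_name, train_hash in train_hashes.items():
            # Hamming distance between the two integer hashes.
            distance = bin(train_hash ^ test_hash).count("1")
            if distance <= max_distance:
                leaks.append((train_name, test_name, distance))
    return leaks


# Made-up 16-bit hashes, purely for illustration.
train = {
    "train/cat_001.jpg": 0b1111000011110000,
    "train/dog_007.jpg": 0b0000111100001111,
}
test = {
    "test/cat_resized.jpg": 0b1111000011110001,  # differs from cat_001 by one bit
    "test/bird_003.jpg": 0b1010101010101010,
}

leaks = find_cross_split_leaks(train, test, max_distance=2)
print(leaks)
```

Any pair this returns is a candidate for removal from one of the splits before training.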

mwharton3 commented 5 years ago

Added. Some points that came out of this:

Closing the issue now, but reopen if it doesn't get at what you're after.

mwharton3 commented 5 years ago

We're also going to want to avoid using this method if a document label is detected, since documents all tend to look perceptually similar.