nice to have `find_near_duplicates()` in image utilities

kungfuai / kaishi

Tool kit to accelerate exploratory data analysis and data cleaning

https://kaishi.readthedocs.io/en/latest/

MIT License


Closed zzsi closed 5 years ago

zzsi commented 5 years ago

Similar to `find_duplicates`, which uses an MD5 hash, we sometimes also want to find near duplicates produced by:

- resizing a photo and then keeping multiple resolutions of the same image
- extracting video frames that are very close in timestamp, or that have no events happening between them

This can be done with a perceptual hash (e.g. https://github.com/jgraving/imagehash), or with a nearest-neighbor search over embeddings from a pretrained model.
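To make the perceptual-hash option concrete, here is a minimal sketch of an average hash in plain Python. It is not the API of the linked imagehash library and not what kaishi ships; the function names `average_hash` and `hamming_distance`, and the choice to take a 2D list of grayscale values as input, are assumptions made for illustration.

```python
def average_hash(pixels, hash_size=8):
    """Return a (hash_size * hash_size)-bit average hash of a grayscale image.

    `pixels` is a 2D list of grayscale values (rows of equal length) whose
    height and width are both at least `hash_size`. Hypothetical helper,
    not part of kaishi or imagehash.
    """
    height, width = len(pixels), len(pixels[0])
    # Downsample to a hash_size x hash_size grid by averaging each cell.
    cells = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images would then be treated as near duplicates when the Hamming distance between their hashes falls under a small threshold. Unlike an MD5 hash, the hash of a resized copy stays close to (often identical to) the hash of the original, because both are reduced to the same coarse brightness grid first.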

The goal is to reduce train-test data leakage: duplicate or near-duplicate examples that sit in both the training set and the test set.

mwharton3 commented 5 years ago

Added. Some points that came out of this:

- For efficiency, I'm only loading each image as a 64x64 thumbnail. This will have to be overridden later if we want to detect compression artifacts (maybe just keep a 64x64 patch as a sample?).
- `find_near_duplicates_by_value` from `kaishi.util.misc` is going to be slow for image datasets larger than ~1000 images. It will need to be optimized at some point.
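One possible direction for that optimization, sketched under assumptions: if the slowness comes from comparing every pair of hashes, and if the hashes are fixed-width integers compared by Hamming distance, then a banding (pigeonhole) index can cut down the number of comparisons. The function name `near_duplicate_pairs` and its parameters are hypothetical and are not part of `kaishi.util.misc`.

```python
from collections import defaultdict
from itertools import combinations


def near_duplicate_pairs(hashes, max_distance=4, n_bits=64):
    """Return sorted index pairs (i, j), i < j, whose hashes differ in at
    most `max_distance` bits.

    Pigeonhole argument: split each hash into max_distance + 1 bands. Two
    hashes that differ in at most max_distance bits must agree exactly on
    at least one band, so only hashes sharing a band need to be compared.
    """
    n_bands = max_distance + 1
    # Bit offsets that split n_bits into n_bands contiguous chunks.
    bounds = [k * n_bits // n_bands for k in range(n_bands + 1)]
    buckets = defaultdict(list)
    for idx, h in enumerate(hashes):
        for k in range(n_bands):
            lo, hi = bounds[k], bounds[k + 1]
            band = (h >> lo) & ((1 << (hi - lo)) - 1)
            buckets[(k, band)].append(idx)
    pairs = set()
    for members in buckets.values():
        # Verify each candidate pair with the full Hamming distance.
        for i, j in combinations(members, 2):
            if bin(hashes[i] ^ hashes[j]).count("1") <= max_distance:
                pairs.add((i, j))
    return sorted(pairs)
```

This returns the same pairs as an all-pairs scan. It only helps when the hashes are spread out; if most images share a band (for example, a dataset full of exact duplicates), the work inside each bucket is still quadratic.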

Closing the issue now, but reopen if it doesn't get at what you're after.

mwharton3 commented 5 years ago

We're also going to want to avoid using this method if a document label is detected, since document images are all going to look perceptually similar.