idealo / imagededup

😎 Finding duplicate images made easy!
https://idealo.github.io/imagededup/
Apache License 2.0
5.18k stars · 459 forks

Deduping large amounts of images (1mil+) #191

Open FluffyDiscord opened 1 year ago

FluffyDiscord commented 1 year ago

AFAIK we can encode single or multiple images at once and collect all the encodings, but then we need to pass the whole encoding dictionary to find duplicates.
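In other words, something like the documented in-memory workflow (a minimal sketch; paths are placeholders):

```python
# Current in-memory workflow: every encoding is held in one dict and the whole
# dict is handed to find_duplicates at once. Paths are placeholders.
from imagededup.methods import PHash

phasher = PHash()
encodings = phasher.encode_images(image_dir='path/to/images')        # {filename: hex hash}
duplicates = phasher.find_duplicates(encoding_map=encodings,
                                     max_distance_threshold=10)      # whole map in memory
```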

Would it be possible to add an option to save encodings to a file in a format such as HDF5, then reference that file as the dictionary for deduping, so that the whole file is not loaded into memory but is instead streamed / processed in batches or loops? For example: load the first 1000 encodings, look for duplicates within this batch, hold onto only the closest scores, then repeat for the next 1000 encodings, and so on.
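This isn't something imagededup offers today, so just to make the idea concrete, here is a rough sketch of the batching I have in mind, assuming 64-bit PHash hex strings, h5py for storage, and a plain Hamming-distance comparison (the dataset layout, names and thresholds here are mine, not library API):

```python
# Hypothetical workaround, not an imagededup feature: persist 64-bit perceptual
# hashes to HDF5 and compare them batch against batch, so only two batches of
# hashes are ever held in memory at once.
import h5py
import numpy as np
from imagededup.methods import PHash

BATCH = 1000
THRESHOLD = 10  # max Hamming distance at which two images count as duplicates

def save_encodings(image_dir, out_file='encodings.h5'):
    phasher = PHash()
    encodings = phasher.encode_images(image_dir=image_dir)      # {filename: 16-char hex hash}
    names = list(encodings)
    hashes = np.stack([np.frombuffer(bytes.fromhex(encodings[n]), dtype=np.uint8)
                       for n in names])                          # shape (N, 8): one 64-bit hash per row
    with h5py.File(out_file, 'w') as f:
        f.create_dataset('hashes', data=hashes)
        f.create_dataset('names', data=names, dtype=h5py.string_dtype())

def hamming(a, b):
    # a: (m, 8) uint8, b: (n, 8) uint8 -> (m, n) pairwise Hamming distances
    xor = np.bitwise_xor(a[:, None, :], b[None, :, :])
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def find_duplicates_batched(h5_file='encodings.h5'):
    with h5py.File(h5_file, 'r') as f:
        hashes = f['hashes']
        names = [x.decode() for x in f['names'][...]]
        n = hashes.shape[0]
        for i in range(0, n, BATCH):
            left = hashes[i:i + BATCH]                 # only two batches resident at a time
            for j in range(i, n, BATCH):
                dist = hamming(left, hashes[j:j + BATCH])
                for r, c in zip(*np.where(dist <= THRESHOLD)):
                    if i + r < j + c:                  # skip self-pairs and mirrored pairs
                        yield names[i + r], names[j + c], int(dist[r, c])
```

Memory stays flat regardless of collection size, since only two batches of hashes are ever resident; the trade-off is that the number of batch pairs still grows quadratically, which is where an approximate index would eventually be needed.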

I already have around half a million images, which will grow to ~4 million, and I need to dedupe them all. I also need to be able to check a single image for duplicates when it is added later on. Running the whole encoding/dedupe process for every new image, or loading all saved encodings into memory, is not an option under these conditions.
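For that single-image case, a sketch that reuses the hypothetical encodings.h5 layout and the hamming() helper from above, so nothing is re-encoded and only one batch of stored hashes is in memory at a time:

```python
def check_new_image(image_path, h5_file='encodings.h5', threshold=10):
    # Encode just the new image and scan the stored hashes batch by batch.
    phasher = PHash()
    new_hash = phasher.encode_image(image_file=image_path)                 # 16-char hex
    query = np.frombuffer(bytes.fromhex(new_hash), dtype=np.uint8)[None, :]
    hits = []
    with h5py.File(h5_file, 'r') as f:
        hashes = f['hashes']
        names = [x.decode() for x in f['names'][...]]
        for i in range(0, hashes.shape[0], BATCH):
            dist = hamming(query, hashes[i:i + BATCH])[0]
            hits += [(names[i + k], int(d)) for k, d in enumerate(dist) if d <= threshold]
    return hits                                                            # [(filename, distance), ...]
```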

tanujjain commented 1 year ago

We're currently experimenting with some large-scale similarity frameworks that should be able to handle approximate deduplication of 1 million+ images, and we hope to make a release in 2-3 months. Some of these frameworks already have the ability to handle memory constraints.

Streaming is a good idea for reducing memory usage, but it would most likely also come with a reduction in deduplication quality. From a feature-planning point of view, we'd prefer to finish experimenting with the large-scale similarity frameworks before looking into streaming.
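To give a feel for what such a framework buys, here is an illustration-only sketch of a Hamming-distance lookup using FAISS's binary index. FAISS is just one example of the category, not a statement about what we will end up shipping, and the random hashes below stand in for real imagededup encodings:

```python
# Illustration only: the kind of at-scale nearest-neighbour lookup a similarity
# framework provides, shown with FAISS's binary (Hamming) index over 64-bit codes.
import faiss
import numpy as np

# Fake 64-bit hashes (8 bytes each) standing in for real encodings.
hashes = np.random.randint(0, 256, size=(1_000_000, 8), dtype=np.uint8)

index = faiss.IndexBinaryFlat(64)        # exact Hamming search over 64-bit codes
index.add(hashes)

# 5 nearest neighbours per query; column 0 is the query itself (distance 0).
distances, neighbours = index.search(hashes[:5], k=5)
print(distances)
```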

I'll leave the issue open to keep track of the request.

FluffyDiscord commented 1 year ago

I am available for testing if needed, as I already have a huge collection of images ready to be deduplicated. My PC setup: RTX 4090, 32GB RAM, and Windows or Linux (PopOS).

Thank you for your time

Joshfindit commented 1 year ago

One technique I currently use for deduplicating bit-for-bit identical files is hardlinking on-drive. It works excellently for large datasets as long as you architect it with that in mind.

To take an example from git: the filename is hash + filesize, and the files are stored in subfolders derived from the start of the filename (this avoids OS issues when a single folder has “too many files”). So a 200KB file with the SHA hash cd611130182d1b9bd84955e07ca5270df9a09640 becomes cd/61/11/30/18/cd611130182d1b9bd84955e07ca5270df9a09640.200000

Lookups are at drive speed when comparing a file that’s just been hashed.
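A minimal sketch of that layout in Python (the store root is a placeholder, and this only catches exact, bit-for-bit duplicates):

```python
# Content-addressed store: name each file <sha1>.<size>, shard the first bytes
# of the hash into subfolders, and hardlink into the store. A path collision
# means a bit-for-bit duplicate already exists.
import hashlib
import os

STORE = 'dedup_store'   # placeholder root directory

def store_path(path):
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha1.update(chunk)
    digest = sha1.hexdigest()
    size = os.path.getsize(path)
    # cd611130... -> cd/61/11/30/18/cd611130....200000
    shards = [digest[i:i + 2] for i in range(0, 10, 2)]
    return os.path.join(STORE, *shards, f'{digest}.{size}')

def add_file(path):
    target = store_path(path)
    if os.path.exists(target):
        return True                        # bit-for-bit duplicate already stored
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.link(path, target)                  # hardlink, so no extra disk space is used
    return False
```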

This does not cover images that share a perceptual hash or are perceptually the same, but a script could be written with the same concepts, in a way that uses very little memory, as a short-term tool until imagededup can handle pools that large.
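The simplest version of such a script would key the same sharded store on the perceptual hash instead of the content hash. The big assumption: this only catches images whose phash is exactly identical; near-duplicates within some Hamming distance still need a real comparison.

```python
# Same idea keyed on the perceptual hash: images with an identical phash end up
# hardlinked into the same folder. Near-duplicates are not caught this way.
import os
from imagededup.methods import PHash

phasher = PHash()
PHASH_STORE = 'phash_store'   # placeholder root directory

def add_image(image_path):
    digest = phasher.encode_image(image_file=image_path)        # 16-char hex phash
    shards = [digest[i:i + 2] for i in range(0, 6, 2)]
    key_dir = os.path.join(PHASH_STORE, *shards, digest)        # one folder per exact phash
    os.makedirs(key_dir, exist_ok=True)
    target = os.path.join(key_dir, os.path.basename(image_path))
    if not os.path.exists(target):
        os.link(image_path, target)
    return os.listdir(key_dir)   # more than one entry => images sharing an identical phash
```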

juhonkang commented 1 year ago

@Joshfindit Could we connect? I also have the same questions about large datasets and want to ask you :)

Joshfindit commented 1 year ago

@juhonkang Sure. Emailed your gmail.

ming076 commented 1 year ago

@tanujjain Excuse me, I wonder whether the release for deduping large amounts of images is available now?

jzx-gooner commented 8 months ago

@tanujjain Cool work! Looking forward to the new release, and I can help to test!

sezan92 commented 1 month ago

Is this feature released for large datasets?