FluffyDiscord opened this issue 1 year ago
We're currently experimenting with some large-scale similarity frameworks that should be able to handle approximate deduplication of 1 million+ images, and we hope to make a release in 2-3 months. Some of these frameworks can already handle memory constraints.
Streaming is a good idea for memory reduction, but it will most likely also come with a reduction in deduplication quality. From a feature-planning point of view, we'd prefer to finish experimenting with the large-scale similarity frameworks before looking into streaming.
I'll leave the issue open to keep track of the request.
I am available for testing purposes if needed, as I already have a huge collection of images ready to be deduplicated. My PC setup: RTX 4090, 32 GB RAM, and Windows or Linux (Pop!_OS).
Thank you for your time
One technique I currently use for deduplicating bit-for-bit identical files is hardlinking on-drive. It works excellently for large datasets as long as you architect it with that in mind.
To take an example from git: the filename is the hash plus the file size, and files are stored in subfolders taken from the start of the filename (this avoids OS issues when a single folder has “too many files”). So a 200 KB file with the SHA hash of cd611130182d1b9bd84955e07ca5270df9a09640
becomes cd/61/11/30/18/cd611130182d1b9bd84955e07ca5270df9a09640.200000
Lookups run at drive speed when comparing a file that has just been hashed.
This does not cover images that share a perceptual hash or are perceptually the same, but a script using the same concepts could be written to use very little memory, as a short-term tool until imagededup can handle pools that large.
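A minimal sketch of that layout in Python, assuming SHA-1 and a local store root; `STORE_ROOT`, `content_path`, and `deduplicate` are illustrative names, not part of any library:

```python
import hashlib
import os

STORE_ROOT = "dedup_store"  # hypothetical root of the content-addressed store


def content_path(file_path: str) -> str:
    """Build the git-style path: nested two-char dirs + <sha1>.<size>."""
    sha1 = hashlib.sha1()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()
    size = os.path.getsize(file_path)
    # cd611130... -> cd/61/11/30/18/cd611130....200000
    subdirs = [digest[i:i + 2] for i in range(0, 10, 2)]
    return os.path.join(STORE_ROOT, *subdirs, f"{digest}.{size}")


def deduplicate(file_path: str) -> bool:
    """Hardlink the file into the store; return True if it was already there."""
    target = content_path(file_path)
    if os.path.exists(target):
        return True  # bit-for-bit duplicate already stored
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.link(file_path, target)  # hardlink, so no extra disk space is used
    return False
```

Memory stays flat because only one hash and one path are held at a time, and lookups are just filesystem stat calls.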
@Joshfindit could we connect? I also have the same questions about large datasets and would like to ask you :)
@juhonkang Sure. Emailed your gmail.
@tanujjain Excuse me, I wonder whether the release for deduping large amounts of images is available now?
@tanujjain Cool work! Looking forward to the new release, and I can help test!
Has this feature for large datasets been released?
AFAIK we can encode single or multiple images at once, collect all the encodings, and then we need to pass the whole encoding dictionary to find duplicates.
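For reference, the current flow I mean is roughly this (using the documented imagededup methods):

```python
from imagededup.methods import CNN

encoder = CNN()
# Encodings for the whole directory end up in one in-memory dictionary...
encodings = encoder.encode_images(image_dir="path/to/images")
# ...which then has to be passed as a whole to find the duplicates.
duplicates = encoder.find_duplicates(encoding_map=encodings)
```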
Would it be possible to add an option to save encodings to a file in a format such as HDF5, then reference that file as the dictionary for deduping, and make it so that the whole file is not loaded into memory but is streamed and worked through in batches/loops? For example: load the first 1000 encodings, try to find duplicates in this batch, hold onto only the closest scores, then repeat for the next 1000 encodings, and so on.
I already have around half a million images that will grow to ~4 million, and I need to dedupe them all. I also need to be able to check a single image for duplicates when it is added later on. Running the whole encoding/dedupe process for every new image, or loading all the saved encodings into memory, is not an option under these conditions.
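To illustrate the batching idea (this is not imagededup's API, just a rough sketch assuming the CNN encodings and their filenames were saved to an HDF5 file under hypothetical datasets named `encodings` and `keys`, with cosine similarity as the score):

```python
import h5py
import numpy as np

BATCH = 1000        # how many encodings to hold in memory at once
THRESHOLD = 0.9     # cosine-similarity cutoff, would need tuning


def _normalize(block: np.ndarray) -> np.ndarray:
    """L2-normalize rows so a dot product becomes cosine similarity."""
    block = block.astype(np.float32)
    return block / np.linalg.norm(block, axis=1, keepdims=True)


def find_duplicates_streamed(h5_path: str):
    """Compare encodings block-by-block; only two BATCH-sized blocks are ever in RAM."""
    dupes = []
    with h5py.File(h5_path, "r") as f:
        enc = f["encodings"]                       # (n_images, dim), stays on disk
        keys = [k.decode() for k in f["keys"][:]]  # filenames are small, keep them all
        n = enc.shape[0]
        for i in range(0, n, BATCH):
            a = _normalize(enc[i:i + BATCH])
            for j in range(i, n, BATCH):
                b = _normalize(enc[j:j + BATCH])
                sims = a @ b.T                     # one BATCH x BATCH score block
                rows, cols = np.where(sims >= THRESHOLD)
                for r, c in zip(rows, cols):
                    if i + r < j + c:              # skip self-pairs and mirrored pairs
                        dupes.append((keys[i + r], keys[j + c], float(sims[r, c])))
    return dupes


def check_new_image(h5_path: str, new_encoding: np.ndarray):
    """Check a single new encoding against the stored ones, batch by batch."""
    q = new_encoding.astype(np.float32)
    q /= np.linalg.norm(q)
    matches = []
    with h5py.File(h5_path, "r") as f:
        enc = f["encodings"]
        keys = [k.decode() for k in f["keys"][:]]
        for i in range(0, enc.shape[0], BATCH):
            sims = _normalize(enc[i:i + BATCH]) @ q
            for r in np.where(sims >= THRESHOLD)[0]:
                matches.append((keys[i + r], float(sims[r])))
    return matches
```

Peak memory would be two blocks of encodings plus one BATCH x BATCH score matrix, regardless of how many millions of images are in the file, and the single-image check never loads the full encoding map.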