idealo / imagededup

😎 Finding duplicate images made easy!
https://idealo.github.io/imagededup/
Apache License 2.0
5.18k stars 459 forks

Possible to de-dupe a library of 100-500 photos if they are not saved locally? #197

Closed jonathanrstern closed 1 year ago

jonathanrstern commented 1 year ago

All of my photos are saved on S3. Is there a way to use this library, or one like it, to identify duplicates?

tanujjain commented 1 year ago

Currently we don't provide such connectors.

Joshfindit commented 1 year ago

No matter what, you’ll need egress to read the photos, so you might as well download them locally for deduplication.

jonathanrstern commented 1 year ago

@Joshfindit

Let's say you have an app with 1,000 users. Each has 1,000 photos, some of which are duplicates. All are saved to S3.

How would you go about de-duping?

Joshfindit commented 1 year ago

I probably wouldn’t end up with that because I’d likely have built the app to store files on S3 with some sort of hash (my current go-to is <SHA256>.<size in bytes>).
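For anyone wanting to try that naming scheme, here is a minimal sketch of building a `<SHA256>.<size in bytes>` name for a local file. The function name `content_key` and the chunk size are my own assumptions, and whether the result is used as the S3 object key or stored as metadata is left open.

```python
import hashlib


def content_key(path, chunk_size=1 << 20):
    """Return '<sha256 hex digest>.<size in bytes>' for the file at `path`.

    `content_key` is a hypothetical helper name, not part of imagededup.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large photos are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return f"{digest.hexdigest()}.{size}"
```

Two uploads with identical bytes produce the same name, so a duplicate can be spotted before it is stored a second time. This only catches byte-identical files, not resized or re-encoded copies.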

If I walked into that as a project, I’d:

  1. Plan how user permissions and access will be handled for deduplicated files
  2. Rent a server with enough storage space for all the files
  3. Download and deduplicate in a single batch, tracking the filenames of the dupes
  4. Update the app code
  5. Use the dupe info to “deduplicate” directly on S3
  6. Run confirmation tests
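
Step 3 could look roughly like the sketch below, assuming the photos have already been downloaded into one local folder. It groups files by SHA-256 of their bytes, so it only finds exact copies; near-duplicates (resized, recompressed) would need perceptual hashing of the kind imagededup provides. The names `file_digest` and `find_exact_duplicates` are made up for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Map each kept file to the files whose bytes are identical to it.

    Paths are relative to `root`; the first path in sorted order is kept.
    Hypothetical helper, not part of imagededup.
    """
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path.relative_to(root).as_posix())
    # Only groups with more than one file contain duplicates.
    return {paths[0]: paths[1:] for paths in groups.values() if len(paths) > 1}
```

The returned mapping is the “dupe info” from step 5: for each kept file it lists the filenames that can be repointed to it and then removed from S3, once the permissions plan from step 1 is in place.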