qarmin / czkawka

Multi functional app to find duplicates, empty folders, similar images etc.

Use two photos as input to determine best algorithm to identify duplicates. #1199

Open janjuaasif opened 5 months ago

janjuaasif commented 5 months ago

The software did not identify the attached photos as duplicates. I tried different algorithms but could not get the software to recognize the photos as duplicates. It would be nice if a user had the ability to reference two photos and let the software recommend an algorithm. 20140921_155122 20140921_155122-edited

...

qarmin commented 5 months ago

All algorithms properly recognize these images as duplicates after rotating the first image.

The first image is a horizontal image that is rotated via its EXIF data, which is not yet supported in czkawka.

Screenshot from 2024-02-01 18-02-36
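For reference, JPEG files record display orientation in the EXIF `Orientation` tag, which takes values 1-8. A minimal sketch of the mapping (a hypothetical helper for illustration, not czkawka's code; values 2, 4, 5 and 7 additionally involve mirroring, so they are left out here):

```rust
// Hypothetical helper: map the EXIF Orientation tag (values 1-8) to the
// clockwise rotation needed to display the image correctly.
fn orientation_to_rotation_degrees(orientation: u16) -> Option<u32> {
    match orientation {
        1 => Some(0),   // normal
        3 => Some(180),
        6 => Some(90),  // rotate 90 degrees clockwise
        8 => Some(270), // rotate 270 degrees clockwise (90 counter-clockwise)
        _ => None,      // mirrored variants (2, 4, 5, 7) or invalid values
    }
}

fn main() {
    assert_eq!(orientation_to_rotation_degrees(6), Some(90));
    assert_eq!(orientation_to_rotation_degrees(2), None);
    println!("ok");
}
```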

janjuaasif commented 5 months ago

Do I have to manually rotate all images first to get the software to recognize duplicates? How do I get around this issue?

qarmin commented 5 months ago

Yes - at least for now.

Yesterday I tested automatic rotation based on EXIF data, which sometimes works, but not for these two images.

One image has broken EXIF data (exiftool shows the warning "Skipped unknown 7 bytes after JPEG APP1 Segment") and the library which I wanted to use, https://crates.io/crates/kamadak-exif, cannot read this data and shows the error "Truncated IFD", so currently I don't have access to any Rust library which supports such files.

radialmonster commented 4 months ago

An option you could offer is to have the program rotate each image to each of the 90-degree orientations (0, 90, 180, 270) and check all of them. That way it gets around the issue of invalid rotation data, since it just checks every rotation. This would increase the time to analyze the images by about 4x though, so you could make it an opt-in setting for users who want it.
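The proposal above can be sketched roughly like this (a toy illustration, not czkawka's actual hashing: the image is a square grayscale grid and `average_hash` is a deliberately simplified stand-in for a real perceptual hash):

```rust
// Rotate a square grayscale image 90 degrees clockwise.
fn rotate90(img: &Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let n = img.len();
    let mut out = vec![vec![0u8; n]; n];
    for r in 0..n {
        for c in 0..n {
            out[c][n - 1 - r] = img[r][c];
        }
    }
    out
}

// Toy average-hash: each pixel brighter than the mean sets one bit.
// (Rotation-sensitive; works for images up to 8x8 pixels here.)
fn average_hash(img: &Vec<Vec<u8>>) -> u64 {
    let n = img.len();
    let sum: u64 = img.iter().flatten().map(|&p| p as u64).sum();
    let mean = sum / (n * n) as u64;
    let mut hash = 0u64;
    for (i, &p) in img.iter().flatten().enumerate() {
        if (p as u64) > mean {
            hash |= 1 << i;
        }
    }
    hash
}

// Hash the image in all four 90-degree orientations.
fn all_rotation_hashes(img: &Vec<Vec<u8>>) -> [u64; 4] {
    let mut hashes = [0u64; 4];
    let mut cur = img.clone();
    for h in hashes.iter_mut() {
        *h = average_hash(&cur);
        cur = rotate90(&cur);
    }
    hashes
}

fn main() {
    let img = vec![
        vec![200, 10, 10],
        vec![200, 10, 10],
        vec![200, 200, 10],
    ];
    let rotated = rotate90(&img);
    // The rotated copy's plain hash differs from the original's, but it
    // appears among the original's four rotation hashes, so a rotated
    // duplicate would still be matched.
    assert_ne!(average_hash(&img), average_hash(&rotated));
    assert!(all_rotation_hashes(&img).contains(&average_hash(&rotated)));
}
```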

qarmin commented 4 months ago

Creating 4 hashes of 1 file may slow down the program by even more than 4 times.

Calculating image hashes and saving/loading the cache from a file would probably be ~3-4 times slower, but hash comparison would slow down even more, because the entire (now quite optimized) algorithm would need to be rewritten to handle more edge cases, which would result in much worse performance.
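One workaround that keeps the comparison stage untouched (a sketch of an assumed alternative, not something czkawka implements) is to store a single rotation-invariant canonical hash per image, e.g. the minimum over the four rotation hashes: hashing gets ~4x slower, but each image still has exactly one hash to compare and cache:

```rust
// Rotate a square grayscale image 90 degrees clockwise.
fn rotate90(img: &Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let n = img.len();
    let mut out = vec![vec![0u8; n]; n];
    for r in 0..n {
        for c in 0..n {
            out[c][n - 1 - r] = img[r][c];
        }
    }
    out
}

// Toy hash: pack the flattened pixels into a u64 (rotation-sensitive).
fn toy_hash(img: &Vec<Vec<u8>>) -> u64 {
    img.iter()
        .flatten()
        .fold(0u64, |acc, &p| (acc << 8) | p as u64)
}

// Canonical, rotation-invariant hash: the minimum hash over all four
// 90-degree rotations. Any rotation of the same image yields the same value.
fn canonical_hash(img: &Vec<Vec<u8>>) -> u64 {
    let mut cur = img.clone();
    let mut best = u64::MAX;
    for _ in 0..4 {
        best = best.min(toy_hash(&cur));
        cur = rotate90(&cur);
    }
    best
}

fn main() {
    let img = vec![vec![1, 2], vec![3, 4]];
    let rotated = rotate90(&img);
    // Plain hashes differ, but canonical hashes agree.
    assert_ne!(toy_hash(&img), toy_hash(&rotated));
    assert_eq!(canonical_hash(&img), canonical_hash(&rotated));
}
```

Note this only helps exact-hash matching: with perceptual hashes compared by Hamming-distance thresholds, a small pixel change can flip which rotation attains the minimum, so near-duplicates may pick different canonical rotations. That limitation hints at why the algorithm rewrite described above would be invasive.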