vyaslkv closed this issue 4 years ago
Is there any preprocessing step we could do so that I could find duplicates in the above case?
@vyaslkv I suggest you set the scores parameter of the find_duplicates function to True. This returns the hamming distances (for hashing) or cosine similarities (for CNN) between each image file and its duplicates. Once you have this information, you can tinker with the threshold parameter so that it accommodates your definition of duplicates.
Check out the documentation for this.
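As a sketch of the thresholding idea: assuming the scores=True output is a dict mapping each image to (score, filename) pairs (the shape shown later in this thread; the exact return type may differ, so check the imagededup docs), you can filter by a similarity cutoff like this:

```python
# Sketch: filtering a scores=True-style result by a cosine-similarity
# threshold. The dict shape here is an assumption based on the score
# tuples posted in this thread, not the library's guaranteed return type.

def filter_by_threshold(duplicates, threshold):
    """Keep only matches whose similarity score meets the threshold."""
    return {
        image: [(score, name) for score, name in matches if score >= threshold]
        for image, matches in duplicates.items()
    }

duplicates = {
    'query.jpg': [(0.91, 'a.jpg'), (0.89, 'eg.png'), (0.62, 'b.jpg')],
}
print(filter_by_threshold(duplicates, 0.85))
# -> {'query.jpg': [(0.91, 'a.jpg'), (0.89, 'eg.png')]}
```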
@tanujjain Thanks for the reply. I am already retrieving scores (for CNN), but some other images that are not duplicates are coming up with higher scores than the real match.
@vyaslkv Would it be possible to share some examples of images that turn up with a higher score than you expect?
(0.91523653, '19123694.jpg'), (0.9025427, '19381817.jpg'), (0.8944757, '21396687.jpg'), (0.8935419, 'eg.png')
These are the top results for the above image; eg.png is the real match (the one that should have matched).
And when the image below is the query, the duplicates come back as: (0.8899983, '21396620.jpg'), (0.8691303, '21396621.jpg'), (0.8638452, '19123671.jpg'), (0.84464777, '21396667.jpg'), (0.8376537, '21396742.jpg'), (0.8362152, '21396687.jpg'), (0.831527, '21396633.jpg'), (0.83019626, 'eg1.png')
The real match only comes in 8th place.
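To make the ranking concrete, here is a small illustrative snippet (not part of imagededup) that checks where the expected match lands among the scores posted above:

```python
# The (score, filename) pairs reported above, already in descending
# score order as returned for the query image.
results = [
    (0.8899983, '21396620.jpg'), (0.8691303, '21396621.jpg'),
    (0.8638452, '19123671.jpg'), (0.84464777, '21396667.jpg'),
    (0.8376537, '21396742.jpg'), (0.8362152, '21396687.jpg'),
    (0.831527, '21396633.jpg'), (0.83019626, 'eg1.png'),
]

# 1-based rank of the expected match among the returned duplicates.
rank = 1 + [name for _, name in results].index('eg1.png')
print(rank)  # -> 8
```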
This behaviour comes from the restricted abilities of the pre-trained CNN. Looking at the problem you're trying to solve, hashing algorithms are likely to work better (i.e., the closest images should turn out to be what you expect). I ran deduplication on all the images you posted using phash with a max_distance_threshold of 20, and the two images from the last comment ('In Gabriel synthesis ...') are found to be top duplicates of each other, with a hamming distance of 18.
From an algorithmic perspective, the result also makes sense, since hashing algorithms specifically target outlines, whereas CNNs also pay attention to texture information and other finer patterns.
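For reference, the hamming distance between two 64-bit perceptual hashes (such as phash output) is just the number of differing bits. A minimal sketch, using made-up hex hash strings (the real hashes and the exact encoding come from the library):

```python
def hamming_distance(hash1, hash2):
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash1, 16) ^ int(hash2, 16)).count('1')

# Hypothetical 16-hex-digit (64-bit) phash strings, differing in one bit.
h1 = '9f1726d5c5b1a3e0'
h2 = '9f1726d5c5b1a3e1'

d = hamming_distance(h1, h2)
print(d)  # -> 1
# With max_distance_threshold=20, any pair at distance <= 20 is a duplicate.
print(d <= 20)  # -> True
```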
Any further improvements in this direction would require adding more algorithms, which is not planned at the moment.
Hope this addresses your problem. Closing the issue for now.