vyaslkv closed this issue 4 years ago
Is there any preprocessing step we could do so that I could find duplicates in the above case?
@vyaslkv I suggest you set the scores parameter of the find_duplicates function to True. This returns the hamming distances (for hashing) or cosine similarities (for CNN) between each image file and its duplicates. Once you have this information, you can tinker with the threshold parameter so that it accommodates your definition of duplicates.
Check out the documentation for this.
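As a sketch of the thresholding idea: assuming the scores=True output is a dict mapping each image to (score, filename) pairs (the shape shown later in this thread; the exact return type may differ, so check the imagededup docs), you can filter by a similarity cutoff like this:

```python
# Sketch: filtering a scores=True-style result by a cosine-similarity
# threshold. The dict shape here is an assumption based on the score
# tuples posted in this thread, not the library's guaranteed return type.

def filter_by_threshold(duplicates, threshold):
    """Keep only matches whose similarity score meets the threshold."""
    return {
        image: [(score, name) for score, name in matches if score >= threshold]
        for image, matches in duplicates.items()
    }

duplicates = {
    'query.jpg': [(0.91, 'a.jpg'), (0.89, 'eg.png'), (0.62, 'b.jpg')],
}
print(filter_by_threshold(duplicates, 0.85))
# -> {'query.jpg': [(0.91, 'a.jpg'), (0.89, 'eg.png')]}
```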
@tanujjain Thanks for the reply. I am already retrieving scores (for CNN), but some other images that are not duplicates are coming up with higher scores than the real match.
@vyaslkv Would it be possible to share some examples of images that turn up with a higher score than you expect?
(0.91523653, '19123694.jpg'), (0.9025427, '19381817.jpg'), (0.8944757, '21396687.jpg'), (0.8935419, 'eg.png')
These are the top results for the above image; eg.png is the real match (the one that should have matched).
And when the image below is the query, the duplicates come back as: (0.8899983, '21396620.jpg'), (0.8691303, '21396621.jpg'), (0.8638452, '19123671.jpg'), (0.84464777, '21396667.jpg'), (0.8376537, '21396742.jpg'), (0.8362152, '21396687.jpg'), (0.831527, '21396633.jpg'), (0.83019626, 'eg1.png')
The real match only comes in 8th place.
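To make the ranking concrete, here is a small illustrative snippet (not part of imagededup) that checks where the expected match lands among the scores posted above:

```python
# The (score, filename) pairs reported above, already in descending
# score order as returned for the query image.
results = [
    (0.8899983, '21396620.jpg'), (0.8691303, '21396621.jpg'),
    (0.8638452, '19123671.jpg'), (0.84464777, '21396667.jpg'),
    (0.8376537, '21396742.jpg'), (0.8362152, '21396687.jpg'),
    (0.831527, '21396633.jpg'), (0.83019626, 'eg1.png'),
]

# 1-based rank of the expected match among the returned duplicates.
rank = 1 + [name for _, name in results].index('eg1.png')
print(rank)  # -> 8
```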
This behaviour comes from the restricted abilities of the pre-trained CNN. Looking at the problem you're trying to solve, hashing algorithms are likely to work better (i.e., the closest images should turn out to be what you expect). I ran deduplication on all the images you posted using phash with a max_distance_threshold of 20, and the two images from the last comment ('In Gabriel synthesis ...') are found to be top duplicates of each other, with a hamming distance of 18.
From an algorithmic perspective, the result also makes sense, since hashing algorithms specifically target outlines, whereas CNNs also pay attention to texture information and other finer patterns.
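For reference, the hamming distance between two 64-bit perceptual hashes (such as phash output) is just the number of differing bits. A minimal sketch, using made-up hex hash strings (the real hashes and the exact encoding come from the library):

```python
def hamming_distance(hash1, hash2):
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash1, 16) ^ int(hash2, 16)).count('1')

# Hypothetical 16-hex-digit (64-bit) phash strings, differing in one bit.
h1 = '9f1726d5c5b1a3e0'
h2 = '9f1726d5c5b1a3e1'

d = hamming_distance(h1, h2)
print(d)  # -> 1
# With max_distance_threshold=20, any pair at distance <= 20 is a duplicate.
print(d <= 20)  # -> True
```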
Any further improvements in this direction would require adding more algorithms, which is not planned at the moment.
Hope this addresses your problem. Closing the issue for now.