idealo / imagededup

😎 Finding duplicate images made easy!
https://idealo.github.io/imagededup/
Apache License 2.0

MIN_similarity_thresholds #198

Open wangat opened 1 year ago

wangat commented 1 year ago

Thank you for your code. I'm comparing data cleansing libraries such as imagedups, fastdup, and imagededup. While testing imagededup, I tried the various hash methods to determine thresholds for different data. However, when testing the cnn method, I ran into some problems. Because my torchvision version is older, I did not use vit or efficientnet and instead used the default mobilenetv3. I set min_similarity_threshold to different values, ranging from 0.1 to 0.9, and no duplicate images were found (even for exact copies of the same image, or for duplicates that the hash methods did find); a rough sketch of the calls I made is below. When I then set the threshold to a negative value, the scores were generally on the order of 1e-5. The cnn method is also particularly slow.
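For reference, this is roughly what I ran (a minimal sketch: the image directory is a placeholder and the loop over thresholds paraphrases my actual script):

```python
from imagededup.methods import CNN

cnn = CNN()  # older torchvision, so the default mobilenetv3 backbone is used

# Encode once, then sweep min_similarity_threshold over the same encodings
encodings = cnn.encode_images(image_dir='path/to/images')  # placeholder path

for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    duplicates = cnn.find_duplicates(
        encoding_map=encodings,
        min_similarity_threshold=threshold,
        scores=True,  # return (filename, score) pairs instead of bare filenames
    )
    # In my runs this printed 0 for every threshold
    print(threshold, sum(len(v) for v in duplicates.values()))
```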

I'm sorry that I don't have enough time to study the code in depth right now, but may I ask whether there is a wrong setting on my side? Is it possible that my pictures are too large for the model to distinguish? (2560×1440 / 1920×1080)

Thank you and look forward to your reply.

tanujjain commented 1 year ago

no duplicate images were found (even for exact copies of the same image, or for duplicates that the hash methods did find)

This is quite unlikely. If the exact same image is used, the same encodings would be generated and the similarity score would be 1.0. Could you try to reproduce the issue with some of the pictures used for testing the package?
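If it helps, a quick way to sanity-check the encodings directly (a minimal sketch with a placeholder directory and hypothetical filenames; the cosine similarity is computed by hand instead of going through find_duplicates):

```python
import numpy as np
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir='path/to/test_images')  # placeholder path

# Compare the encodings of two files that are exact copies of each other
a = np.ravel(encodings['copy_1.jpg'])  # hypothetical filename
b = np.ravel(encodings['copy_2.jpg'])  # hypothetical filename

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine)  # identical images should give 1.0 (up to floating point error)
```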

the cnn method is also particularly slow

That's expected, since the cnn method requires a forward pass through a deep learning model, which is much more computationally expensive than the hashing methods available in the package. It would be much quicker if you ran it on a GPU machine.
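As a quick check, you can confirm whether PyTorch actually sees a GPU on your machine (a minimal sketch; whether the package picks the GPU up automatically may depend on the installed version, so treat this only as a first diagnostic):

```python
import torch

# If this prints False, every forward pass runs on the CPU, which is slow for
# large image sets; the hashing methods are a much faster alternative there.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')
```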

Is it possible that my pictures are too large for the model to distinguish?

The preprocessing steps before feeding the image to the cnn include resizing, cropping, etc., depending on the cnn network itself. If the pictures are exactly the same, the same encodings should be generated, since the preprocessing module receives the same input. However, if the images are quite different, it's possible that for large pictures the preprocessing steps cut out significant information, and hence the images produce wildly different encodings. For context, the mobilenetv3 used in the package was pretrained on the ImageNet-1K dataset, whose images have a much lower resolution than the ones you are dealing with, so some performance degradation can be expected.
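To give a feel for how much gets thrown away (a sketch of the standard torchvision ImageNet recipe for mobilenet-class models; the exact preprocessing inside this package may differ):

```python
from PIL import Image
from torchvision import transforms

# Standard ImageNet-style preprocessing: a 2560x1440 photo is shrunk and
# center-cropped to 224x224, so most of the original fine detail is discarded.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('large_photo.jpg').convert('RGB')  # placeholder filename
x = preprocess(img)
print(x.shape)  # torch.Size([3, 224, 224])
```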