idealo / imagededup

😎 Finding duplicate images made easy!
https://idealo.github.io/imagededup/
Apache License 2.0

How does it perform with document images like invoices, receipts #85

Closed NISH1001 closed 4 years ago

NISH1001 commented 4 years ago

Thanks for this awesome (and seamless) module.

I want to know whether it performs equally well on document images. On going through the code (and the blog), it seems the DHasher rescales every image to (9, 8) and then generates the hash from a direct horizontal gradient. This seems to work better for images that consist of abstract entities like a person or a car, where fine details matter little (even after rescaling to such a small size, our eyes can still differentiate the resulting images).

However, for document images, even small differences in table structure or lines can put two documents in different categories, and those differences might be lost after rescaling.
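
To make the concern concrete, here is a minimal pure-Python sketch of a difference hash. It is not imagededup's implementation. The block-averaging `downscale`, the bit convention (right neighbour brighter than left), and the synthetic 72x64 "page" are all assumptions chosen for illustration, and the library's real resizing filter will behave somewhat differently.

```python
def downscale(pixels, width=9, height=8):
    """Block-average a grayscale image (list of rows) to height x width.

    Assumption: plain block averaging stands in for a real resize filter.
    """
    src_h, src_w = len(pixels), len(pixels[0])
    small = []
    for r in range(height):
        rows = pixels[r * src_h // height:(r + 1) * src_h // height]
        out_row = []
        for c in range(width):
            c0, c1 = c * src_w // width, (c + 1) * src_w // width
            block = [v for row in rows for v in row[c0:c1]]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def dhash_bits(pixels, width=9, height=8):
    """Return 64 bits: 1 where a cell's right neighbour is brighter."""
    small = downscale(pixels, width, height)
    return [int(row[c + 1] > row[c]) for row in small for c in range(width - 1)]


def hamming(a, b):
    """Count the positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def make_page():
    """Hypothetical document: white page with a dark header block."""
    page = [[255] * 72 for _ in range(64)]
    for r in range(16):
        for c in range(36):
            page[r][c] = 40
    return page


base = make_page()

# Variant 1: a one-pixel horizontal rule across the full page width.
with_h_rule = make_page()
with_h_rule[40] = [0] * 72

# Variant 2: a one-pixel vertical rule down the full page height.
with_v_rule = make_page()
for row in with_v_rule:
    row[60] = 0

print(hamming(dhash_bits(base), dhash_bits(with_h_rule)))  # 0
print(hamming(dhash_bits(base), dhash_bits(with_v_rule)))  # 8
```

In this toy setup the horizontal rule darkens every cell in its row by the same amount, so the horizontal gradient is unchanged and the hash is identical. The vertical rule flips only one bit per row, 8 of 64 in total.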

So, has evaluation been done on document images?

tanujjain commented 4 years ago

Detailed benchmarking on document images has not been done, but you can refer to the closed issue #80, where a similar question was asked and the documents were found to be duplicates with the default settings. In general, the CNN method is more robust to such variations than the hashing methods.

It would be great if you could share some of your results on your document corpus.
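
One way to gather such results is to run both methods on the same folder and list the files where they disagree. The sketch below assumes imagededup's `DHash` and `CNN` classes and their `find_duplicates` keyword arguments as I understand them. The helper names `disagreements` and `compare_methods` and the threshold values are illustrative choices, not recommendations from the maintainers.

```python
def disagreements(hash_dupes, cnn_dupes):
    """Return files whose duplicate lists differ between the two methods.

    Both arguments map a filename to a list of duplicate filenames.
    """
    report = {}
    for name in sorted(set(hash_dupes) | set(cnn_dupes)):
        by_hash = set(hash_dupes.get(name, []))
        by_cnn = set(cnn_dupes.get(name, []))
        if by_hash != by_cnn:
            report[name] = {
                "dhash_only": sorted(by_hash - by_cnn),
                "cnn_only": sorted(by_cnn - by_hash),
            }
    return report


def compare_methods(image_dir, max_distance=10, min_similarity=0.9):
    """Run DHash and CNN on image_dir and report where they disagree."""
    # Imported here so the pure helper above works without imagededup installed.
    from imagededup.methods import CNN, DHash

    hash_dupes = DHash().find_duplicates(
        image_dir=image_dir, max_distance_threshold=max_distance
    )
    cnn_dupes = CNN().find_duplicates(
        image_dir=image_dir, min_similarity_threshold=min_similarity
    )
    return disagreements(hash_dupes, cnn_dupes)
```

Calling `compare_methods("path/to/invoices")` (a placeholder path) on a hand-labelled sample would show which document pairs each method flags that the other does not.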