python-pillow / Pillow

Python Imaging Library (Fork)
https://python-pillow.org

Perceptual hash #3120

Open thorade opened 6 years ago

thorade commented 6 years ago

For finding duplicates, it would be nice if Pillow included a perceptual hashing algorithm: https://en.wikipedia.org/wiki/Perceptual_hashing

My real use case is that my photo collection often contains the same image in different resolutions, e.g. if I sent it via WhatsApp: once in full resolution from the camera, and once in reduced resolution from WhatsApp.

Here is a blog post describing a simple algorithm that also uses Pillow: https://www.safaribooksonline.com/blog/2013/11/26/image-hashing-with-python/ It would be more convenient to have a built-in function, though, ideally something like image.signature or image.phash, plus a function to calculate the similarity or distance between two or more images.

One library mentioned quite often in this context is phash: http://www.phash.org/ Maybe their algorithms could be reused?

If this is out of scope for Pillow, I would just use one of the projects providing phash Python bindings.

radarhere commented 6 years ago

Pillow actually already has code for comparing two images, used in the test suite. I've created a PR to move this into its own method. If you have any thoughts, they would be welcome.

thorade commented 6 years ago

Thanks for the answer; great to hear this function will be usable by end users, too.

My use case is described above: finding images that are identical except that they have been resized and possibly had a color filter applied. Based on what I found on the internet, I wrote this Jupyter notebook: https://github.com/thorade/jupyterNotebooks/blob/master/Pillow/dhash_hamming.ipynb It creates a hash from an image; similar images have similar hashes, and purely resized images have identical hashes.

In addition to the hash from image function, the Hamming distance function shown in the notebook is also helpful.
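
To make the idea concrete, here is a minimal sketch of a difference hash (dHash) and a Hamming distance using Pillow. It is an illustration only and not necessarily the code from the notebook; the function names `dhash` and `hamming_distance` and the `hash_size` parameter are assumptions chosen for this example.

```python
from PIL import Image


def dhash(image, hash_size=8):
    """Return a difference hash of `image` as an integer (hypothetical helper)."""
    # Grayscale, then shrink to (hash_size + 1) x hash_size pixels so that
    # every row yields hash_size comparisons between horizontal neighbours.
    width = hash_size + 1
    small = image.convert("L").resize((width, hash_size))
    pixels = small.tobytes()  # one byte per pixel, row by row, for mode "L"

    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * width + col]
            right = pixels[row * width + col + 1]
            # Append a 1 bit when brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")
```

With two images loaded via `Image.open`, `hamming_distance(dhash(im1), dhash(im2))` gives a small number for visually similar images and a larger one for unrelated images. Where to draw the line between "similar" and "different" is a judgment call that depends on the collection.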

hugovk commented 5 years ago

I've hesitated to merge PR #3254 because this feature request asked for a perceptual hashing function to compare images, which, if I understand correctly, is not quite the same thing as the average difference in the PR.

And adding code to the API means we need to maintain it, which is fine if it's useful, but less so if not.

Having said that, it's useful to us as we use it in our tests, so I think I've just answered my own question!

What do others think?
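
For contrast with a perceptual hash, an average-difference comparison can be sketched as follows. This is not the code from the PR, only an illustration of the general idea using `ImageChops.difference` and `ImageStat.Stat`; the function name `average_difference` is an assumption. Note that it requires both images to have the same size and mode, so unlike a hash it cannot compare a resized copy directly.

```python
from PIL import Image, ImageChops, ImageStat


def average_difference(im1, im2):
    """Mean absolute per-pixel difference, averaged over all bands (hypothetical helper)."""
    if im1.size != im2.size or im1.mode != im2.mode:
        raise ValueError("images must have the same size and mode")
    # Per-pixel absolute difference, band by band.
    diff = ImageChops.difference(im1, im2)
    # Stat.mean holds one average value per band.
    band_means = ImageStat.Stat(diff).mean
    return sum(band_means) / len(band_means)
```

A result of 0 means the images are pixel-identical; larger values mean larger average deviations.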

thorade commented 5 years ago

I believe it makes sense to do this in two steps:

This allows storing the hashes, and it should make it easier to compare the distances between multiple images. But I fully understand if this use case / workflow is out of scope for Pillow. For my personal needs, I just implemented this myself here (as linked previously, maybe it explains my ideas better): https://github.com/thorade/jupyterNotebooks/blob/master/Pillow/dhash_hamming.ipynb
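
The benefit of stored hashes can be illustrated with a small sketch: once each image has an integer hash, finding near-duplicates among many images needs only cheap bit operations and no image data. The function name `similar_pairs`, the `max_distance` threshold, and the file names and hash values below are all made up for illustration.

```python
from itertools import combinations


def similar_pairs(hashes, max_distance=4):
    """Return (name_a, name_b, distance) for every pair of stored hashes
    whose Hamming distance is at most max_distance (hypothetical helper)."""
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = bin(hash_a ^ hash_b).count("1")
        if distance <= max_distance:
            pairs.append((name_a, name_b, distance))
    return pairs


# Made-up 64-bit hashes, as they might be loaded from a file or database.
stored = {
    "camera.jpg": 0xF0F0F0F0F0F0F0F0,
    "whatsapp.jpg": 0xF0F0F0F0F0F0F0F1,
    "other.jpg": 0x0F0F0F0F0F0F0F0F,
}
print(similar_pairs(stored))  # [('camera.jpg', 'whatsapp.jpg', 1)]
```

Comparing every pair is quadratic in the number of images, which is fine for a personal photo collection but would need an index structure for very large sets.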

tanujjain commented 3 years ago

@thorade You can check out the Python package imagededup, which can find duplicates using perceptual hashing.