mlampros / OpenImageR

Image processing Toolkit in R
https://mlampros.github.io/OpenImageR/

pHash and distances #2

Closed kkamila closed 7 years ago

kkamila commented 7 years ago

Hi, I tried to use your library to get pHashes and calculate distances between photos. Unfortunately, I get totally different hashes and distances than in this example: http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash (with the Python library I get exactly the same values as in the example).

mlampros commented 7 years ago

Hello kkamila and I'm sorry for the delayed response,

I couldn't find the 'koala1.jpeg' from the blog, but I downloaded a similar image from the web. The phash values from the OpenImageR package and from the imagehash Python library are the following:

library(OpenImageR)
res_hash = phash(image_koala, hash_size = 8, highfreq_factor = 4, MODE = 'hash', resize = 'bilinear')
# res_hash = "3bfadf09e13042c9"

from PIL import Image
import imagehash
hash = imagehash.phash(Image.open(image_koala))
# hash = "3bfadfa0e13042d1"

The two pHashes differ in 4 places because the imagehash Python library (default values: hash_size = 8, highfreq_factor = 4) resizes the image with ANTIALIAS (a high-quality downsampling filter), whereas the OpenImageR package offers only the 'nearest' and 'bilinear' resize methods.

Please test it and let me know.

kkamila commented 7 years ago

Thank you. Is there any possibility you'll add this ANTIALIAS method to your package? Or is there any other way to calculate this in R? My main concern is that the Hamming distance / similarity score of two pHashes differs between your package and the Python package.

http://res.cloudinary.com/demo/image/upload/koala1.jpg "4bb3b541ebd5141a"
http://res.cloudinary.com/demo/image/upload/koala2.jpg "4bb3a541ebd614b2"

The two hashes differ in 4 of their 16 symbols, which gives a similarity score of 0.75, whereas the example (http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash) reports a similarity score of 0.96875. That is quite a big difference.
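The 0.75 score above comes from comparing the 16 hex symbols one by one. A score such as 0.96875 (62 of 64) cannot be produced from 16 symbols, which suggests it was counted over the 64 bits of the hash; that is an inference, not something stated in this thread. The following minimal Python sketch shows both ways of counting on the two hashes quoted above. The helper names `symbol_similarity` and `bit_similarity` are hypothetical and not part of OpenImageR or imagehash.

```python
def symbol_similarity(hash_a: str, hash_b: str) -> float:
    """Share of hex symbols that match, position by position."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    differing = sum(a != b for a, b in zip(hash_a, hash_b))
    return 1.0 - differing / len(hash_a)


def bit_similarity(hash_a: str, hash_b: str) -> float:
    """Share of bits that match; every hex symbol carries 4 bits."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree
    differing = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - differing / (4 * len(hash_a))


koala1 = "4bb3b541ebd5141a"
koala2 = "4bb3a541ebd614b2"
print(symbol_similarity(koala1, koala2))  # 0.75    (4 of 16 symbols differ)
print(bit_similarity(koala1, koala2))     # 0.90625 (6 of 64 bits differ)
```

Counting bits narrows the gap to the blog's value but does not close it, so the resize method is still a plausible source of the remaining difference.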

mlampros commented 7 years ago

kkamila,

I think that if you stick with one programming language (Ruby, Python or R) you will get comparable results. I downloaded the following images and then calculated the similarity between them in R. I don't know the parameter settings that the author used in the blog post, but I used hash_size = 8, highfreq_factor = 6, MODE = 'binary' and resize = 'bilinear'.

http://res.cloudinary.com/demo/image/upload/koala1.jpg
http://res.cloudinary.com/demo/image/upload/koala2.jpg
http://res.cloudinary.com/demo/image/upload/another_koala.jpg
http://res.cloudinary.com/demo/image/upload/woman1.jpg

library(OpenImageR)

# read images
image = readImage("koala1.jpg")
image2 = readImage("koala2.jpg")
image3 = readImage("another_koala.jpg")
image4 = readImage("woman1.jpg")

# normalized hamming distance (fraction of positions that differ)
ham_dist = function(x1, x2) {
  sum(x1 != x2) / length(x1)
}

# first convert all images to gray (2-dimensional)
image = rgb_2gray(image)
image2 = rgb_2gray(image2)
image3 = rgb_2gray(image3)
image4 = rgb_2gray(image4)

# calculate the binary values for each image
res_hash = phash(image, hash_size = 8, highfreq_factor = 6, MODE = 'binary', resize = 'bilinear')
res_hash2 = phash(image2, hash_size = 8, highfreq_factor = 6, MODE = 'binary', resize = 'bilinear')
res_hash3 = phash(image3, hash_size = 8, highfreq_factor = 6, MODE = 'binary', resize = 'bilinear')
res_hash4 = phash(image4, hash_size = 8, highfreq_factor = 6, MODE = 'binary', resize = 'bilinear')

# similarity of first with second image
similarity = 1.0 - ham_dist(as.vector(res_hash), as.vector(res_hash2))
# similarity = 0.96875

# similarity of first with third image
similarity1 = 1.0 - ham_dist(as.vector(res_hash), as.vector(res_hash3))
# similarity1 = 0.5625

# similarity of first with fourth image
similarity2 = 1.0 - ham_dist(as.vector(res_hash), as.vector(res_hash4))
# similarity2 = 0.53125
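For readers comparing these binary results with hex strings such as the ones quoted earlier in the thread, here is a rough Python sketch of how a 0/1 matrix relates to a hex hash and to a similarity score. It assumes the bits are packed row by row, most significant bit first; whether OpenImageR orders its bits this way is not confirmed in this thread. The names `bits_to_hex` and `bit_matrix_similarity`, and the toy checkerboard matrices, are hypothetical.

```python
def bits_to_hex(bits):
    """Pack a 2-D 0/1 matrix into a hex string, row by row (assumed order)."""
    flat = [int(b) for row in bits for b in row]
    if len(flat) % 4 != 0:
        raise ValueError("number of bits must be a multiple of 4")
    value = int("".join(str(b) for b in flat), 2)
    # zero-pad so that leading zero bits are not dropped
    return format(value, "0{}x".format(len(flat) // 4))


def bit_matrix_similarity(bits_a, bits_b):
    """1 minus the fraction of positions where the two matrices differ."""
    flat_a = [int(b) for row in bits_a for b in row]
    flat_b = [int(b) for row in bits_b for b in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same number of bits")
    differing = sum(a != b for a, b in zip(flat_a, flat_b))
    return 1.0 - differing / len(flat_a)


# Toy 8 x 8 matrices (not real image hashes): a checkerboard and a copy
# of it with two bits flipped.
hash_a = [[(r + c) % 2 for c in range(8)] for r in range(8)]
hash_b = [row[:] for row in hash_a]
hash_b[0][0] = 1 - hash_b[0][0]
hash_b[7][7] = 1 - hash_b[7][7]

print(bits_to_hex(hash_a))                    # 55aa55aa55aa55aa
print(bits_to_hex(hash_b))                    # d5aa55aa55aa55ab
print(bit_matrix_similarity(hash_a, hash_b))  # 0.96875
```

With a 64-bit hash, two differing bits give 1 - 2/64 = 0.96875, the same value as the first similarity in the R script.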

You can also consider the invariant_hash method, which additionally rotates, crops and flips the image and returns the min and max values of either the hamming or the levenshtein distance:

res1 = invariant_hash(image, image2, method = "phash", mode = "binary",
                      hash_size = 8, highfreq_factor = 6, resize = "bilinear",
                      flip = TRUE, rotate = TRUE, angle_bidirectional = 10, crop = TRUE)
res1

mlampros commented 7 years ago

kkamila,

can I close this issue?

kkamila commented 7 years ago

Hey, sorry for not answering for so long. The Python package uses hash_size = 8 and highfreq_factor = 4, as one can see here: https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py

My main issue is that I want to rewrite a few scripts from Python to R, and even though I know I can still use a system call to calculate pHashes in Python, I didn't want to do that. I wanted every line working in R.

So just tell me whether there is any possibility you'll add this ANTIALIAS method to your package, and I'll close the issue.

mlampros commented 7 years ago

kkamila,

I don't intend to implement the ANTIALIAS method in the near future.

kkamila commented 7 years ago

OK, thank you for your answers.

deann88 commented 4 years ago

I also want to rewrite some Python scripts fully in R and was wondering whether this ANTIALIAS method has been implemented, and whether there is a way to replicate the hashes from Python in R.

mlampros commented 4 years ago

@deann88,

the OpenImageR package offers only the 'nearest' and 'bilinear' resize methods, and I don't currently intend to implement any other method.

Now, in case you want to replicate the hashes from the Python code in R on your own, I think one option is to use the reticulate package to open and resize the image the same way the Python code does.

deann88 commented 4 years ago

@mlampros ,

yep, I am using reticulate right now. Thank you for your answer and package. I would have used it initially, however, we already have a database of hashes to compare against, so I cannot switch at the moment.

Thanks