Comparing heatmap images

janstrauss1 commented 5 years ago

Dear @mlampros,

I am keen to use OpenImageR to compare two gene expression heatmaps using image hashing. A similar question has been asked at https://stackoverflow.com/questions/43924587/cluster-find-similar-heatmap-figures-using-python.

Eventually, I would like to show the differences/similarities between heatmap images similar to the bar plot shown at https://www.kleemans.ch/comparing-images.

Is there a convenience function to do this with OpenImageR?

Many thanks in advance!

mlampros commented 5 years ago

hi @janstrauss1,

based on your information you have two gene expression heatmap images. You can perform image hashing using one of the functions in OpenImageR (as explained, for instance, for 'dhash' towards the end of my blog post). Depending on your parameter setting, you'll receive for each image either a binary numeric vector or a hash value, which you can then use to compare the two images (I think the binary vector is closer to what is done in the blog post you've mentioned). So I guess it's then a matter of calculating the difference between the two binary vectors to arrive at the bar plot. Let me know if that's what you are after.
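A minimal sketch of that comparison in base R, assuming two binary vectors of equal length such as those returned with MODE = 'binary'. The vectors bits_a and bits_b are made up for illustration; they are not dhash output.

```r
# Hypothetical 8-bit binary hashes (illustration only, not dhash output)
bits_a <- c(0, 1, 1, 0, 1, 0, 0, 1)
bits_b <- c(0, 1, 0, 0, 1, 1, 0, 1)

stopifnot(length(bits_a) == length(bits_b))

# Positions at which the two hashes disagree
differing <- which(bits_a != bits_b)

# Hamming distance: the number of differing bits
hamming <- length(differing)

differing  # 3 6
hamming    # 2
```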

janstrauss1 commented 5 years ago

Hi @mlampros,

yes, that's correct! I would like to calculate the differences between numeric vectors from dhash (or other image hashing functions of OpenImageR) and plot the results as a bar plot.

Thanks in advance for your help!

janstrauss1 commented 5 years ago

Hi @mlampros,

To illustrate what I'm trying to do: I use the dhash function as explained in the blog post to compare the attached heatmaps. All heatmaps have the same consistent row ordering.

After applying dhash, I create a matrix from the binary numeric vectors that I get and calculate the differences between them using the dist.binary function from the ade4 package. I've attached my code below.

From visual inspection of my initial heatmaps, I'd expect the heatmaps ht2 and ht3 to be most similar to each other. Surprisingly, however, the results that I get from dist.binary don't seem to reflect that.

Am I doing something wrong here?

Many thanks in advance for your help!

library(OpenImageR)

## read heatmap images
ht1_image = readImage('ht1.png')
ht2_image = readImage('ht2.png')
ht3_image = readImage('ht3.png')
ht4_image = readImage('ht4.png') 

## convert RGB image to Gray
ht1_image_gray = rgb_2gray(ht1_image)
ht2_image_gray = rgb_2gray(ht2_image)
ht3_image_gray = rgb_2gray(ht3_image) 
ht4_image_gray = rgb_2gray(ht4_image) 
# imageShow(ht1_image_gray)

## set hash size
hash_size <- 8 # default hash_size

dh_hash_ht1 = dhash(ht1_image_gray, hash_size = hash_size, MODE = 'hash', resize = "bilinear")
dh_hash_ht2 = dhash(ht2_image_gray, hash_size = hash_size, MODE = 'hash', resize = "bilinear")
dh_hash_ht3 = dhash(ht3_image_gray, hash_size = hash_size, MODE = 'hash', resize = "bilinear")
dh_hash_ht4 = dhash(ht4_image_gray, hash_size = hash_size, MODE = 'hash', resize = "bilinear")

dh_bin_ht1 = dhash(ht1_image_gray, hash_size = hash_size, MODE = 'binary', resize = "bilinear")
dh_bin_ht2 = dhash(ht2_image_gray, hash_size = hash_size, MODE = 'binary', resize = "bilinear")
dh_bin_ht3 = dhash(ht3_image_gray, hash_size = hash_size, MODE = 'binary', resize = "bilinear")
dh_bin_ht4 = dhash(ht4_image_gray, hash_size = hash_size, MODE = 'binary', resize = "bilinear")

dh_bin_ht1 <- as.vector(dh_bin_ht1)
dh_bin_ht2 <- as.vector(dh_bin_ht2)
dh_bin_ht3 <- as.vector(dh_bin_ht3)
dh_bin_ht4 <- as.vector(dh_bin_ht4)

## create matrix
my.mat <- rbind(dh_bin_ht1, dh_bin_ht2, dh_bin_ht3, dh_bin_ht4)

library(ade4, lib.loc = "~/Documents/Rpackages") ## to calculate similarity coefficients
## compute the distance matrix for binary data
dist.binary(df = my.mat,
            method = 6) # select Hamann coefficient

#            dh_bin_ht1 dh_bin_ht2 dh_bin_ht3 dh_bin_ht4
# dh_bin_ht1  0.0000000                                 
# dh_bin_ht2  0.2500000  0.0000000                      
# dh_bin_ht3  0.5303301  0.4677072  0.0000000           
# dh_bin_ht4  0.3952847  0.3952847  0.5000000  0.0000000

[Attached images: ht1, ht2, ht3, ht4]

mlampros commented 5 years ago

hi @janstrauss1,

thanks for the sample images, they were helpful. The author of the blog post that you've mentioned does not use any distance metric to calculate the bits difference; the figure he reports corresponds to the proportion of matching bits. Based on the data in his blog,


vec1 = c(0,0,1,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,0,1,1,0,1,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1,1,1,0)
vec2 = c(0,0,1,0,1,1,1,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1,1,1,0)

length(vec1) == length(vec2)

# proportion of matching bits, based on his calculation approximately: 96.875%

sum(diag(prop.table(table(vec1, vec2))))
[1] 0.96875
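One caveat with the table() approach, sketched below with made-up vectors: when one of the vectors contains only a single value, the contingency table is no longer 2 x 2 and diag() picks the wrong cell. Comparing the vectors element-wise with mean() gives the same proportion in the ordinary case and stays correct in this edge case. The names a and b are hypothetical.

```r
# Hypothetical vectors: 'a' contains only ones, so table(a, b) is 1 x 2
a <- c(1, 1, 1, 1)
b <- c(0, 1, 1, 1)

# Contingency-table approach: diag() takes the (a = 1, b = 0) cell
via_table <- sum(diag(prop.table(table(a, b))))

# Element-wise approach: share of positions where the bits agree
via_mean <- mean(a == b)

via_table  # 0.25 (misleading: 3 of 4 bits agree)
via_mean   # 0.75
```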

Another thing to consider is the hash_size. By default the dhash function uses a hash_size of 8. This value might be appropriate for regular images (like the koalas in the blog post) but is probably not for your heatmaps.

The OpenImageR::dhash() function uses either the 'nearest' or the 'bilinear' method to resize the input images. You've chosen the bilinear method in your example, which under the hood uses the Rcpp function resize_bilinear_rcpp. The Rcpp dhash function will resize your initial image of size c(3543, 1181, 4) to an image of size (hash_size, hash_size + 1) = (8, 9), and at this size your heatmaps will look the following way,


# example 
OpenImageR:::resize_bilinear_rcpp(ht1_image_gray, hash_size, hash_size + 1)

[Image: dhash_images_hash_size_8]
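To make the resize step concrete, here is a rough base-R sketch of how a difference hash can be derived from an already resized grayscale matrix with hash_size rows and hash_size + 1 columns. It follows the general dhash idea (compare each pixel with its horizontal neighbour) and is not OpenImageR's actual implementation; the helper name toy_dhash, the random input, and the direction of the comparison are assumptions.

```r
# Hypothetical helper: difference hash of an already resized grayscale matrix
toy_dhash <- function(small_img) {
  n_col <- ncol(small_img)
  # TRUE where a pixel is brighter than its left neighbour
  brighter <- small_img[, -1, drop = FALSE] > small_img[, -n_col, drop = FALSE]
  # Flatten to a 0/1 vector
  as.vector(brighter) * 1
}

hash_size <- 8
set.seed(1)
# Random stand-in for a heatmap resized to (hash_size, hash_size + 1)
small_img <- matrix(runif(hash_size * (hash_size + 1)), nrow = hash_size)

bits <- toy_dhash(small_img)
length(bits)  # 64, i.e. hash_size^2 bits
```

With hash_size = 32 the same construction yields 1024 bits, which is why a larger hash_size can preserve more of a heatmap's structure.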

So based on a hash_size of 8 you'll receive the following proportions of matching bits,


sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht2))))
[1] 0.96875
sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht3))))
[1] 0.859375
sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht4))))
[1] 0.921875
sum(diag(prop.table(table(dh_bin_ht2, dh_bin_ht3))))
[1] 0.890625
sum(diag(prop.table(table(dh_bin_ht2, dh_bin_ht4))))
[1] 0.921875
sum(diag(prop.table(table(dh_bin_ht3, dh_bin_ht4))))
[1] 0.875

The results are in accordance with your calculation based on the Hamann coefficient.
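The agreement can be checked by hand. The sketch below assumes that dist.binary returns a distance of the form d = sqrt(1 - s), as described in the ade4 documentation, and that the Hamann coefficient is s = (matches - mismatches) / n, which equals 2p - 1 when p is the proportion of matching bits. Under those assumptions the proportions above convert to the distances in the dist.binary output:

```r
# Proportions of matching bits for hash_size = 8 (values from the output above)
p <- c(ht1_ht2 = 0.96875, ht1_ht3 = 0.859375, ht1_ht4 = 0.921875,
       ht2_ht3 = 0.890625, ht2_ht4 = 0.921875, ht3_ht4 = 0.875)

# Hamann similarity: (matches - mismatches) / n = 2p - 1
s <- 2 * p - 1

# Distance of the form sqrt(1 - s), assumed to be what dist.binary returns
d <- sqrt(1 - s)

round(d, 7)
# ht1_ht2 0.2500000, ht1_ht3 0.5303301, ht1_ht4 0.3952847,
# ht2_ht3 0.4677072, ht2_ht4 0.3952847, ht3_ht4 0.5000000
```

Both measures therefore rank the pairs identically; the distance is a monotone transformation of the matching proportion.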

But if you change the hash_size to 32 you will receive the following images,

[Image: dhash_images_hash_size_32]

I guess, that with a size from (3543, 1181) down to (32, 33) you can visually differentiate between the 4 images and see the similarities of ht2 and ht3 (something which wasn't actually the case with a hash_size of 8). For this case you'll receive a better result for the pair (ht2, ht3) but again there is a small difference to the ht1 image,


sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht2))))
[1] 0.9726562
sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht3))))
[1] 0.9697266
sum(diag(prop.table(table(dh_bin_ht1, dh_bin_ht4))))
[1] 0.9658203
sum(diag(prop.table(table(dh_bin_ht2, dh_bin_ht3))))
[1] 0.9775391
sum(diag(prop.table(table(dh_bin_ht2, dh_bin_ht4))))
[1] 0.9736328
sum(diag(prop.table(table(dh_bin_ht3, dh_bin_ht4))))
[1] 0.96875
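For more than two images, the pairwise comparisons and a bar plot of the kind asked about at the start of the thread can be sketched in base R as follows. The helper pairwise_match and the random stand-in vectors are hypothetical; in practice the list would hold the vectors returned by dhash with MODE = 'binary'.

```r
# Hypothetical helper: proportion of matching bits for every pair in a named list
pairwise_match <- function(bit_list) {
  pairs <- utils::combn(names(bit_list), 2)
  res <- apply(pairs, 2, function(pr) mean(bit_list[[pr[1]]] == bit_list[[pr[2]]]))
  names(res) <- apply(pairs, 2, paste, collapse = " vs ")
  res
}

set.seed(42)
# Random stand-ins for the binary hashes (illustration only)
bit_list <- list(ht1 = rbinom(64, 1, 0.5),
                 ht2 = rbinom(64, 1, 0.5),
                 ht3 = rbinom(64, 1, 0.5),
                 ht4 = rbinom(64, 1, 0.5))

similarity <- pairwise_match(bit_list)

# One bar per image pair, as a percentage of matching bits
graphics::barplot(100 * similarity, las = 2, ylim = c(0, 100),
                  ylab = "matching bits (%)")
```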

I'm not an expert in gene expression heatmaps; I mostly wanted to find out whether the dhash function works properly, as it's a port of the ImageHash Python library.

I also computed the difference for the two koalas in the blog post using the Python script and the OpenImageR dhash function. In Python, based on the author's script, I received a difference of 3 out of 16 (for hash results), whereas with OpenImageR::dhash I received 4 out of 16. This probably also has to do with the fact that the author uses the PIL image library with Image.ANTIALIAS enabled, whereas I used the 'bilinear' method. I hope it helped.
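For results returned with MODE = 'hash', one way such an "out of 16" count could be obtained is to compare the two hash strings character by character. A minimal sketch, assuming both hashes are hexadecimal strings of equal length; the two strings below are invented and are not the koala hashes.

```r
# Hypothetical 16-character hex hashes (illustration only)
hash_a <- "0f1e2d3c4b5a6978"
hash_b <- "0f1e2d3c4b5a69ff"

chars_a <- strsplit(hash_a, "")[[1]]
chars_b <- strsplit(hash_b, "")[[1]]
stopifnot(length(chars_a) == length(chars_b))

# Number of positions at which the hex digits differ
n_diff <- sum(chars_a != chars_b)
n_diff  # 2 (out of 16)
```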

janstrauss1 commented 5 years ago

Hi @mlampros,

many thanks for your help and additional information! It helps a lot!

I was already playing around with the hash_size argument, but it wasn't clear to me yet how to set it properly.

Best, @janstrauss1

janstrauss1 commented 5 years ago

Hi @mlampros,

I'm trying to reproduce your plots of the resized images, but I am still unclear on how you plotted them to assess the effect of hash_size. Would you share the code you used to plot the image lists quoted below?

Also, I don't yet understand the 'nearest' and 'bilinear' methods for dhash well enough to make an informed choice for my case. Could you please point me in the right direction?

Thanks again for your great help! @janstrauss1

> The OpenImageR::dhash() function uses either the 'nearest' or the 'bilinear' method to resize the input images. You've chosen the bilinear method in your example images, which under the hood uses the Rcpp function resize_bilinear_rcpp. The Rcpp dhash function will resize your initial image of size c(3543, 1181, 4) to an image of size (hash_size, hash_size + 1) = (8,9), and based on this size your heatmaps will look the following way,
> # example
> OpenImageR:::resize_bilinear_rcpp(ht1_image_gray, hash_size, hash_size + 1)
> But if you change the hash_size to 32 you will receive the following images,
>
> I guess, that with a size from (3543, 1181) down to (32, 33) you can visually differentiate between the 4 images and see the similarities of ht2 and ht3 (something which wasn't actually the case with a hash_size of 8).

mlampros commented 5 years ago

hi @janstrauss1,

the code to reproduce the plots is the following (reference: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/),


h_ht1 = OpenImageR:::resize_bilinear_rcpp(ht1_image_gray, hash_size, hash_size + 1)
h_ht2 = OpenImageR:::resize_bilinear_rcpp(ht2_image_gray, hash_size, hash_size + 1)
h_ht3 = OpenImageR:::resize_bilinear_rcpp(ht3_image_gray, hash_size, hash_size + 1)
h_ht4 = OpenImageR:::resize_bilinear_rcpp(ht4_image_gray, hash_size, hash_size + 1)

plot_function = function(list_images, plot_rows, plot_cols) {

  # arrange the plots in a plot_rows x plot_cols grid
  graphics::par(mfrow = c(plot_rows, plot_cols))
  nams = names(list_images)
  invisible(lapply(seq_along(list_images), function(i) {
    img = list_images[[i]]
    # empty plot with a reversed y-axis, titled with the name of the list element
    graphics::plot(1:nrow(img), type = 'n', xlim = c(1, nrow(img)), ylim = c(nrow(img), 1), main = nams[i])
    # draw the normalized and flipped image
    graphics::rasterImage(OpenImageR::flipImage(OpenImageR::NormalizeObject(img)), nrow(img), ncol(img), 1, 1)
  }))
}

# Example (the list is named so that each panel gets a title):
plot_function(list_images = list(ht1 = h_ht1, ht2 = h_ht2, ht3 = h_ht3, ht4 = h_ht4), plot_rows = 2, plot_cols = 2)

Normally the bilinear method should give better results than nearest: the nearest method just removes rows and columns of the input image, whereas the bilinear method also performs interpolation (reference). In one of the references of my blog post about the OpenImageR package, a user has made a benchmark of various resizing methods and concluded that bicubic gave the best results.
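The difference between the two methods can be illustrated with a one-dimensional analogue in base R. This is a simplified sketch and not the resizing code used by OpenImageR: 'nearest' is mimicked by keeping every second sample, 'bilinear' by linear interpolation between neighbouring samples with stats::approx. The pixel values are made up.

```r
# A hypothetical row of 8 pixel intensities
row_px <- c(1, 9, 2, 8, 3, 7, 4, 6)

# 'Nearest'-style downsampling to 4 pixels: keep samples 1, 3, 5, 7, drop the rest
nearest <- row_px[seq(1, length(row_px), by = 2)]

# Linear interpolation at the midpoints 1.5, 3.5, 5.5, 7.5:
# every output pixel blends its two neighbours
linear <- stats::approx(seq_along(row_px), row_px,
                        xout = seq(1.5, 7.5, by = 2))$y

nearest  # 1 2 3 4
linear   # 5 5 5 5
```

The dropped samples (9, 8, 7, 6) have no influence on the 'nearest' result, whereas the interpolated values take them into account.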

stale[bot] commented 5 years ago

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

mlampros / OpenImageR

Comparing heatmap images #14