Clustering differences between heatmap.2 and pheatmap

raivokolde / pheatmap

Pretty heatmaps


Clustering differences between heatmap.2 and pheatmap #13

Open igordot opened 9 years ago

igordot commented 9 years ago

I just discovered pheatmap after using heatmap.2 for a while. Both tools let you specify the clustering settings. However, when I set those parameters to use the same algorithms, the resulting heatmaps do not look similar: the rows are clearly in a very different order. How can that be? Does pheatmap perform additional manipulations that heatmap.2 does not?

Example code:

# pheatmap
library(pheatmap)
pheatmap(vals, scale = "row", cluster_rows = TRUE, cluster_cols = TRUE,
         clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean",
         clustering_method = "complete", color = colors)

# heatmap.2 (from the gplots package)
library(gplots)
hclust_fun <- function(x) hclust(x, method = "complete")
dist_fun <- function(x) dist(x, method = "euclidean")
heatmap.2(as.matrix(vals), scale = "row", trace = "none", dendrogram = "both",
          Rowv = TRUE, Colv = TRUE, distfun = dist_fun, hclustfun = hclust_fun, col = colors)

raivokolde commented 9 years ago

I think it is the other way around: heatmap.2 applies some reordering to the dendrogram that pheatmap does not. Here is an excerpt from the heatmap.2 manual.

If either is a vector (of “weights”) then the appropriate dendrogram is reordered according to the supplied values subject to the constraints imposed by the dendrogram, by reorder(dd, Rowv), in the row case. If either is missing, as by default, then the ordering of the corresponding dendrogram is by the mean value of the rows/columns, i.e., in the case of rows, Rowv <- rowMeans(x, na.rm=na.rm). If either is NULL, no reordering will be done for the corresponding side.
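To make the excerpt concrete, here is a minimal base-R sketch of what that reordering step amounts to. It uses a made-up random matrix `x` (not the data from this issue) and calls reorder() on the dendrogram with the row means as weights, as the manual describes. The tree itself (which rows merge, and at what height) stays the same; only the left-to-right order of the leaves may change.

```r
# Toy data, purely illustrative (not the `vals` matrix from the question)
set.seed(1)
x <- matrix(rnorm(40), nrow = 8, dimnames = list(paste0("r", 1:8), NULL))

# Cluster the rows with the settings used in this thread
hc <- hclust(dist(x, method = "euclidean"), method = "complete")
dd <- as.dendrogram(hc)

# The reordering the manual describes for the default Rowv:
# leaves are rearranged by row means, within the constraints of the tree
dd_reordered <- reorder(dd, rowMeans(x))

order_plain <- order.dendrogram(dd)
order_reordered <- order.dendrogram(dd_reordered)

# Cophenetic distances encode the merge heights; if they agree,
# both dendrograms describe the same clustering
labs <- rownames(x)
coph_plain <- as.matrix(cophenetic(dd))[labs, labs]
coph_reordered <- as.matrix(cophenetic(dd_reordered))[labs, labs]
same_tree <- isTRUE(all.equal(coph_plain, coph_reordered))

print(rbind(plain = order_plain, reordered = order_reordered))
print(same_tree)
```

So hclustfun and distfun still decide the clustering; the reordering only picks one of the many valid leaf orders for that tree. This would account for flipped sub-clusters, but not for rows ending up in entirely different clusters.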

igordot commented 9 years ago

I saw that, but it's not entirely clear to me. Wouldn't hclustfun and distfun override that? By default (Rowv=T), it sounds like it would just sort by the mean values, which wouldn't make sense.

igordot commented 9 years ago

Actually, there is another way to cluster in heatmap.2:

distance.row <- dist(as.matrix(vals), method = "euclidean")
cluster.row <- hclust(distance.row, method = "complete")
distance.col <- dist(t(as.matrix(vals)), method = "euclidean")
cluster.col <- hclust(distance.col, method = "complete")
heatmap.2(as.matrix(vals), scale = "row", trace = "none", dendrogram = "both",
          Rowv = as.dendrogram(cluster.row), Colv = as.dendrogram(cluster.col), col = colors)

The order is now somewhat different from the original heatmap.2 output (it looks like certain sub-clusters are flipped), but overall it looks similar. It is still different from the pheatmap result.

federicomarini commented 8 years ago

Chiming in, since I noticed that behaviour too.

I think it is rather that heatmap.2 builds the dendrogram from the unscaled values and only then scales them for display, while pheatmap scales the values first and then clusters the scaled data. My personal impression is that the second approach is more intuitive, or at least it does not leave me confused when I cluster scaled data.
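The scale-then-cluster versus cluster-then-scale point can be checked without either plotting package. The sketch below uses a small hypothetical matrix (the row names and values are invented for illustration) and clusters it twice with base R: once on the raw values and once on row-wise z-scores. It only shows that the order of the two steps can change the tree; it does not reproduce the internals of heatmap.2 or pheatmap.

```r
# Hypothetical toy matrix: rows a and b share one pattern, rows c and d the
# opposite pattern, but b and d are ten times larger in magnitude
x <- rbind(a = c(1, 2, 3, 4),
           b = c(10, 20, 30, 40),
           c = c(4, 3, 2, 1),
           d = c(40, 30, 20, 10))

# Row-wise z-scores: centre and scale each row
x_scaled <- t(scale(t(x)))

# Same distance and linkage in both cases; only the input differs
hc_raw <- hclust(dist(x, method = "euclidean"), method = "complete")
hc_scaled <- hclust(dist(x_scaled, method = "euclidean"), method = "complete")

# Cut each tree into two groups and compare the memberships
groups_raw <- cutree(hc_raw, k = 2)
groups_scaled <- cutree(hc_scaled, k = 2)

# Raw values group by magnitude (a with c, b with d);
# scaled values group by pattern (a with b, c with d)
print(rbind(raw = groups_raw, scaled = groups_scaled))
```

If this is what causes the mismatch, one way to get comparable plots would be to scale the matrix yourself, cluster it once, and pass the same result to both functions with scale = "none" (pheatmap accepts an hclust object for cluster_rows and cluster_cols, and heatmap.2 accepts a dendrogram for Rowv and Colv). The orientation of individual branches may still differ between the two plots.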