mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.43k stars 1.91k forks source link

Clustermap dendrograms: intersecting lines #356

Closed cel4 closed 9 years ago

cel4 commented 9 years ago

It appears that for me the painting of the dendrogram sometimes is wrong (see attached image). Can you reproduce this issue?

import numpy as np
import seaborn as sns
np.random.seed(98)
x = np.random.rand(10,10)
sns.clustermap(x)

sns_clustermap

mwaskom commented 9 years ago

@olgabot ?

olgabot commented 9 years ago

I've seen that before, it's some weird issue with the way that scipy represents the dendrogram positions. @mlovci says its a scipy bug, that if you have certain figure sizes, the dendrogram gets resized weirdly. I'll look into this.

mwaskom commented 9 years ago

Might be a scipy bug, but I am able to reproduce it on my system which should use fastcluster?

mwaskom commented 9 years ago

I haven't tested this very broadly but it seems like the overlaps are eliminated with using method="average" rather than the default "median"?

olgabot commented 9 years ago

Huh that's bizarre that using different linkage methods makes it better. The default is median because it uses a vectorized version of fastcluster, but if it's doing that weird stuff it may not be worth it.

On Sat, Nov 8, 2014, 13:33 Michael Waskom notifications@github.com wrote:

I haven't tested this very broadly but it seems like the overlaps are eliminated with using method="average" rather than the default "median"?

— Reply to this email directly or view it on GitHub https://github.com/mwaskom/seaborn/issues/356#issuecomment-62270041.

mwaskom commented 9 years ago

FWIW method="average" seems slighter faster, which tracks with my intuition about the relative speeds of parametric/nonparametric estimators.

mwaskom commented 9 years ago

I seem to see these crossings for median and centroid but not average, single, or weighted.

olgabot commented 9 years ago

tl;dr: I'm seeing the same thing when using the raw scipy dendrogram function (using x as defined in the original post). It seems the problem stems from the original linkage matrix, and I think using either average or weighted, as they do not cause these crossings and are robust to outliers as they are an average statistic of the cluster, should be the default (see linkage methods on scipy).

Here's the full assessment:

from scipy.cluster import hierarchy
row_linkage = hierarchy.linkage(x, method='centroid')
sns.clustermap(x, row_linkage=row_linkage)

image

And I can confirm this with the raw scipy.cluster.hierarchy.dendrogram

hierarchy.dendrogram(row_linkage);

image

It seems as though the issue occurs at the linkage calculation step, not in scipy.cluster.hierarchy.dendrogram, because if you compare the scipy.cluster.hierarchy.inconstency metrics of centroid and average methods, you get

average_row_linkage = hierarchy.linkage(x, method='average')
centroid_row_linkage = hierarchy.linkage(x, method='centroid')
average_row_linkage
array([[  0.        ,   5.        ,   0.70612158,   2.        ],
       [  4.        ,  10.        ,   0.88539872,   3.        ],
       [  6.        ,   7.        ,   0.91576681,   2.        ],
       [  2.        ,   8.        ,   1.08650072,   2.        ],
       [  3.        ,  12.        ,   1.14157164,   3.        ],
       [  1.        ,  14.        ,   1.19329451,   4.        ],
       [ 13.        ,  15.        ,   1.23056439,   6.        ],
       [ 11.        ,  16.        ,   1.32367191,   9.        ],
       [  9.        ,  17.        ,   1.36049256,  10.        ]])
centroid_row_linkage
array([[  0.        ,   5.        ,   0.70612158,   2.        ],
       [  4.        ,  10.        ,   0.81357484,   3.        ],
       [  6.        ,   7.        ,   0.91576681,   2.        ],
       [ 11.        ,  12.        ,   0.99332674,   5.        ],
       [  8.        ,  13.        ,   0.95769878,   6.        ],
       [  3.        ,  14.        ,   1.05436388,   7.        ],
       [  2.        ,  15.        ,   1.05850306,   8.        ],
       [  9.        ,  16.        ,   1.14235696,   9.        ],
       [  1.        ,  17.        ,   1.11481294,  10.        ]])

The linkage matrices are of the format where:

To me, this looks like the centroid method "jumps the gun" by merging already-created clusters 11 and 12 in the 4th row (i=3, 0-based), while the average method keeps merging leaves until it runs out.

And if you calculate the scipy.cluster.hierarchy.inconsistent coefficient between them, which sees how different the height of a cluster is, vs other clusters that join the same number of leaves, is.

hierarchy.inconsistent(average_row_linkage)
array([[ 0.70612158,  0.        ,  1.        ,  0.        ],
       [ 0.79576015,  0.12676808,  2.        ,  0.70710678],
       [ 0.91576681,  0.        ,  1.        ,  0.        ],
       [ 1.08650072,  0.        ,  1.        ,  0.        ],
       [ 1.02866923,  0.15966812,  2.        ,  0.70710678],
       [ 1.16743308,  0.0365736 ,  2.        ,  0.70710678],
       [ 1.17011987,  0.07477555,  3.        ,  0.80834594],
       [ 1.146545  ,  0.23090104,  3.        ,  0.76711177],
       [ 1.34208223,  0.02603613,  2.        ,  0.70710678]])

Rounding done for easier visualization of array rows.

np.round(hierarchy.inconsistent(centroid_row_linkage), decimals=3)
array([[ 0.706,  0.   ,  1.   ,  0.   ],
       [ 0.76 ,  0.076,  2.   ,  0.707],
       [ 0.916,  0.   ,  1.   ,  0.   ],
       [ 0.908,  0.09 ,  3.   ,  0.951],
       [ 0.976,  0.025,  2.   , -0.707],
       [ 1.006,  0.068,  2.   ,  0.707],
       [ 1.056,  0.003,  2.   ,  0.707],
       [ 1.1  ,  0.059,  2.   ,  0.707],
       [ 1.129,  0.019,  2.   , -0.707]])

Where the matrix is structure as,

The inconsistency coefficient is calculated as:

We see that the average method has only positive distances, and the centroid method has a couple negative distances, which are probably creating the crossings. The negative distance means that the clusters are closer than the mean heights.

I looked into other linkage methods, and calculated some of the inconsistency coefficients by hand in this notebook, and from there, I think that using average or weighted, which don't have these inconsistency coefficient problems, would be best.

EDIT: Fixed fixed-with formatting errors.

mwaskom commented 9 years ago

Cool, thanks for giving this such a thorough look.

In trying to figure out what was going on here by myself, I noticed that the default for scipy.cluster.linkage is single. Do you think it's worth just keeping the scipy default so that people who are transitioning from using scipy directly don't get surprised?

olgabot commented 9 years ago

Single linkage isn't a very good method, since it takes the minimum distance between clusters and thus is sensitive to outliers. Complete linkage takes the maximum distance and again is sensitive to outliers. The "standard" algorithms I've seen used in papers is either UPGMA (average linkage) and WPGMA (weighted linkage) so I think those are best as defaults, since they'll work robustly in many datasets. I've also been laughed at for using single linkage so it may be the emotional scars talking.


Olga Botvinnik PhD Program in Bioinformatics and Systems Biology Gene Yeo Laboratory http://yeolab.ucsd.edu/yeolab/Home.html | Sanford Consortium for Regenerative Medicine University of California, San Diego www http://olgabotvinnik.com | blog http://blog.olgabotvinnik.com/ | github http://github.com/olgabot | twitter http://twitter.com/olgabot | linkedin http://www.linkedin.com/in/olgabotvinnik

2014-11-10 10:47 GMT-05:00 Michael Waskom notifications@github.com:

Cool, thanks for giving this such a thorough look.

In trying to figure out what was going on here by myself, I noticed that the default for scipy.cluster.linkage is single. Do you think it's worth just keeping the scipy default so that people who are transitioning from using scipy directly don't get surprised?

— Reply to this email directly or view it on GitHub https://github.com/mwaskom/seaborn/issues/356#issuecomment-62403225.

mwaskom commented 9 years ago

OK, let's go with average then.

cel4 commented 9 years ago

The problem could be that all leaves in the dendrogram start at the same level. To me this sounds like a strong assumption which can or cannot be supported by the different linkage methods. That's just a feeling, though. I'll have to think about it a little bit when I find some time.

olgabot commented 9 years ago

In any case, it's a bug in how scipy creates dendrograms, not how they're plotted.


Olga Botvinnik PhD Program in Bioinformatics and Systems Biology Gene Yeo Laboratory http://yeolab.ucsd.edu/yeolab/Home.html | Sanford Consortium for Regenerative Medicine University of California, San Diego www http://olgabotvinnik.com | blog http://blog.olgabotvinnik.com/ | github http://github.com/olgabot | twitter http://twitter.com/olgabot | linkedin http://www.linkedin.com/in/olgabotvinnik

2014-11-10 13:42 GMT-05:00 cel4 notifications@github.com:

The problem could be that all leaves in the dendrogram start at the same level. To me this sounds like a strong assumption which can or cannot be supported by the different linkage methods. That's just a feeling, though. I'll have to think about it a little bit when I find some time.

— Reply to this email directly or view it on GitHub https://github.com/mwaskom/seaborn/issues/356#issuecomment-62432200.

cel4 commented 9 years ago

It's actually documented here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html

The dendrogram illustrates how each cluster is composed by drawing a U-shaped link between a non-singleton cluster and its children. The height of the top of the U-link is the distance between its children clusters. It is also the cophenetic distance between original observations in the two children clusters. It is expected that the distances in Z[:,2] be monotonic, otherwise crossings appear in the dendrogram.

I don't think there is a bug in the linkage function, but the condition that the distances have to be monotonic only holds for some linkage methods.

olgabot commented 9 years ago

Huh interesting, nice find! Sounds like "average" is the way to go :)


Olga Botvinnik PhD Program in Bioinformatics and Systems Biology Gene Yeo Laboratory http://yeolab.ucsd.edu/yeolab/Home.html | Sanford Consortium for Regenerative Medicine University of California, San Diego www http://olgabotvinnik.com | blog http://blog.olgabotvinnik.com/ | github http://github.com/olgabot | twitter http://twitter.com/olgabot | linkedin http://www.linkedin.com/in/olgabotvinnik

2014-11-10 14:57 GMT-05:00 cel4 notifications@github.com:

It's actually documented here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html

The dendrogram illustrates how each cluster is composed by drawing a U-shaped link between a non-singleton cluster and its children. The height of the top of the U-link is the distance between its children clusters. It is also the cophenetic distance between original observations in the two children clusters. It is expected that the distances in Z[:,2] be monotonic, otherwise crossings appear in the dendrogram.

I don't think there is a bug in the linkage function, but the condition that the distances have to be monotonic only holds for some linkage methods.

— Reply to this email directly or view it on GitHub https://github.com/mwaskom/seaborn/issues/356#issuecomment-62444341.

cel4 commented 9 years ago

:+1: