mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.54k stars 1.92k forks source link

Feature Request: Optimal leaf ordering option for linkage in clustermap #1844

Open votti opened 5 years ago

votti commented 5 years ago

Enhancement suggestion: Adding optimal leaf ordering as an option for clustermap

Background: Many users are unaware that the leaf ordering in hierarchical clustering without any explicit leaf ordering is fairly arbitrary, as there are 2^(n-1) options to order a dendrogram. This often causes pattern across subtrees to be arbitrarily broken up, which can be a sub optimal data representation. The solution is to optimize the leaf ordering, such that the distance between successive leafs is as small as possible.

This has been recognized in the literature (https://academic.oup.com/bioinformatics/article/17/suppl_1/S22/261423) and also addressed in scipy (from v1.0, https://github.com/scipy/scipy/issues/5238, https://github.com/scipy/scipy/pull/7501) based on the polo implementation (https://github.com/adrianveres/polo). As far as I have experienced or could find out, this is not considered in the currently default used fastcluster package.

Apart from the additional computational cost, there seems to be no real downside of enabling optimal ordering. It just improves a previously random order, making clustered heatmaps smoother and the visualization more meaningful. I experienced several times in real world datasets that this can reveal significantly more structure in heatmaps.

Examples: https://github.com/adrianveres/polo/blob/master/data/demo.png http://www.cs.cmu.edu/~zivbj/compBio/ismb01/optimal.html

The additional computational cost is <1s for less than 512 leafs (https://github.com/adrianveres/polo/blob/master/data/bench.png), arguably not an issue for small visualizations.

While it is possible to just pre-calculate the optimal ordering linkage via scipy.cluster.hierarchy.linkage and the optimal_ordering argument separately, a lot of users are likely not even aware that optimal leaf ordering is something to consider. Thus, I would really argue that the benefit of having this more optimal clustering more easily available, e.g. via an 'order_dendrogram_optimal' option, would be beneficial for many and would justify the additional complexity of having an additional optional argument.

Thanks for your hard work and for considering this suggestion.

mwaskom commented 5 years ago

Apart from the additional computational cost

How much is this, in practice?

votti commented 5 years ago

@Benchmark: As the scipy implementation is a direct copy of the polo cython-code, I think their benchmark should still apply: image (from https://github.com/adrianveres/polo/blob/master/data/bench.png) -> <=1s for dendrograms with up to 2**9 = 512 leafs

This is consistent with my practical experience that it does not add a noticeable delay when plotting interactively for problems around this size.

It should also be easy to modify the polo benchmarking to use the current scipy implementation : https://github.com/adrianveres/polo/blob/7c1fe6893d4874a0097f10c42f4a86c5ca77388e/polo/test.py

Note that scipy also offers the optimal ordering as a standalone function, thus this could also be done optionally downstream of clustering via fastcluster: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.optimal_leaf_ordering.html#scipy.cluster.hierarchy.optimal_leaf_ordering

Edit: If you would consider including this feature, I am also happy to look into the implementation myself - however, this won't be before in 2 months.

mwaskom commented 5 years ago

I think it's a good idea, but handling the fact that it doesn't exist on our minimal supported version of scipy or in fastcluster will be annoying.

rgommers commented 4 years ago

@mwaskom https://github.com/scipy/scipy/issues/5238 is part of SciPy 1.0, which is your minimum version says setup.py.

@votti there's a very recent question at https://github.com/scipy/scipy/issues/5238#issuecomment-653271249 about making the SciPy implementation faster. If you have any thoughts on that, it would be great if you could comment there.

mwaskom commented 4 years ago

@mwaskom scipy/scipy#5238 is part of SciPy 1.0, which is your minimum version says setup.py.

Yep those have been bumped recently so we would be good to go on that front.

I did try to use this code recently for a plot I was making, and I found it to be rather slow in practice. This will feel like a double-hit to users if their main plot goes through fastcluster, and then by asking for optimal leaf ordering they a) have to use the scipy clustering implementation and b) add on the overhead of the optimal ordering.

But I continue to think it's a good idea in principle.

Remember that you can supply precomputed linkage matrices to clustermap...

michalkahle commented 3 years ago

I'd also plead for optimal ordering.