Open votti opened 5 years ago
Apart from the additional computational cost
How much is this, in practice?
@Benchmark:
As the scipy implementation is a direct copy of the polo
cython-code, I think their benchmark should still apply:
(from https://github.com/adrianveres/polo/blob/master/data/bench.png)
-> <=1s for dendrograms with up to 2**9 = 512 leafs
This is consistent with my practical experience that it does not add a noticeable delay when plotting interactively for problems around this size.
It should also be easy to modify the polo
benchmarking to use the current scipy
implementation : https://github.com/adrianveres/polo/blob/7c1fe6893d4874a0097f10c42f4a86c5ca77388e/polo/test.py
Note that scipy also offers the optimal ordering as a standalone function, thus this could also be done optionally downstream of clustering via fastcluster
:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.optimal_leaf_ordering.html#scipy.cluster.hierarchy.optimal_leaf_ordering
Edit: If you would consider including this feature, I am also happy to look into the implementation myself - however, this won't be before in 2 months.
I think it's a good idea, but handling the fact that it doesn't exist on our minimal supported version of scipy or in fastcluster will be annoying.
@mwaskom https://github.com/scipy/scipy/issues/5238 is part of SciPy 1.0, which is your minimum version says setup.py
.
@votti there's a very recent question at https://github.com/scipy/scipy/issues/5238#issuecomment-653271249 about making the SciPy implementation faster. If you have any thoughts on that, it would be great if you could comment there.
@mwaskom scipy/scipy#5238 is part of SciPy 1.0, which is your minimum version says setup.py.
Yep those have been bumped recently so we would be good to go on that front.
I did try to use this code recently for a plot I was making, and I found it to be rather slow in practice. This will feel like a double-hit to users if their main plot goes through fastcluster, and then by asking for optimal leaf ordering they a) have to use the scipy clustering implementation and b) add on the overhead of the optimal ordering.
But I continue to think it's a good idea in principle.
Remember that you can supply precomputed linkage matrices to clustermap...
I'd also plead for optimal ordering.
Enhancement suggestion: Adding optimal leaf ordering as an option for
clustermap
Background: Many users are unaware that the leaf ordering in hierarchical clustering without any explicit leaf ordering is fairly arbitrary, as there are 2^(n-1) options to order a dendrogram. This often causes pattern across subtrees to be arbitrarily broken up, which can be a sub optimal data representation. The solution is to optimize the leaf ordering, such that the distance between successive leafs is as small as possible.
This has been recognized in the literature (https://academic.oup.com/bioinformatics/article/17/suppl_1/S22/261423) and also addressed in scipy (from v1.0, https://github.com/scipy/scipy/issues/5238, https://github.com/scipy/scipy/pull/7501) based on the polo implementation (https://github.com/adrianveres/polo). As far as I have experienced or could find out, this is not considered in the currently default used
fastcluster
package.Apart from the additional computational cost, there seems to be no real downside of enabling optimal ordering. It just improves a previously random order, making clustered heatmaps smoother and the visualization more meaningful. I experienced several times in real world datasets that this can reveal significantly more structure in heatmaps.
Examples: https://github.com/adrianveres/polo/blob/master/data/demo.png http://www.cs.cmu.edu/~zivbj/compBio/ismb01/optimal.html
The additional computational cost is <1s for less than 512 leafs (https://github.com/adrianveres/polo/blob/master/data/bench.png), arguably not an issue for small visualizations.
While it is possible to just pre-calculate the optimal ordering linkage via
scipy.cluster.hierarchy.linkage
and theoptimal_ordering
argument separately, a lot of users are likely not even aware that optimal leaf ordering is something to consider. Thus, I would really argue that the benefit of having this more optimal clustering more easily available, e.g. via an 'order_dendrogram_optimal' option, would be beneficial for many and would justify the additional complexity of having an additional optional argument.Thanks for your hard work and for considering this suggestion.