scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License
891 stars 269 forks source link

add TreeNode.from_linkage/to_linkage methods #346

Open gregcaporaso opened 10 years ago

gregcaporaso commented 10 years ago

The scipy linkage object is returned by the scipy hierarchical clustering methods, so being able to load or generate these from TreeNode's provides the glue between scipy's hierarchical clustering and skbio's TreeNode.

@ebolyen is working on this. @wasade, would you be available to review?

gregcaporaso commented 10 years ago

And I think from_linkage is higher priority than to_linkage as we have an immediate use-case for the former. If the latter is easy, we should just do it now, but if it would hold up getting these changes into 0.1.0, we should do to_linkage at a later time.

gregcaporaso commented 10 years ago

Looks like we never did to_linkage. This would be really useful, as it'd let us plot with the SciPy dendrogram function, which has problems (#531) but is better than ascii_art. See this recipe for an example of why this would be good.

@ebolyen, would you be able to do this since you did the from_linkage method?

mortonjt commented 8 years ago

Adding the to_linkage method would be extremely helpful. For example, it would be really nice to pass a linkage matrix into seaborn.clustermap to align a tree with a heatmap.

jairideout commented 8 years ago

Agree, is anyone available to work on to_linkage?

mortonjt commented 8 years ago

I'll need this for a paper anyways, so I'll probably have something ready within a month.

jairideout commented 8 years ago

Awesome! Is it okay if I assign this issue to you?

mortonjt commented 8 years ago

Sure thing.

On Thu, Aug 11, 2016 at 3:42 PM, Jai Ram Rideout notifications@github.com wrote:

Awesome! Is it okay if I assign this issue to you?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore/scikit-bio/issues/346#issuecomment-239269210, or mute the thread https://github.com/notifications/unsubscribe-auth/AD_a3UOWPW5CeO5IcltUBJ475GbpTlW1ks5qe3sOgaJpZM4B67pq .

gregcaporaso commented 8 years ago

This will be really great to get in there.

mortonjt commented 7 years ago

Hi @gregcaporaso @jairideout . I've been looking at this more carefully.

It turns out that this is a really hard problem.

  1. The linkage format assumes balanced distances (i.e. the distance between children and their parents are equal), it is not possible to actually encode the branch lengths into the linkage matrix
  2. The encoding isn't very clear - but if I have to guess based on the source code, it looks like there is a reverse post-traversal going on (right node gets traversed first instead of the left node).

Sorry to be the bearer of bad news - but it looks like we would lose the branch length information if we cast to this object. But if we really wanted this, it is probably possible to do this cast by setting all of the distances to 1 or something. Even with this scheme, it is tricky to figure out how to get the right matrix, given the traversal issue.

ebolyen commented 7 years ago

@mortonjt, branch lengths can be encoded, and siblings do not have to have identical distances to their parents (except for tips), however the tree does have to be an ultrametric binary tree (strictly bifurcating with every tip being equidistant to the root).

The linkage isn't exactly a reversed depth-first search, it actually is ordered by shortest subtree (you should see distances increase as you traverse the linkage matrix as the distance recorded represents twice the tip to root so far, or the distance between two "clusters").

I think we should have some way of validating that a given tree is bifurcating and ultrametric, and raise an error otherwise.

mortonjt commented 7 years ago

@ebolyen you are right - this method will only work properly on ultrametric bifurcating trees. I didn't know about the shortest subtree ordering. Thanks!

mortonjt commented 6 years ago

This may of interest in the future -- particularly now that seaborn remains to be the best plotting program for dendrogram heatmaps.