gluc / data.tree

General Purpose Hierarchical Data Structure for R
http://gluc.github.io/data.tree
208 stars 40 forks source link

Slow conversion between data.tree and Newick format #92

Open cysouw opened 7 years ago

cysouw commented 7 years ago

Is there a way to speed up this conversion, specifically with ToNewick?

library(ape)
data(bird.families)
system.time(bf <- as.Node(bird.families))
system.time(n <- ToNewick(bf))
gluc commented 7 years ago

That is indeed a bit slow :-( Will look into it at some point, but no promises on when. Very open for pull requests though!

cysouw commented 7 years ago

Apparently, recursive programming is not ideal inside R :-(, e.g. http://dirk.eddelbuettel.com/blog/2011/09/08/#rcpp_for_recursion

So, there should be some workaround possible ("memoization") but that would sort-of destroy the whole approach of the data.tree package. So, pass everything to C?

gluc commented 7 years ago

As we are talking a few hundred nodes, recursion is not the problem. Rather, the current DefaultPlotHeight implementation is recursive, and on top it's called recursively by ToNewick. This, of course, is highly sub-optimal.

If you don't need the height in your Newick, the simplest solution is to avoid it, like so:

n <- ToNewick(bf, heightAttribute = NULL)

If you need the plot height, a workaround is to cache the plot height attribute on the nodes, like so:

library(ape)
library(data.tree)
data(bird.families)
bf <- as.Node(bird.families)

SetPlotHeight <- function(node, rootHeight = 100) {

  #traverse from leaves towards root to calculate the height and store it in height2
  #Note: cannot call it height, as that already exists on `Node`
  Set(node$leaves, height2 = 1)
  node$Do(function(x) x$height2 <- Aggregate(x, "height2", max) + 1, traversal = "post-order", filterFun = isNotLeaf)

  #traverse from root to leaves to calculate the plotHeight
  #(where plotHeight uses the semantics of dendrogram/phylo, where leaves should have height 0 and root height = rootHeight. This meaning is not the same as in the data.tree package)
  node$plotHeight <- rootHeight
  node$Do(function(x) x$plotHeight <- x$parent$plotHeight * (1 - 1 / x$height2), filterFun = isNotRoot)
}

SetPlotHeight(bf)

Then, instead of providing a heightAttribute function, provide the name of the cached attribute:

n <- ToNewick(bf, heightAttribute = "plotHeight")

I need to figure out whether there is a clean way to fix this for good in the DefaultPlotHeight function, thought I doubt it. At least it should be documented in the ToNewick, as.dendrogram.Node, and as.phylo.Node

geekyi commented 4 years ago

This is exactly what I was looking for!! This cut down the time to convert a tree with 5971 tips and 1631 internal nodes, from several(!!) minutes to just 2 secs!:

If you don't need the height in your Newick, the simplest solution is to avoid it

Perhaps this is the cause of slowdowns in other conversions like ToDataFrameTable as well? cf. this gist. I'm wondering about the slowness in conversion (of seconds or more) even after pruning to a depth of 2-6. Whereas just almost any other operation like leafCount for example is almost instant for the whole tree.

For context, I'm trying to convert a json file via data.tree to tidytree format to do some visualizations. Any other alternative would also be very helpful.

In the long term, I wonder how hard it would to implement some parts in rcpp? I'm hoping this can be my go to package for hierarchical data. There really is a lack of tools in R for stuff like json. I'd be willing to spend some time if needed to work on this. Since I really need a way to work with and visualize large json structures, any suggestions for alternatives is fine too..