joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

More pragmatic tree rooting upon import #597

Open defleury opened 8 years ago

defleury commented 8 years ago

When importing (unrooted) phylogenies, the current phyloseq tutorials and/or functions implement a very ad hoc rooting of the tree, placing the root at a randomly picked tip, like so:

tree.rooted <- root(tree.unrooted, sample(tree.unrooted$tip.labels, 1), resolve.root=T)

This can potentially lead to some issues, depending on tree structure etc., in particular when calculating UniFrac distances. I would suggest a very straightforward and pragmatic (and arguably heuristic and certainly not perfect) approach to autmatically pick an outgroup to which to root the tree:

root.by.outgroup <- function(tree.unrooted) {
  longest.edge <- which.max(tree.unrooted$edge.length)
  long.nodes <- tree.unrooted$edge[longest.edge,]     #this should usually include one tip
  new.outgroup <- long.nodes[long.nodes < Ntip(tree.unrooted)]
  tree.rooted <- root(tree.unrooted, outgroup=new.outgroup, resolve.root=T)
  tree.rooted
}

This will generally run very fast and provide a good guess about the outgroup. I am not a phylogenetics expert, but I think that placing the root at the longest individual branch should be a reasonable guess. Other approaches, like midpoint rooting, are computationally expensive (or in the case of the ape package, prohibitive) even for moderately large trees.

joey711 commented 8 years ago

Makes sense to me. I'll wait a bit for any additional comments or suggestions, but should be straightforward to include without much risk to downstream processes.

spholmes commented 8 years ago

This seems like a good compromise to me, of course, we actually do know what the outgroup is, but if we don't the long branch is usually a good bet.

Susan

On Tue, Apr 12, 2016 at 3:30 PM, Paul J. McMurdie notifications@github.com wrote:

Makes sense to me. I'll wait a bit for any additional comments or suggestions, but should be straightforward to include without much risk to downstream processes.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/joey711/phyloseq/issues/597#issuecomment-209129082

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

defleury commented 8 years ago

Dear Susan,

just on that last one: for the datasets that I work with, I usually don't know in advance what the outgroup in a tree of OTU representatives will be. Generally, I can't include an artificial outgroup during tree inference, because I do not know enough about my sequences/OTUs a priori. I guess that midpoint rooting would be the best solution in such cases, but as I said, it is far trickier to implement and and computationally prohibitive for large trees. Is one of you aware whether the exact effect of where a tree is rooted on UniFrac and other phylogenetic diversities has ever been tested? Theoretically, I can see the rooting have an impact, but I'm not sure how strong that effect would be in practice, anyway.

CarlyMuletzWolz commented 8 years ago

I spent some time thinking about this a couple years ago. Here is a post that may have some information of interest to you in the QIIME google forum https://groups.google.com/forum/#!topic/qiime-forum/rGXLnfnHbFA.

For one dataset I analyzed I found no difference in Unifrac inference whether I mid-point rooted or used an outgroup, but this is antedoctal and not replicated. I imagine that in some instances this would probably matter. As the post from Mike reads in that link above, he found different results depending on his choice.

My choice has been to do this for my work with 16s rRNA primers for bacteria in which I do pick up a few archaea here and there, but my outgroup is still always the longest branch: image

Nonetheless, I think there will be a challenge to finding a 'solve-all' for all trees. I think it is an excellent topic to discuss though.

joey711 commented 8 years ago

Thanks @CarlyRae that's helpful to hear.

And to be clear on the development side so that anyone else reading this post understands, we're only considering an improvement to the default behavior for rooting a tree when phyloseq detects that it is unrooted. Trees that are already rooted will not and should not be modified behind the user's back. If they've specified a root already, whether intentionally or by accident, it will be the root that is used. Therefore, anyone with a strong opinion about which branch should be considered the root need only assign it as root before handing it off to phyloseq, via the following ape one-liner:

rootedTree = ape::root(tree.unrooted, outgroup=new.outgroup, resolve.root=TRUE)
joey711 commented 6 years ago

After trying out the function in the OP, it had some bugs, especially if the longest branch was not among the terminal branches. Here is a revised version:

pick_new_outgroup <- function(tree.unrooted){
    require("magrittr")
    require("data.table")
    require("ape") # ape::Ntip
    # tablify parts of tree that we need.
    treeDT <- 
      cbind(
        data.table(tree.unrooted$edge),
        data.table(length = tree.unrooted$edge.length)
      )[1:Ntip(tree.unrooted)] %>% 
      cbind(data.table(id = tree.unrooted$tip.label))
    # Take the longest terminal branch as outgroup
    new.outgroup <- treeDT[which.max(length)]$id
    return(new.outgroup)
  }

It returns the string of the tip-label for the new proposed outgroup. It is then up to the user/process to take that result and do the next rooting step. This is a bit easier to test and maintain than also doing the rooting and opaquely returning the tree.

So if you wanted to use this now, you would use:

new.outgroup = pick_new_outgroup(tree.unrooted)
rootedTree = ape::root(tree.unrooted, outgroup=new.outgroup, resolve.root=TRUE)

Hope that helps. I'll probably de-magrittr the function for inclusion in phyloseq, but I'll try and get this included as the default operation when a function that requires a rooted tree encounters an unrooted tree.

grabear commented 6 years ago

root_by_longest_edge <- function(unrooted) {

longest.edge <- which.max(unrooted$edge.length)

long.nodes <- unrooted$edge[longest.edge,] #this should usually include one tip

tree.rooted <- phytools::reroot(tree = unrooted, node.number = longest.edge)

return(tree.rooted)

}

grabear commented 6 years ago

That was my solution. I'm not sure if it would have the same bugs or not

joey711 commented 6 years ago

Yes, it would have the same bugs, because ~half of the nodes are internal (don't include a tip).

Now, one might fairly wonder if the root node must have a tip in the first place. I can think of legitimate scenarios where you would not want that to be the case, and your function appears to work if there is not a tip requirement.