Open defleury opened 8 years ago
Makes sense to me. I'll wait a bit for any additional comments or suggestions, but should be straightforward to include without much risk to downstream processes.
This seems like a good compromise to me, of course, we actually do know what the outgroup is, but if we don't the long branch is usually a good bet.
Susan
On Tue, Apr 12, 2016 at 3:30 PM, Paul J. McMurdie notifications@github.com wrote:
Makes sense to me. I'll wait a bit for any additional comments or suggestions, but should be straightforward to include without much risk to downstream processes.
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/joey711/phyloseq/issues/597#issuecomment-209129082
Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
Dear Susan,
just on that last one: for the datasets that I work with, I usually don't know in advance what the outgroup in a tree of OTU representatives will be. Generally, I can't include an artificial outgroup during tree inference, because I do not know enough about my sequences/OTUs a priori. I guess that midpoint rooting would be the best solution in such cases, but as I said, it is far trickier to implement and and computationally prohibitive for large trees. Is one of you aware whether the exact effect of where a tree is rooted on UniFrac and other phylogenetic diversities has ever been tested? Theoretically, I can see the rooting have an impact, but I'm not sure how strong that effect would be in practice, anyway.
I spent some time thinking about this a couple years ago. Here is a post that may have some information of interest to you in the QIIME google forum https://groups.google.com/forum/#!topic/qiime-forum/rGXLnfnHbFA.
For one dataset I analyzed I found no difference in Unifrac inference whether I mid-point rooted or used an outgroup, but this is antedoctal and not replicated. I imagine that in some instances this would probably matter. As the post from Mike reads in that link above, he found different results depending on his choice.
My choice has been to do this for my work with 16s rRNA primers for bacteria in which I do pick up a few archaea here and there, but my outgroup is still always the longest branch:
Nonetheless, I think there will be a challenge to finding a 'solve-all' for all trees. I think it is an excellent topic to discuss though.
Thanks @CarlyRae that's helpful to hear.
And to be clear on the development side so that anyone else reading this post understands, we're only considering an improvement to the default behavior for rooting a tree when phyloseq detects that it is unrooted. Trees that are already rooted will not and should not be modified behind the user's back. If they've specified a root already, whether intentionally or by accident, it will be the root that is used. Therefore, anyone with a strong opinion about which branch should be considered the root need only assign it as root before handing it off to phyloseq, via the following ape one-liner:
rootedTree = ape::root(tree.unrooted, outgroup=new.outgroup, resolve.root=TRUE)
After trying out the function in the OP, it had some bugs, especially if the longest branch was not among the terminal branches. Here is a revised version:
pick_new_outgroup <- function(tree.unrooted){
require("magrittr")
require("data.table")
require("ape") # ape::Ntip
# tablify parts of tree that we need.
treeDT <-
cbind(
data.table(tree.unrooted$edge),
data.table(length = tree.unrooted$edge.length)
)[1:Ntip(tree.unrooted)] %>%
cbind(data.table(id = tree.unrooted$tip.label))
# Take the longest terminal branch as outgroup
new.outgroup <- treeDT[which.max(length)]$id
return(new.outgroup)
}
It returns the string of the tip-label for the new proposed outgroup. It is then up to the user/process to take that result and do the next rooting step. This is a bit easier to test and maintain than also doing the rooting and opaquely returning the tree.
So if you wanted to use this now, you would use:
new.outgroup = pick_new_outgroup(tree.unrooted)
rootedTree = ape::root(tree.unrooted, outgroup=new.outgroup, resolve.root=TRUE)
Hope that helps. I'll probably de-magrittr the function for inclusion in phyloseq, but I'll try and get this included as the default operation when a function that requires a rooted tree encounters an unrooted tree.
root_by_longest_edge <- function(unrooted) {
longest.edge <- which.max(unrooted$edge.length)
long.nodes <- unrooted$edge[longest.edge,] #this should usually include one tip
tree.rooted <- phytools::reroot(tree = unrooted, node.number = longest.edge)
return(tree.rooted)
}
That was my solution. I'm not sure if it would have the same bugs or not
Yes, it would have the same bugs, because ~half of the nodes are internal (don't include a tip).
Now, one might fairly wonder if the root node must have a tip in the first place. I can think of legitimate scenarios where you would not want that to be the case, and your function appears to work if there is not a tip requirement.
When importing (unrooted) phylogenies, the current phyloseq tutorials and/or functions implement a very ad hoc rooting of the tree, placing the root at a randomly picked tip, like so:
This can potentially lead to some issues, depending on tree structure etc., in particular when calculating UniFrac distances. I would suggest a very straightforward and pragmatic (and arguably heuristic and certainly not perfect) approach to autmatically pick an outgroup to which to root the tree:
This will generally run very fast and provide a good guess about the outgroup. I am not a phylogenetics expert, but I think that placing the root at the longest individual branch should be a reasonable guess. Other approaches, like midpoint rooting, are computationally expensive (or in the case of the
ape
package, prohibitive) even for moderately large trees.