Closed arendsee closed 6 years ago
thanks @arendsee
will have a look - we did do a recent overhaul of this function FYI https://github.com/ropensci/taxize/pull/634
cc @gedankenstuecke @trvinh any thoughts on this since you were involved in the redo of this fxn
Hey there, I think that's a cool idea, I'm not sure how easy it would be to implement it:
Right now the phylo object comes straight out of hclust and is then just converted into a phylo object. @trvinh Do you have experience in manipulating those?
Hi all,
regarding to phylo
object, as @gedankenstuecke said, is an output of as.phylo.hclust function. The input for as.phylo.hclust is a distance matrix, which contains only the tip labels and their corresponding distances. Therefore the phylo
object has no information for the internal node labels. Currently I have no idea how to include those names into phylo
object :-P
@arendsee could you tell me one situation where you require the node labels inside the phylo
object? If we need to get the common ancestor's name for a pair (or sub set) of species, I think we can parse it from the classification
of the class2tree function (classtree_obj$classification
in your example). If this is what you need, you can take a look at this function.
Secondly, what do you mean by "more direct way"? In this gist page I wrote an example for plotting an NCBI tree for a list of species. Is it what you want?
I want to be able to plot the distribution of homologs across a tree. The node labels are useful because they tell which clades have homologs. More broadly, I want to make a 'phylo' object from a subset the NCBI taxonomy common tree. Then I want to be able to perform operations on clades (for example, highlighting the subtree descending from some node).
I could parse the node names out of the classification
field, but doing so feels a bit hacky and the phylo objects already have a system for storing node names.
As for a more direct way, here is an implementation that just takes the union of the lineages (preserving node names but not calculating branch lengths):
as_tree <- function(lineages){
# get species names
tip.label <- unique(sapply(lineages, function(x) tail(x$name, 1)))
# get species taxon IDs
tip.id <- unique(sapply(lineages, function(x) tail(x$id, 1)))
# build edge list based on taxonomy IDs
edges <- unique(do.call(what=rbind,
lapply(lineages, function(x){
matrix(c(head(x$id, -1), tail(x$id, -1)), ncol=2)
})
))
# get node IDs
node.id <- setdiff(c(edges[,1], edges[,2]), tip.id)
# make a map from taxonomy ID to internal 1:n ids
idmap <- 1:(length(tip.label) + length(node.id))
names(idmap) <- c(tip.id, node.id)
# make a phylo object
tree <- list(
edges = matrix(c(idmap[edges[,1]], idmap[edges[,2]]), ncol=2),
tip.label = unname(tip.label),
node.label = unname(node.id),
Nnode = length(node.id)
)
class(tree) <- 'phylo'
tree
}
The phylo object created by the function I wrote above should have the same topology as the phylo object created by class2tree
. So we can port the node names over. However, this requires a bit of work, since the trees will not generally have the same node indices (e.g. rotating a branch will change node ids). But we can make a map between indices:
### First a few utility functions
# map names to tree indices
name2id <- function(tree, names){
match(names, c(tree$tip.label, tree$node.label))
}
# get the parents for a vector of nodes
parent <- function(tree, id){
tree$edge[match(id, tree$edge[, 2]), 1]
}
# get vector of logicals which are TRUE if the id is root
is_root <- function(tree, id){
!(id %in% tree$edge[, 2])
}
#' Map indices from a to b based off tip labels
#'
#' @param a phylo object
#' @param b phylo object
#' @return matrix mapping indices from `a` to `b`
map_ids <- function(a, b){
# TODO: assert a and b have the same topology
# start with map from the tips of `a` to those of `b`
idmaps <- list(matrix(
c(
name2id(a, a$tip.label),
name2id(b, a$tip.label)
), ncol=2))
# iteratively find the parents of nodes, until root is reached
while(TRUE){
last_a <- tail(idmaps, n=1)[[1]][, 1]
last_b <- tail(idmaps, n=1)[[1]][, 2]
if(!all(is_root(a, last_a))){
idmaps <- append(
idmaps,
list(unique(matrix(c(
parent(a, last_a),
parent(b, last_b)
), ncol=2)))
)
} else {
break
}
}
# merge the tables
idmap <- do.call(rbind, idmaps)
# get the unique rows that are not at root
idmap <- unique(idmap[!is.na(idmap[, 1]), ])
idmap
}
With this map, we can transfer node labels between trees:
#' Map nodes names from a to b
#'
#' The trees a and by must have the same topology and same tip labels.
#'
#' @param a phylo object with node labels
#' @param b phylo object
map_node_label <- function(a, b){
# get map of indices from a to b
idmap <- map_ids(a, b)
# get the number of tips
offset <- length(a$tip.label)
# get the node indices, these are above the tip indices
idmap <- idmap[idmap[, 1] > offset, ]
# rebase
idmap[, 1] <- idmap[, 1] - offset
idmap[, 2] <- idmap[, 2] - offset
# map names from a to b
b$node.label[idmap[, 2]] <- a$node.label[idmap[, 1]]
b
}
All of this assumes the trees have the same topology and tip labels. However, class2tree
resolves polytomies (nodes with multiple children), forcing binary trees. We can work around this by collapsing the zero-length branches with ape::di2multi
.
Putting all this together:
# borrowing the species list from the gist example
spnames <- c('Homo_sapiens','Pan_troglodytes','Macaca_mulatta','Mus_musculus','Rattus_norvegicus','Bos_taurus','Canis_lupus','Ornithorhynchus_anatinus','Xenopus_tropicalis','Takifugu_rubripes','Gallus_gallus','Ciona_intestinalis','Branchiostoma_floridae','Schistosoma_mansoni','Caenorhabditis_elegans','Anopheles_gambiae','Drosophila_melanogaster','Ixodes_scapularis','Ustilago_maydis','Neurospora_crassa','Monodelphis_domestica','Danio_rerio','Nematostella_vectensis','Cryptococcus_neoformans')
lineages <- taxize::classification(spnames, db='ncbi')
a <- as_tree(lineages)
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
b <- map_node_label(a, b)
This basically solves my problem. Though it is a bit circuitous.
thanks @arendsee having a look
@arendsee hmm, the example you gave above didn't work for me,
lineages <- taxize::classification(spnames, db='ncbi')
a <- as_tree(lineages)
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
map_node_label(a, b)
#> Error in b$node.label[idmap[, 2]] <- a$node.label[idmap[, 1]] :
#> NAs are not allowed in subscripted assignments
Damn, you're right. One second.
Ah, try this:
a <- ape::collapse.singles(as_tree(lineages))
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
map_node_label(a, b)
Although, the collapse.singles
function should probably be called up in the as_tree
function instead. Also, map_ids
should die if the inputs have different topologies.
Nice, now it works for me.
@arendsee so what are you thinking here? do you propose to modify class2tree
function with above changes? or is it a separate function? or something else?
@sckott Sorry for the delayed response. I've been preoccupied writing a paper. I should be done by the end of the week, then I'll work a bit on a solution.
bump @arendsee
@sckott Ah right. That done by the end of the week comment was totally wrong. I'll review the situation and get back with you later this weekend.
thanks!
@sckott Have you worked at all with Guangchuang Yu? He wrote a package tidytree that represents trees in a nice tabular fashion (full disclosure, I'm a (very) minor contributor). I think this representation may be easier to work with than phylo
. In my experience, programming with phylo
is a nightmare.
I could update the class2tree
with the functions I wrote above to map nodes between trees. But the code in my functions is such an kludgy mess that I hesitate to do so.
sorta, he has a few submissions in ropensci onboarding.
That sounds good to move to the tidytree format.
@arendsee any thoughts on your above statement?
I could update the class2tree with the functions I wrote above to map nodes between trees
closing due to long inactivity - please reopen though if anyone wants to address this
Session Info
```r R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Arch Linux Matrix products: default BLAS: /usr/lib/libblas.so.3.7.1 LAPACK: /usr/lib/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] magrittr_1.5 taxize_0.9.0.9214 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 codetools_0.2-15 lattice_0.20-35 [4] ape_5.0 zoo_1.8-0 reshape_0.8.7 [7] foreach_1.4.3 grid_3.4.2 crul_0.4.0 [10] R6_2.2.2 plyr_1.8.4 jsonlite_1.5 [13] nlme_3.1-131 httr_1.3.1 stringi_1.1.5 [16] reshape2_1.4.2 curl_3.0 data.table_1.10.4-3 [19] xml2_1.1.1 iterators_1.0.8 tools_3.4.2 [22] stringr_1.2.0 bold_0.5.0 parallel_3.4.2 [25] compiler_3.4.2 ```I am trying to get the NCBI tree for a vector of taxon IDs. I can get close to what I want with the following code:
However, in the resulting
phylo
object, node labels are missing. Since we are coming from the NCBI common tree, we do have names for all the internal nodes: e.g.Would it be possible to write these names into the node labels slot of the
phylo
objects?Alternatively, is there a more direct way to get the NCBI tree for a set of species?