ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize
Other
268 stars 60 forks source link

Node names for class2tree #644

Closed arendsee closed 6 years ago

arendsee commented 6 years ago
Session Info ```r R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Arch Linux Matrix products: default BLAS: /usr/lib/libblas.so.3.7.1 LAPACK: /usr/lib/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] magrittr_1.5 taxize_0.9.0.9214 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 codetools_0.2-15 lattice_0.20-35 [4] ape_5.0 zoo_1.8-0 reshape_0.8.7 [7] foreach_1.4.3 grid_3.4.2 crul_0.4.0 [10] R6_2.2.2 plyr_1.8.4 jsonlite_1.5 [13] nlme_3.1-131 httr_1.3.1 stringi_1.1.5 [16] reshape2_1.4.2 curl_3.0 data.table_1.10.4-3 [19] xml2_1.1.1 iterators_1.0.8 tools_3.4.2 [22] stringr_1.2.0 bold_0.5.0 parallel_3.4.2 [25] compiler_3.4.2 ```

I am trying to get the NCBI tree for a vector of taxon IDs. I can get close to what I want with the following code:

library(taxize)
lineages <- taxize::classification(c(9606, 9598, 9593), db='ncbi')
classtree_obj <- taxize::class2tree(lineages, check=FALSE)
phylo_obj <- classtree_obj$phylo

However, in the resulting phylo object, node labels are missing. Since we are coming from the NCBI common tree, we do have names for all the internal nodes: e.g.

> classification(9606, db='ncbi') 
$`9606`                           
                   name         rank      id                        
1    cellular organisms      no rank  131567                        
2             Eukaryota superkingdom    2759                        
3          Opisthokonta      no rank   33154                        
4               Metazoa      kingdom   33208                        
...                    
26           Catarrhini    parvorder    9526                        
27           Hominoidea  superfamily  314295                        
28            Hominidae       family    9604                        
29            Homininae    subfamily  207598                        
30                 Homo        genus    9605                        
31         Homo sapiens      species    9606                        

attr(,"class")                    
[1] "classification"              
attr(,"db")                       
[1] "ncbi"   

Would it be possible to write these names into the node labels slot of the phylo objects?

Alternatively, is there a more direct way to get the NCBI tree for a set of species?

sckott commented 6 years ago

thanks @arendsee

will have a look - we did do a recent overhaul of this function FYI https://github.com/ropensci/taxize/pull/634

cc @gedankenstuecke @trvinh any thoughts on this since you were involved in the redo of this fxn

gedankenstuecke commented 6 years ago

Hey there, I think that's a cool idea, I'm not sure how easy it would be to implement it:

Right now the phylo object comes straight out of hclust and is then just converted into a phylo object. @trvinh Do you have experience in manipulating those?

trvinh commented 6 years ago

Hi all, regarding to phylo object, as @gedankenstuecke said, is an output of as.phylo.hclust function. The input for as.phylo.hclust is a distance matrix, which contains only the tip labels and their corresponding distances. Therefore the phylo object has no information for the internal node labels. Currently I have no idea how to include those names into phylo object :-P

@arendsee could you tell me one situation where you require the node labels inside the phyloobject? If we need to get the common ancestor's name for a pair (or sub set) of species, I think we can parse it from the classification of the class2tree function (classtree_obj$classification in your example). If this is what you need, you can take a look at this function. Secondly, what do you mean by "more direct way"? In this gist page I wrote an example for plotting an NCBI tree for a list of species. Is it what you want?

arendsee commented 6 years ago

I want to be able to plot the distribution of homologs across a tree. The node labels are useful because they tell which clades have homologs. More broadly, I want to make a 'phylo' object from a subset the NCBI taxonomy common tree. Then I want to be able to perform operations on clades (for example, highlighting the subtree descending from some node).

I could parse the node names out of the classification field, but doing so feels a bit hacky and the phylo objects already have a system for storing node names.

As for a more direct way, here is an implementation that just takes the union of the lineages (preserving node names but not calculating branch lengths):

as_tree <- function(lineages){

  # get species names
  tip.label <- unique(sapply(lineages, function(x) tail(x$name, 1)))

  # get species taxon IDs
  tip.id <- unique(sapply(lineages, function(x) tail(x$id, 1)))

  # build edge list based on taxonomy IDs
  edges <- unique(do.call(what=rbind,
    lapply(lineages, function(x){
      matrix(c(head(x$id, -1), tail(x$id, -1)), ncol=2)
    })
  ))

  # get node IDs
  node.id <- setdiff(c(edges[,1], edges[,2]), tip.id)

  # make a map from taxonomy ID to internal 1:n ids
  idmap <- 1:(length(tip.label) + length(node.id))
  names(idmap) <- c(tip.id, node.id)

  # make a phylo object
  tree <- list(
    edges      = matrix(c(idmap[edges[,1]], idmap[edges[,2]]), ncol=2),
    tip.label  = unname(tip.label),
    node.label = unname(node.id),
    Nnode      = length(node.id)
  )
  class(tree) <- 'phylo'

  tree
}
arendsee commented 6 years ago

The phylo object created by the function I wrote above should have the same topology as the phylo object created by class2tree. So we can port the node names over. However, this requires a bit of work, since the trees will not generally have the same node indices (e.g. rotating a branch will change node ids). But we can make a map between indices:

### First a few utility functions
# map names to tree indices
name2id <- function(tree, names){
  match(names, c(tree$tip.label, tree$node.label))
}
# get the parents for a vector of nodes
parent <- function(tree, id){
  tree$edge[match(id, tree$edge[, 2]), 1]
}
# get vector of logicals which are TRUE if the id is root
is_root <- function(tree, id){
  !(id %in% tree$edge[, 2])
}

#' Map indices from a to b based off tip labels
#'
#' @param a phylo object
#' @param b phylo object
#' @return matrix mapping indices from `a` to `b`
map_ids <- function(a, b){
  # TODO: assert a and b have the same topology
  # start with map from the tips of `a` to those of `b`
  idmaps <- list(matrix(
      c(
        name2id(a, a$tip.label),
        name2id(b, a$tip.label)
      ), ncol=2))
  # iteratively find the parents of nodes, until root is reached
  while(TRUE){
    last_a <- tail(idmaps, n=1)[[1]][, 1]
    last_b <- tail(idmaps, n=1)[[1]][, 2]
    if(!all(is_root(a, last_a))){
      idmaps <- append(
        idmaps,
        list(unique(matrix(c(
            parent(a, last_a),
            parent(b, last_b)
          ), ncol=2)))
      )
    } else {
      break
    }
  }
  # merge the tables
  idmap <- do.call(rbind, idmaps)
  # get the unique rows that are not at root
  idmap <- unique(idmap[!is.na(idmap[, 1]), ])
  idmap
}

With this map, we can transfer node labels between trees:

#' Map nodes names from a to b
#'
#' The trees a and by must have the same topology and same tip labels.
#'
#' @param a phylo object with node labels
#' @param b phylo object
map_node_label <- function(a, b){
  # get map of indices from a to b
  idmap <- map_ids(a, b)
  # get the number of tips
  offset <- length(a$tip.label)
  # get the node indices, these are above the tip indices
  idmap <- idmap[idmap[, 1] > offset, ]
  # rebase
  idmap[, 1] <- idmap[, 1] - offset
  idmap[, 2] <- idmap[, 2] - offset
  # map names from a to b
  b$node.label[idmap[, 2]] <- a$node.label[idmap[, 1]]
  b
}

All of this assumes the trees have the same topology and tip labels. However, class2tree resolves polytomies (nodes with multiple children), forcing binary trees. We can work around this by collapsing the zero-length branches with ape::di2multi.

Putting all this together:

# borrowing the species list from the gist example
spnames <- c('Homo_sapiens','Pan_troglodytes','Macaca_mulatta','Mus_musculus','Rattus_norvegicus','Bos_taurus','Canis_lupus','Ornithorhynchus_anatinus','Xenopus_tropicalis','Takifugu_rubripes','Gallus_gallus','Ciona_intestinalis','Branchiostoma_floridae','Schistosoma_mansoni','Caenorhabditis_elegans','Anopheles_gambiae','Drosophila_melanogaster','Ixodes_scapularis','Ustilago_maydis','Neurospora_crassa','Monodelphis_domestica','Danio_rerio','Nematostella_vectensis','Cryptococcus_neoformans')

lineages <- taxize::classification(spnames, db='ncbi')

a <- as_tree(lineages)
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
b <- map_node_label(a, b)

This basically solves my problem. Though it is a bit circuitous.

sckott commented 6 years ago

thanks @arendsee having a look

sckott commented 6 years ago

@arendsee hmm, the example you gave above didn't work for me,

lineages <- taxize::classification(spnames, db='ncbi')

a <- as_tree(lineages)
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
map_node_label(a, b)
#> Error in b$node.label[idmap[, 2]] <- a$node.label[idmap[, 1]] :
#>   NAs are not allowed in subscripted assignments
arendsee commented 6 years ago

Damn, you're right. One second.

arendsee commented 6 years ago

Ah, try this:

a <- ape::collapse.singles(as_tree(lineages))
b <- taxize::class2tree(lineages, check=FALSE)$phylo
b <- ape::di2multi(b)
map_node_label(a, b)
arendsee commented 6 years ago

Although, the collapse.singles function should probably be called up in the as_tree function instead. Also, map_ids should die if the inputs have different topologies.

sckott commented 6 years ago

Nice, now it works for me.

sckott commented 6 years ago

@arendsee so what are you thinking here? do you propose to modify class2tree function with above changes? or is it a separate function? or something else?

arendsee commented 6 years ago

@sckott Sorry for the delayed response. I've been preoccupied writing a paper. I should be done by the end of the week, then I'll work a bit on a solution.

sckott commented 6 years ago

bump @arendsee

arendsee commented 6 years ago

@sckott Ah right. That done by the end of the week comment was totally wrong. I'll review the situation and get back with you later this weekend.

sckott commented 6 years ago

thanks!

arendsee commented 6 years ago

@sckott Have you worked at all with Guangchuang Yu? He wrote a package tidytree that represents trees in a nice tabular fashion (full disclosure, I'm a (very) minor contributor). I think this representation may be easier to work with than phylo. In my experience, programming with phylo is a nightmare.

I could update the class2tree with the functions I wrote above to map nodes between trees. But the code in my functions is such an kludgy mess that I hesitate to do so.

sckott commented 6 years ago

sorta, he has a few submissions in ropensci onboarding.

That sounds good to move to the tidytree format.

sckott commented 6 years ago

@arendsee any thoughts on your above statement?

I could update the class2tree with the functions I wrote above to map nodes between trees

sckott commented 6 years ago

closing due to long inactivity - please reopen though if anyone wants to address this