joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
582 stars 187 forks source link

tip_glom and annotations? #877

Open werdnus opened 6 years ago

werdnus commented 6 years ago

Okay, I've been banging my head against this wall for too long, and it can't just be me having this problem. I'm not even sure this is the appropriate place to ask — if it's not, can someone point me to the correct forum?

I'd like to be able to use PhyloSeq's tip_glom() or something similar to group closely related organisms in metabarcoding studies (16S amplicons), and have annotations with particular taxa (or OTUs or RSVs... or Bioinformatically Inferred Taxon Equivalents (BITEs)) propagate to the new tip.

As an example, I've got a sample of a microbial community from before some treatment, and one from after. I'd like to annotate those taxa that persist through the treatment. Given the sampling issues etc. with NGS metabarcoding studies, I think I'd like to look at the data at a slightly higher level of relatedness: I could just use SILVA and than apply tax_glom() at the genus level, but it turns out more than half of my organisms get <NA> at the genus level. Makes more sense to tackle it phylogenetically.

Let's make an example:

## ================
## Toy example 
## ================

## +++++
## install/load required packages:
library(phyloseq); packageVersion("phyloseq")

## install the dev build of treeio and ggtree
## (because I can't get the stable builds to play nice with each other)
devtools::install_github("GuangchuangYu/treeio")
library(treeio); packageVersion("treeio")
devtools::install_github("GuangchuangYu/ggtree")
library(ggtree); packageVersion("ggtree")

## +++++
## build an example:

## pull the data in
data("GlobalPatterns")

## prune out the first 50 taxa from the dataset
physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)

## pull the tree out of the phyloseq object
GP.tree <- phy_tree(physeq)

## make a trait annotation for those 50 sequences
trait <- c(F,F,F,T,F,T,F,T,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F)
trait <- as.factor(trait) # doesn't work as logical

## put the trait in a dataframe
x <- data_frame(label = GP.tree$tip.label, trait = trait)

## convert the phylo object to a treeio::treedata object
GP.tree <- treedata(phylo = GP.tree)

## add the annotation
GP.tree <- full_join(GP.tree, x, by="label")

## take a look at the tree
ggtree(GP.tree) + geom_text2(aes(subset=!isTip, label=node), hjust=-.3) + geom_tiplab(aes(color = as.numeric(trait)))

full-tree

So, you can see that there's some phylogenetic signal going on here: all of the TRUE taxa are basically in a single clade (I did that on purpose). But, again, I'd like to simplify things by grouping related taxa — sort of like making OTUs (particularly because I'm working with DADA2 RSVs, and want to group them at about the species level).

PhyloSeq has a great tool for this, tip_glom(), but I can't use the treeio treedata object to make the phyloseq object; I have to convert back to an ape::phylo object to get back into PhyloSeq. And that means I lose my annotations, sadly. But, let's see what happens:

## +++++
## agglomerate tips to cluster similar taxa:

## agglomerate the tips with phyloseq
GP.tree.tip <- tip_glom(as.phylo(GP.tree), h = 0.1)

## convert back to a treeio::treedata object
GP.tree.tip <- treedata(phylo = GP.tree.tip)

## take a look at the tree
ggtree(GP.tree.tip) + geom_text2(aes(subset=!isTip, label=node), hjust=-.3) + geom_tiplab()

rplot

So, we end up compressing things, and it looks about like I'd like. But, the choice of which tip-labels end up as the final label for groups that are agglomerated seems a bit arbitrary — it might be the most basal taxon?

Mapping the agglomerated clades onto the initial tree, we can see that which taxa are glommed together is sometimes clear, and sometimes not. But, you can see that most of the pattern is lost!

## which clades get compressed, and what ends up as the tip label?
ggtree(GP.tree) + geom_text2(aes(subset=!isTip, label=node), hjust=-.3) + geom_tiplab() +
  geom_cladelabel(node=53, label="549322", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=59, label="143239", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=60, label="255340", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=99, label="546313", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=64, label="215972", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=68, label="406058", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=70, label="1126", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=78, label="541313", align=TRUE, offset = 0.05) +
  #geom_cladelabel(node=59, label="group", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=92, label="315545", align=TRUE, offset = 0.05) +
  geom_cladelabel(node=93, label="341551", align=TRUE, offset = 0.05) +
  geom_tiplab(aes(color = as.numeric(trait)))

So: what I want to do is really simple, and I can't believe that there's not a reasonably easy way to do it. How can I propagate those annotations to the final tip labels?

JocelynSP commented 6 years ago

A very partial answer - tip_glom labels the glommed taxa with the label of the input which has the largest count. (I have run into trouble mixing up gloms of taxa with raw counts and gloms with normalized counts, making chaos of my taxa-names.) ~I don't know if you can use this as a work-around, for example add a large number to the OTU count of samples with trait=='T' , perform tip_glom, then subtract it again?~

After a little further reading: tip_glom() uses merge_taxa(), which will name merged taxa by the largest counts if an OTU table is supplied, or by the first taxon in the list to be merged if no counts are available. As your example used the tree only, your tips are named by the 1st. tip_glom() has the option of passing arguments to the clustering function which by default is agnes(), which has a parameter trace.lev. If you do tip_glom(as.phylo(GP.tree), h = 0.1, trace.lev = 1) does it tell anything interesting? (I should try for myself)

grabear commented 6 years ago

You could try using https://github.com/grunwaldlab/metacoder

grabear commented 6 years ago

854