joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

tax_glom then psmelt centered-log-ratio transformed data? #1636

Open gabrielet opened 2 years ago

gabrielet commented 2 years ago

I wanted to plot abundances using centered log ratio, so I did the transformation. Then, I needed to agglomerate my taxa by some specific taxonomic level, e.g. Phylum, and after that I psmelt-ed the resulting table to plot.

The question I have: is it correct to perform tax_glom() and psmelt() on centered log ratio-ed data? Are these re-organisations of the dataset affecting the counts, somehow? I feel it might not be a good idea but I am not sure (I do that with relative abundances, but then these are not log-transformed data).

The only two piece of info i found are this post and this FAQ but they are not useful, for my purpose.

A general idea of how this works:

# aggregate
phylo_aggregated <- tax_glom(phylo_data, taxrank=taxa_level)

# psmelt
psmelted <- psmelt(phylo_aggregated)

# plot
psmelted[psmelted$TimePoint==timepoint, ] %>% # subset by timepoint
    filter(Abundance >= threshold | Abundance <= -threshold) %>% # subset by abundance
        mutate(Phylum = fct_reorder(Phylum, Abundance, .desc=T)) %>% # reorder by abundance
        ggplot(aes(x=Treatment, y=Abundance, fill=Phylum)) + # plot
        geom_boxplot() +
        scale_fill_viridis(discrete=T, option="turbo") +
        theme_bw() +
        theme(text=element_text(size=25, face="bold"), legend.position="none", axis.text.x = element_text(angle=0), strip.text=element_text(size=15)) +
        guides(fill=guide_legend(ncol=1)) +
        xlab("") +
        ylab("") +
        facet_wrap(~Phylum) +
        ggtitle(paste0("Taxa composition in ", timep, " at ", taxa_level, " level"))
gabrielet commented 2 years ago

ok, so I found a solution that I think may work. what I do is to compute the best taxa using the agglomerated values, then I prune those taxa that are not among the best. finally I psmelt this pruned dataset and I plot using the centered log ratios.

# aggregate phylum
phylo_agglomerated <- tax_glom(phylo_data, taxrank=taxa_level)

# create centered log ratio object
...

# get names of best n taxa
max_taxa <- 6
best_taxa <- names(sort(taxa_sums(phylo_agglomerated), decreasing=TRUE)[1:max_taxa])

# get best max_taxa from the centered log ratio object
phylo_thresholded <- prune_taxa(best_taxa, phylo_clr)

# psmelt
psmelted <- psmelt(phylo_thresholded)

# plot
psmelted %>%
  filter(TimePoint==timep) %>% # subset by time point
  filter(Treatment==a_treat) %>% # filter by treatment
  group_by(DNA_RNA) %>% # group by treatment
  mutate(Phylum = fct_reorder(Phylum, Abundance, .desc=T)) %>% # reorder by abundance
  ggplot(aes(x=DNA_RNA, y=Abundance, fill=Phylum)) + # plot
  geom_boxplot() +
  scale_fill_viridis(discrete=T, option="turbo") +
  theme_bw() +
  theme(text=element_text(size=25, face="bold"), legend.position="none", axis.text.x = element_text(angle=0), strip.text=element_text(size=15)) +
  guides(fill=guide_legend(ncol=1)) +
  xlab("") +
  ylab("") +
  facet_wrap(~Phylum) +
  ggtitle(paste0(taxa_level, " composition comparison between DNA and RNA in ", timep))