Open mikemc opened 5 years ago
I like this idea. I'll mark it as a new feature to include. One issue to overcome is phyloseq's already bloated dependencies. Arguably these tidy packages are a good thing for users of phyloseq to have anyway, but from experience, more dependencies mean more maintenance issues to resolve as those packages change and break phyloseq by accident. See: ggplot2 a few months ago.
Yeah, those seem like fair points. In this case, I expect that the functions used (`as_tibble`, `gather`, and `left_join`) should be safer than most from being changed. I think the argument for including `dplyr` would be strengthened if it were similarly possible to speed up other functions, such as `tax_glom`. For instance, a quick-and-dirty Family-level glom (skipping some phyloseq features regarding NA tax assignments and keeping an OTU name), done as below, is very fast:
library(phyloseq)
library(dplyr)
data(GlobalPatterns)
ps <- GlobalPatterns
# OTU table as a tibble, one row per OTU
tb <- otu_table(ps) %>% # Note, need taxa_are_rows = TRUE
  as("matrix") %>%
  as_tibble(rownames = "OTU")
# Taxonomy table as a tibble, keyed by the same OTU names
tax <- tax_table(ps) %>%
  as("matrix") %>%
  as_tibble(rownames = "OTU")
tb <- tb %>%
  left_join(tax, by = "OTU")
# Ranks from the top rank down to Family define the groups
group_ranks <- rank_names(ps)[seq(which(rank_names(ps) == "Family"))]
other_ranks <- setdiff(rank_names(ps), group_ranks)
# Sum the sample columns within each group
tb0 <- tb %>%
  group_by_at(vars(group_ranks)) %>%
  summarize_at(vars(-OTU, -other_ranks), sum)
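A quick way to sanity-check this kind of glom is to confirm that per-sample totals survive the aggregation. The sketch below does that on a small made-up table, so it runs without phyloseq. The column names, taxa, and counts are hypothetical stand-ins for the joined `tb` above, not data from `GlobalPatterns`.

```r
library(dplyr)

# Hypothetical joined table: one row per OTU, with sample counts and taxonomy.
tb_toy <- tibble(
  OTU     = c("OTU1", "OTU2", "OTU3"),
  S1      = c(10, 5, 1),
  S2      = c(0, 2, 4),
  Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
  Family  = c("Lachnospiraceae", "Lachnospiraceae", "Bacteroidaceae"),
  Genus   = c("Blautia", "Roseburia", "Bacteroides")
)

# Group by the ranks down to Family and sum the sample columns,
# dropping OTU and the finer rank (Genus) as in the snippet above.
glommed <- tb_toy %>%
  group_by(Kingdom, Family) %>%
  summarize_at(vars(S1, S2), sum) %>%
  ungroup()

# Aggregation should not change the total count in any sample.
stopifnot(
  sum(glommed$S1) == sum(tb_toy$S1),
  sum(glommed$S2) == sum(tb_toy$S2)
)
```

The same check could be run against `tb0` and `tb` on real data by comparing column sums of the sample columns.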
My current reimplementation of `psmelt` is here; this version tries to be more consistent with phyloseq's `psmelt`, but the output differs in a couple ways noted here. I've also implemented a faster `tax_glom()` using dplyr in the same repo.
If we wanted to avoid introducing new dependencies, I expect we could translate the basic approach to use `data.table` instead, though I've never used it myself and so am not sure about this.
https://github.com/mikemc/speedyseq/commit/76a4de76b71a333cdc175c0ba8a8fe18210f7111 provides a `data.table` implementation of `psmelt()` that is another 2x faster than the above.
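For readers unfamiliar with `data.table`, the general melt-then-join pattern can be sketched on toy data as follows. This is not the code from the linked commit; the matrices, sample names, and column names here are made up for illustration, and a real implementation would also need to handle phyloseq-specific details such as `taxa_are_rows`.

```r
library(data.table)

# Hypothetical OTU count matrix with taxa as rows and samples as columns.
otu <- matrix(
  c(10, 0, 3,
     5, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
# Hypothetical taxonomy and sample metadata tables.
tax <- data.table(
  OTU    = c("OTU1", "OTU2"),
  Phylum = c("Firmicutes", "Proteobacteria")
)
sam <- data.table(
  Sample     = c("S1", "S2", "S3"),
  SampleType = c("Soil", "Feces", "Soil")
)

# Matrix -> data.table, keeping row names in a column called "OTU".
otu_dt <- as.data.table(otu, keep.rownames = TRUE)
setnames(otu_dt, "rn", "OTU")

# Wide -> long: one row per (OTU, Sample) pair.
long <- melt(
  otu_dt,
  id.vars = "OTU",
  variable.name = "Sample",
  value.name = "Abundance",
  variable.factor = FALSE
)

# Attach sample metadata and taxonomy, then sort by decreasing abundance.
long <- merge(long, sam, by = "Sample", all.x = TRUE)
long <- merge(long, tax, by = "OTU", all.x = TRUE)
setorder(long, -Abundance)
```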
I often convert my phyloseq object into a "tidy" data frame for data manipulation and visualization rather than using `phyloseq`'s built-in functions. The `psmelt` function is slow enough on medium to large datasets to disrupt interactive workflows, and a significant speedup can be obtained by using functions from `tibble` and `dplyr` to merge the otu table, sample data, and tax tables, over the current approach. In particular, the function takes 0.24 seconds on `GlobalPatterns` versus 90 seconds for `psmelt` on my Lenovo X1 Carbon 5G laptop; I think about half of this 0.24s is from the final sorting of OTUs by Abundance. The function above is almost a drop-in replacement for `psmelt`, except it returns a `tibble` and leaves the taxonomy variables as character vectors. The resulting tibble from this function is ~20% larger in memory (52.7 Mb versus 41.3 Mb), but this difference disappears if the tax table variables are converted to factors as `psmelt` does. (Actually, the resulting tibble or data frame takes up slightly less memory in this case for some reason.)
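The merge described here can be sketched on toy data as below. This is a minimal illustration of the `as_tibble` / `gather` / `left_join` pattern, not the function referred to in this issue. The matrices, sample metadata, and column names are hypothetical, and `gather` is loaded from `tidyr`, which is an assumption about where it would come from.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the OTU table, tax table, and sample data.
otu <- matrix(
  c(10, 0, 3,
     5, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
tax <- matrix(
  c("Bacteria", "Firmicutes",
    "Bacteria", "Proteobacteria"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2"), c("Kingdom", "Phylum"))
)
sam <- data.frame(
  Sample     = c("S1", "S2", "S3"),
  SampleType = c("Soil", "Feces", "Soil"),
  stringsAsFactors = FALSE
)

melted <- otu %>%
  as_tibble(rownames = "OTU") %>%                 # matrix -> tibble, OTU names as a column
  gather("Sample", "Abundance", -OTU) %>%         # wide -> long, one row per OTU/sample pair
  left_join(sam, by = "Sample") %>%               # attach sample metadata
  left_join(as_tibble(tax, rownames = "OTU"), by = "OTU") %>%  # attach taxonomy
  arrange(desc(Abundance))                        # final sort by abundance

# Optional: store taxonomy columns as factors, as psmelt does.
melted_fct <- melted %>%
  mutate_at(c("Kingdom", "Phylum"), as.factor)
```

On real data, `object.size(melted)` and `object.size(melted_fct)` give a rough way to compare the memory footprint of the character and factor versions.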