joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

Speed up psmelt through tidyverse functions #1041

Open mikemc opened 5 years ago

mikemc commented 5 years ago

I often convert my phyloseq object into a "tidy" data frame for data manipulation and visualization rather than using phyloseq's built-in functions. The psmelt function is slow enough on medium to large datasets to disrupt interactive workflows, and a significant speedup can be obtained by using functions from tibble and dplyr to merge the otu table, sample data, and tax tables, over the current approach. In particular, the function

tidy_psmelt <- function(physeq) {
    ### INSERT Initial variable and rank name checking and modding from `psmelt` source HERE
    # Get the OTU table with taxa as rows
    otutab <- otu_table(physeq)
    if (!taxa_are_rows(otutab)) {
        otutab <- t(otutab)
    }
    # Convert the otu table to a tibble in tidy form
    tb <- otutab %>% 
        as("matrix") %>%
        tibble::as_tibble(rownames = "OTU") %>%
        tidyr::gather("Sample", "Abundance", -OTU)
    # Add the sample data if it exists
    if (!is.null(sampleVars)) {
        sam <- sample_data(physeq) %>%
            as("data.frame") %>% 
            tibble::as_tibble(rownames = "Sample")
        tb <- tb %>%
            dplyr::left_join(sam, by = "Sample")
    }
    # Add the tax table if it exists
    if (!is.null(rankNames)) {
        tax <- tax_table(physeq) %>%
            as("matrix") %>%
            tibble::as_tibble(rownames = "OTU")
        tb <- tb %>%
            dplyr::left_join(tax, by = "OTU")
    }
    tb %>%
        arrange(desc(Abundance))
    # Optional conversion to a data frame doesn't affect the speed/memory usage
    # %>% as.data.frame
}

takes 0.24 seconds on GlobalPatterns versus 90 seconds for psmelt on my Lenovo X1 Carbon 5G laptop; I think about half of this 0.24s is from the final sorting of OTUs by Abundance. The function above is almost a drop in replacement for psmelt, except it returns a tibble and leaves the taxonomy variables as character vectors. The resulting tibble from this function is ~20% larger in memory (52.7 Mb versus 41.3 Mb) but this difference disappears if the tax table variables are converted to factors as psmelt does. (Actually, the resulting tibble or data frame takes up slightly less memory in this case for some reason).

joey711 commented 5 years ago

I like this idea. I'll mark it as a new feature to include. One issue to overcome is phyloseq's already bloated dependencies. Arguably these tidy packages are a good thing for users of phyloseq to have, anyway, but from experience, more-dependencies means more maintenance issues to resolve as those packages change and break phyloseq by accident. See: ggplot2 a few months ago.

mikemc commented 5 years ago

Yea, those seem like fair points. In this case, I expect that the functions used (as_tibble, gather, and left_join) should be safer than most from being changed. I think the argument for including dplyr would be strengthened if it would be similarly possible to speed up other functions, such as tax_glom. For instance, a quick-and-dirty Family level glom (skipping some phyloseq features regarding NA tax assignments and keeping an OTU name) done as below is very fast,

library(phyloseq)
library(dplyr)

data(GlobalPatterns)
ps <- GlobalPatterns

tb <- otu_table(ps) %>%       # Note, need taxa_as_rows = TRUE
    as("matrix") %>%
    as_tibble(rownames = "OTU")
tax <- tax_table(ps) %>%
    as("matrix") %>%
    as_tibble(rownames = "OTU")
tb <- tb %>%
    left_join(tax, by = "OTU")

group_ranks <- rank_names(ps)[seq(which(rank_names(ps) == "Family"))]
other_ranks <- setdiff(rank_names(ps), group_ranks)
tb0 <- tb %>%
    group_by_at(vars(group_ranks)) %>%
    summarize_at(vars(-OTU, -other_ranks), sum)
mikemc commented 5 years ago

My current reimplementation of psmelt is here; this version tries to be more consistent with phyloseq's psmelt, but the output differs in a couple ways noted here. I've also implemented a faster tax_glom() using dplyr in the same repo.

If we wanted to avoid introducing new dependencies, I expect we could translate the basic approach to use data.table instead, though I've never used it myself and so am not sure about this.

mikemc commented 4 years ago

https://github.com/mikemc/speedyseq/commit/76a4de76b71a333cdc175c0ba8a8fe18210f7111 provides a data.table implementation of psmelt() that is another 2x faster than the above.