joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
581 stars 187 forks source link

Summarizing taxa, like in QIIME? #418

Closed nr0cinu closed 8 years ago

nr0cinu commented 9 years ago

Hi!

does phyloseq have a function that can summarize by taxonomic rank, like QIIME (http://qiime.org/scripts/summarize_taxa.html)?

I didn’t find anything, so I made my own functions, see below. (I you think its usefull feel free to add it to phyloseq.)

Keep up the great work!

Thanks, Bela

# Generic function
summarize_taxa <- function(counts, taxonomy) {
  if(is.matrix(taxonomy)) {
    #message('multiple taxonomies')
    alply(taxonomy, 2, summarize_taxa, counts = counts, .dims = TRUE)
  } else if(is.matrix(counts)) {
    #message('multiple counts')
    require('plyr')
    apply(counts, 2, summarize_taxa, taxonomy = taxonomy)
  } else {
    #message('summarize')
    tapply(counts, taxonomy, sum)
  }
}

library('phyloseq')
data('GlobalPatterns')
psdata <- GlobalPatterns

ot <- as(otu_table(psdata), 'matrix')
tax <- as(tax_table(psdata), 'matrix')

summarize_taxa(ot[, 1], tax[, 1])
summarize_taxa(ot[, 1:2], tax[, 1])
summarize_taxa(ot[, 1], tax[, 1:2])
summarize_taxa(ot[, 1:2], tax[, 1:2])
summarize_taxa(ot, tax)

# Takes phyloseq data and returns phyloseq data (OTUs are now taxonomic ranks)
phyloseq_summarize_taxa <- function(psdata, taxonomic.rank = rank_names(psdata)[1], errorIfNULL = TRUE) {
  taxa <- as(tax_table(psdata, errorIfNULL)[, taxonomic.rank], 'character')
  sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa)
  phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE))
}

phyloseq_summarize_taxa(psdata)
phyloseq_summarize_taxa(psdata, 'Phylum')
phyloseq_summarize_taxa(phyloseq(otu_table(psdata), tax_table(psdata)))

This does not do much input validation and assumes that taxa_are_rows == TRUE (because that’s how my data always looks like)

joey711 commented 9 years ago

Sounds like an interesting feature request and prototype. Thanks!

joey

nr0cinu commented 9 years ago

Happy to help/contribute :)

Here’s also an alternative version which can convert multiple taxonomic ranks at once

phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) {
  if(length(taxonomic.ranks) > 1) {
    names(taxonomic.ranks) <- taxonomic.ranks
    llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata)
  } else {
    taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character')
    sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa)
    phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE))    
  }
}
claczny commented 8 years ago

:+1: @and3k Very nice feature! @joey711 Any updates on the integration into phyloseq?

joey711 commented 8 years ago

I'll try to take a look soon. I have some ideas. Although I like plyr, I suspect a data.table approach would be much faster.

Also, what is the end use for this? Is this for plotting? Or you're planning to do some sort of analysis on the aggregated phyloseq object? The reason I'm asking is if the end use scope is limited, it might save a lot of time to not make a new phyloseq object, for example. And for plotting a data.table is a preferred object anyway, because it inherits from data.frame and works directly with ggplot2.

Thanks for the kind feedback and code!

joey

CarlyMuletzWolz commented 8 years ago

summarize_taxa.py and compute_core_microbiome.py are some of the last functions I use in QIIME, other than the initial de-multiplexing.

I use the output of summarize_taxa to calculate the relative abundance of phyla, genera etc. Here is a clip from a rough draft of a manuscript I am working on:

"The phylum Proteobacteria was the most dominant with each individual having an average relative abundance of 88% (range 66 – 99%), and the most common genera within that phylum were Acinetobacter (average = 44%, range 5 – 83%), and Pseudomonas (average = 32%, range = 2 – 81 %)."

So, that is the type of information I would be looking for just quickly thinking about it. There might be other ways to calculate this, but summarize_taxa.py was a way that I knew to do this.

Cheers,

Carly

joey711 commented 8 years ago

Thanks, that's helpful. Yes, there are some really fast ways to do this with data.table. When I get a chance I'll send an example and see if it fits the use-case.

Also, for specific mentions in a manuscript body text the table is nice, but it seems to me a more useful, if not always published, use case would be a graphic that summarizes this information. This is possible to do already with the scatter or bar panels in shiny-phyloseq. But a revamp of these with data.table for this purpose is probably worthwhile.

Thanks again

joey

joey711 commented 8 years ago

The following is a prototype. Let me know what you think.


library("phyloseq")
library("data.table")
library("ggplot2")

fast_melt = function(physeq){
  # supports "naked" otu_table as `physeq` input.
  otutab = as(otu_table(physeq), "matrix")
  if(!taxa_are_rows(physeq)){otutab <- t(otutab)}
  otudt = data.table(otutab, keep.rownames = TRUE)
  setnames(otudt, "rn", "taxaID")
  # Enforce character taxaID key
  otudt[, taxaIDchar := as.character(taxaID)]
  otudt[, taxaID := NULL]
  setnames(otudt, "taxaIDchar", "taxaID")
  # Melt count table
  mdt = melt.data.table(otudt, 
                        id.vars = "taxaID",
                        variable.name = "SampleID",
                        value.name = "count")
  # Remove zeroes, NAs
  mdt <- mdt[count > 0][!is.na(count)]
  # Calculate relative abundance
  mdt[, RelativeAbundance := count / sum(count), by = SampleID]
  if(!is.null(tax_table(physeq, errorIfNULL = FALSE))){
    # If there is a tax_table, join with it. Otherwise, skip this join.
    taxdt = data.table(as(tax_table(physeq, errorIfNULL = TRUE), "matrix"), keep.rownames = TRUE)
    setnames(taxdt, "rn", "taxaID")
    # Enforce character taxaID key
    taxdt[, taxaIDchar := as.character(taxaID)]
    taxdt[, taxaID := NULL]
    setnames(taxdt, "taxaIDchar", "taxaID")
    # Join with tax table
    setkey(taxdt, "taxaID")
    setkey(mdt, "taxaID")
    mdt <- taxdt[mdt]
  }
  return(mdt)
}

summarize_taxa = function(physeq, Rank, GroupBy = NULL){
  Rank <- Rank[1]
  if(!Rank %in% rank_names(physeq)){
    message("The argument to `Rank` was:\n", Rank,
            "\nBut it was not found among taxonomic ranks:\n",
            paste0(rank_names(physeq), collapse = ", "), "\n",
            "Please check the list shown above and try again.")
  }
  if(!is.null(GroupBy)){
    GroupBy <- GroupBy[1]
    if(!GroupBy %in% sample_variables(physeq)){
      message("The argument to `GroupBy` was:\n", GroupBy,
              "\nBut it was not found among sample variables:\n",
              paste0(sample_variables(physeq), collapse = ", "), "\n",
              "Please check the list shown above and try again.")
    }
  }
  # Start with fast melt
  mdt = fast_melt(physeq)
  if(!is.null(GroupBy)){
    # Add the variable indicated in `GroupBy`, if provided.
    sdt = data.table(SampleID = sample_names(physeq),
                     var1 = get_variable(physeq, GroupBy))
    setnames(sdt, "var1", GroupBy)
    # Join
    setkey(sdt, SampleID)
    setkey(mdt, SampleID)
    mdt <- sdt[mdt]
  }
  # Summarize
  summarydt = mdt[, list(meanRA = mean(RelativeAbundance),
                         sdRA = sd(RelativeAbundance),
                         minRA = min(RelativeAbundance),
                         maxRA = max(RelativeAbundance)),
                  by = c(Rank, GroupBy)]
  return(summarydt)
}

plot_taxa_summary = function(physeq, Rank, GroupBy = NULL){
  # Get taxa summary table 
  dt1 = summarize_taxa(physeq, Rank = Rank, GroupBy = GroupBy)
  # Set factor appropriately for plotting
  RankCol = which(colnames(dt1) == Rank)
  setorder(dt1, -meanRA)
  dt1[, RankFac := factor(dt1[[Rank]], 
                          levels = rev(dt1[[Rank]]))]
  dt1[, ebarMax := max(c(0, min(meanRA + sdRA))), by = eval(Rank)]
  dt1[, ebarMin := max(c(0, min(meanRA - sdRA))), by = eval(Rank)]
  # Set zeroes to one-tenth the smallest value
  ebarMinFloor = dt1[(ebarMin > 0), min(ebarMin)]
  ebarMinFloor <- ebarMinFloor / 10
  dt1[(ebarMin == 0), ebarMin := ebarMinFloor]

  pRank = ggplot(dt1, aes(x = meanRA, y = RankFac)) +
    scale_x_log10() +
    xlab("Mean Relative Abundance") +
    ylab(Rank) +
    theme_bw()
  if(!is.null(GroupBy)){
    # pRank <- pRank + facet_wrap(facets = as.formula(paste("~", GroupBy)))
    pRank <- pRank + geom_point(mapping = aes_string(colour = GroupBy),
                                size = 5)
  } else {
    # Don't include error bars for faceted version
    pRank <- pRank + geom_errorbarh(aes(xmax = ebarMax,
                                        xmin = ebarMin))
  }
  return(pRank)
}

# Test
data("GlobalPatterns")
plot_taxa_summary(GlobalPatterns, "Phylum")
plot_taxa_summary(GlobalPatterns, "Phylum", "SampleType")
summarize_taxa(GlobalPatterns, "Phylum")
summarize_taxa(GlobalPatterns, "Phylum", "SampleType")
CarlyMuletzWolz commented 8 years ago

Thanks! The plots are informative and the output is exactly what I was looking for.

My only concern is that I am not getting similar results in relative abundance from what I got with QIIME.

When I use summarize_taxa, and then sum the meanRA it does not total anywhere close to 1. It sums to 0.053 for my data that I am using. I can see why it might not sum to exactly 1 since you are taking the average for each sample, but I would think that summing the average relative abundance would get close to 1.

I wonder if it has something to do with removing zeros in the fast_melt function before doing the relative abundance. Don't you want to leave zeros in? Hmm...so I tried altering the code and removing the remove zeros line and that made the mean RA sum be 0.0066 so even lower....

screen shot 2015-12-29 at 2 15 23 pm

For GP dataset you get 0.008.

taxa2 <- summarize_taxa(GlobalPatterns, "Phylum") sum(taxa2$meanRA)

Thoughts?

joey711 commented 8 years ago

Sum to 1

There's no reasons the means should sum to 1.0 in this data. They are means across samples for each Phylum, and in some cases the data is very skewed. Note for instance Proteobacteria, where the range is from 1e-4 to >0.76. Including the zeros "drags downward" the mean even further, which is what you observed when you tried that.

For this dataset it probably only makes sense to consider central values (e.g. mean) among microbiomes that are comparable, for instance the same SampleType in Global Patterns. I can't see much biological meaning in the mean value of relative abundance across samples that in some cases have very high proteobacteria, while in other cases never have any. The mean in a case like this ends up between 2 or more distributions that are far apart, and while numerically accurate, doesn't represent the observed relative abundance in either.

This is what motivated me to also include the GroupBy argument. Try it out and see if those values are more meaningful within each microbiome class.

Missing Zeroes

I see what you mean about zeroes, because this changes the calculation for each mean. It's inefficient to try to force the zeroes in, and we already know what the denominator should be in every case, sum(Nsamples).

The following lines

  # Summarize
  summarydt = mdt[, list(meanRA = mean(RelativeAbundance),
                         sdRA = sd(RelativeAbundance),
                         minRA = min(RelativeAbundance),
                         maxRA = max(RelativeAbundance)),
                  by = c(Rank, GroupBy)]

could be changed to

  # Summarize
Nsamples = nsamples(physeq)
  summarydt = mdt[, list(meanRA = sum(RelativeAbundance)/Nsamples,
                         sdRA = sd(RelativeAbundance),
                         minRA = min(RelativeAbundance),
                         maxRA = max(RelativeAbundance)),
                  by = c(Rank, GroupBy)]

and it would address your concerns about zero-entries being missing in the sparse (long-form) table.

For these same reasons sd should be interpreted with a grain of salt... which is recommended anyway, given the data. It is a sd of the RAs of non-zero entries only, and therefore representative only when most of the entries are non-zero. On the other hand, when you have a large fraction of zero values, it would still be difficult to interpret directly with zeroes left in because it is unknown if the taxa are actually missing, or present but unobserved at this sequencing depth.

Anyway, try out this change, and see if you now get the same values as QIIME (for what it's worth).

CarlyMuletzWolz commented 8 years ago

Sum to one

I used the sum to one to figure out that the calculations seem to be off. While the data can be skewed, the total sum of the mean relative abundance for each phylum or each level of choice should be close to one. You are taking the average so it should be a central value for each phylum, and then when you add up all the central values you should get close to one. If the data is skewed for a lot of phyla I could see the value being off but not coming to 0.008 as in the Global Patterns dataset. It just seemed off. I see your point for by sampletype argument, and thanks for adding that.

Suggested change

Great! Yes, when I made the suggested change to summarize_taxa it actually gives the same values as I got before. Cool!

And when I do the suggested change the sum of the meanRA of all phyla sum to 1!

sum(taxa$meanRA) [1] 1

Now, there are still values that appear to be off screen shot 2015-12-29 at 8 50 19 pm

As you can see the minRA and maxRA are incorrect. The mean is .877 and the sd is 0.09, which both look accurate from looking at my data. There is no way the maxRA could be 0.76 then. Also, when I look at the data, the lowest RA for a sample for Proteobacteria is 0.66 and the highest is 0.98. Not sure what is off there.

Thanks for the help!

Carly

guoyanzhao commented 8 years ago

Hi,

Is this function added to the Phyloseq package? I tried "summarize_taxa" and got "No documentation for ‘summarize_taxa’ in specified packages and libraries: you could try ‘??summarize_taxa’" error message. When I searched for it no results found. Following table is what I would like to have for exporting to a text file for statistical test etc. The first column is taxa rank (Phylum in this case) and each column following is the abundance of a sample (I don't know how to keep the tab separated format). I searched for several days and was not able to find any example. Thank you for your help! Guoyan

summarize_taxa(ot[, 1:5], tax[, 2]) CL3 CC1 SV1 M31Fcsw M11Fcsw ABY1_OD1 11 33 0 0 0 AC1 0 0 0 0 0 Acidobacteria 269202 183132 92863 120 115 Actinobacteria 39601 90280 121703 2540 841 AD3 1299 15577 17 2 1 Armatimonadetes 782 1124 1972 0 1

cyklee commented 8 years ago

@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.

guoyanzhao commented 8 years ago

Hi Kevin,

Thank you for your quick response!

Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!

Best, Guoyan

Sent from my iPhone

On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:

@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

spholmes commented 8 years ago

Just add this in your code before you use the function in any particular session type:

phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) { if(length(taxonomic.ranks) > 1) { names(taxonomic.ranks) <- taxonomic.ranks llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata) } else { taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character') sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa) phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE)) } }

On Wed, Sep 14, 2016 at 2:21 PM, guoyanzhao notifications@github.com wrote:

Hi Kevin,

Thank you for your quick response!

Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!

Best, Guoyan

Sent from my iPhone

On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:

@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/418#issuecomment-247010554, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvdeb2C0200A27nYz8IYPliEDWb4mks5qp_TpgaJpZM4DKkDy .

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

guoyanzhao commented 7 years ago

Sorry for taking me so long to get back to you. Here is what worked for me:

agglomerate taxa

glom <- tax_glom(ps0, taxrank = 'Phylum')

create dataframe from phyloseq object

dat <- psmelt(glom)

convert Phylum to a character vector from a factor

dat$Phylum <- as.character(dat$Phylum)

aggregate

Phylum_abundance <- aggregate(Abundance~Sample+Phylum, dat, FUN=sum)

reorganize the table so that each phylum is a column

library(reshape) Phylum_abundance <- cast(Phylum_abundance, Sample ~ Phylum)

this is what I get

| Sample | pActinobacteria | pArmatimonadetes | pBacteroidetes | pCyanobacteria | p__Firmicutes | | G35354 | 8493 | 0 | 3856 | 0 | 7333 | | G35355 | 8151 | 0 | 15331 | 0 | 12277 | | G35356 | 39271 | 0 | 2251 | 0 | 4517 | | G35370 | 7253 | 0 | 213 | 0 | 9851 |

It would be nice if can do this at a specific level (order, family etc.) for a given phylum (e.g. p__Actinobacteria) . Haven't figure out how to do it. Thanks,Guoyan From: Susan Holmes notifications@github.com To: joey711/phyloseq phyloseq@noreply.github.com Cc: guoyanzhao gy_zhao@yahoo.com; Comment comment@noreply.github.com Sent: Wednesday, September 14, 2016 8:41 AM Subject: Re: [joey711/phyloseq] Summarizing taxa, like in QIIME? (#418)

Just add this in your code before you use the function in any particular session type:

phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) { if(length(taxonomic.ranks) > 1) { names(taxonomic.ranks) <- taxonomic.ranks llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata) } else { taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character') sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa) phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE)) } }

On Wed, Sep 14, 2016 at 2:21 PM, guoyanzhao notifications@github.com wrote:

Hi Kevin,

Thank you for your quick response!

Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!

Best, Guoyan

Sent from my iPhone

On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:

@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/418#issuecomment-247010554, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvdeb2C0200A27nYz8IYPliEDWb4mks5qp_TpgaJpZM4DKkDy .

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/ — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

NainaKhandelwal commented 7 years ago

Is there a way to "Group By" more than one variable in the summarize_taxa and plot_taxa_summary. I found edit when all means add to 1 really helpful. Being able to summarize by more than one groups will really help further analysis.

Thanks!

alnf commented 5 years ago

So, _summarizetaxa function was never implemented in the package in the end? Or there is other issue covering this feature implementation?

AnthonyScales commented 5 years ago

Hello and thank you guys for creating this function. Ive been using the summarize function and code from guoyanzhao and Its worked well for all phyla and as well as looking at the orders from the proteobacteria phylum. However, i can`t retrieve the data for lower level taxa e.g. Class,Order,Family....

The problem lies when I try and run the as.character function and get the error code: Error in $<-.data.frame(*tmp*, Class, value = c("Alphaproteobacteria", : replacement has 2750 rows, data has 1400

I thought this error was being caused by the NAs in the data, so i got rid of them by using the function phyloseq_rm_na_tax. https://rdrr.io/github/vmikk/metagMisc/src/R/physeq_rm_na_tax.R. Though this doesnt seem to help at all. Any help with this would be much appreciated :)

This is the code that im running: glomclass <- tax_glom(finalps.ass , taxrank = 'Class') cleanglomclass <- phyloseq_rm_na_tax(glomclass) datclass <- psmelt(cleanglomclass) dat$Class <- as.character(datclass$Class) Error in$<-.data.frame(tmp`, Class, value = c("Alphaproteobacteria", : replacement has 2750 rows, data has 1400