Closed nr0cinu closed 8 years ago
Sounds like an interesting feature request and prototype. Thanks!
joey
Happy to help/contribute :)
Here’s also an alternative version which can convert multiple taxonomic ranks at once
phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) {
if(length(taxonomic.ranks) > 1) {
names(taxonomic.ranks) <- taxonomic.ranks
llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata)
} else {
taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character')
sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa)
phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE))
}
}
:+1: @and3k Very nice feature! @joey711 Any updates on the integration into phyloseq?
I'll try to take a look soon. I have some ideas. Although I like plyr, I suspect a data.table approach would be much faster.
Also, what is the end use for this? Is this for plotting? Or you're planning to do some sort of analysis on the aggregated phyloseq object? The reason I'm asking is if the end use scope is limited, it might save a lot of time to not make a new phyloseq object, for example. And for plotting a data.table is a preferred object anyway, because it inherits from data.frame and works directly with ggplot2.
Thanks for the kind feedback and code!
joey
summarize_taxa.py and compute_core_microbiome.py are some of the last functions I use in QIIME, other than the initial de-multiplexing.
I use the output of summarize_taxa to calculate the relative abundance of phyla, genera etc. Here is a clip from a rough draft of a manuscript I am working on:
"The phylum Proteobacteria was the most dominant with each individual having an average relative abundance of 88% (range 66 – 99%), and the most common genera within that phylum were Acinetobacter (average = 44%, range 5 – 83%), and Pseudomonas (average = 32%, range = 2 – 81 %)."
So, that is the type of information I would be looking for just quickly thinking about it. There might be other ways to calculate this, but summarize_taxa.py was a way that I knew to do this.
Cheers,
Carly
Thanks, that's helpful. Yes, there are some really fast ways to do this with data.table. When I get a chance I'll send an example and see if it fits the use-case.
Also, for specific mentions in a manuscript body text the table is nice, but it seems to me a more useful, if not always published, use case would be a graphic that summarizes this information. This is possible to do already with the scatter or bar panels in shiny-phyloseq. But a revamp of these with data.table for this purpose is probably worthwhile.
Thanks again
joey
The following is a prototype. Let me know what you think.
library("phyloseq")
library("data.table")
library("ggplot2")
fast_melt = function(physeq){
# supports "naked" otu_table as `physeq` input.
otutab = as(otu_table(physeq), "matrix")
if(!taxa_are_rows(physeq)){otutab <- t(otutab)}
otudt = data.table(otutab, keep.rownames = TRUE)
setnames(otudt, "rn", "taxaID")
# Enforce character taxaID key
otudt[, taxaIDchar := as.character(taxaID)]
otudt[, taxaID := NULL]
setnames(otudt, "taxaIDchar", "taxaID")
# Melt count table
mdt = melt.data.table(otudt,
id.vars = "taxaID",
variable.name = "SampleID",
value.name = "count")
# Remove zeroes, NAs
mdt <- mdt[count > 0][!is.na(count)]
# Calculate relative abundance
mdt[, RelativeAbundance := count / sum(count), by = SampleID]
if(!is.null(tax_table(physeq, errorIfNULL = FALSE))){
# If there is a tax_table, join with it. Otherwise, skip this join.
taxdt = data.table(as(tax_table(physeq, errorIfNULL = TRUE), "matrix"), keep.rownames = TRUE)
setnames(taxdt, "rn", "taxaID")
# Enforce character taxaID key
taxdt[, taxaIDchar := as.character(taxaID)]
taxdt[, taxaID := NULL]
setnames(taxdt, "taxaIDchar", "taxaID")
# Join with tax table
setkey(taxdt, "taxaID")
setkey(mdt, "taxaID")
mdt <- taxdt[mdt]
}
return(mdt)
}
summarize_taxa = function(physeq, Rank, GroupBy = NULL){
Rank <- Rank[1]
if(!Rank %in% rank_names(physeq)){
message("The argument to `Rank` was:\n", Rank,
"\nBut it was not found among taxonomic ranks:\n",
paste0(rank_names(physeq), collapse = ", "), "\n",
"Please check the list shown above and try again.")
}
if(!is.null(GroupBy)){
GroupBy <- GroupBy[1]
if(!GroupBy %in% sample_variables(physeq)){
message("The argument to `GroupBy` was:\n", GroupBy,
"\nBut it was not found among sample variables:\n",
paste0(sample_variables(physeq), collapse = ", "), "\n",
"Please check the list shown above and try again.")
}
}
# Start with fast melt
mdt = fast_melt(physeq)
if(!is.null(GroupBy)){
# Add the variable indicated in `GroupBy`, if provided.
sdt = data.table(SampleID = sample_names(physeq),
var1 = get_variable(physeq, GroupBy))
setnames(sdt, "var1", GroupBy)
# Join
setkey(sdt, SampleID)
setkey(mdt, SampleID)
mdt <- sdt[mdt]
}
# Summarize
summarydt = mdt[, list(meanRA = mean(RelativeAbundance),
sdRA = sd(RelativeAbundance),
minRA = min(RelativeAbundance),
maxRA = max(RelativeAbundance)),
by = c(Rank, GroupBy)]
return(summarydt)
}
plot_taxa_summary = function(physeq, Rank, GroupBy = NULL){
# Get taxa summary table
dt1 = summarize_taxa(physeq, Rank = Rank, GroupBy = GroupBy)
# Set factor appropriately for plotting
RankCol = which(colnames(dt1) == Rank)
setorder(dt1, -meanRA)
dt1[, RankFac := factor(dt1[[Rank]],
levels = rev(dt1[[Rank]]))]
dt1[, ebarMax := max(c(0, min(meanRA + sdRA))), by = eval(Rank)]
dt1[, ebarMin := max(c(0, min(meanRA - sdRA))), by = eval(Rank)]
# Set zeroes to one-tenth the smallest value
ebarMinFloor = dt1[(ebarMin > 0), min(ebarMin)]
ebarMinFloor <- ebarMinFloor / 10
dt1[(ebarMin == 0), ebarMin := ebarMinFloor]
pRank = ggplot(dt1, aes(x = meanRA, y = RankFac)) +
scale_x_log10() +
xlab("Mean Relative Abundance") +
ylab(Rank) +
theme_bw()
if(!is.null(GroupBy)){
# pRank <- pRank + facet_wrap(facets = as.formula(paste("~", GroupBy)))
pRank <- pRank + geom_point(mapping = aes_string(colour = GroupBy),
size = 5)
} else {
# Don't include error bars for faceted version
pRank <- pRank + geom_errorbarh(aes(xmax = ebarMax,
xmin = ebarMin))
}
return(pRank)
}
# Test
data("GlobalPatterns")
plot_taxa_summary(GlobalPatterns, "Phylum")
plot_taxa_summary(GlobalPatterns, "Phylum", "SampleType")
summarize_taxa(GlobalPatterns, "Phylum")
summarize_taxa(GlobalPatterns, "Phylum", "SampleType")
Thanks! The plots are informative and the output is exactly what I was looking for.
My only concern is that I am not getting similar results in relative abundance from what I got with QIIME.
When I use summarize_taxa, and then sum the meanRA it does not total anywhere close to 1. It sums to 0.053 for my data that I am using. I can see why it might not sum to exactly 1 since you are taking the average for each sample, but I would think that summing the average relative abundance would get close to 1.
I wonder if it has something to do with removing zeros in the fast_melt function before doing the relative abundance. Don't you want to leave zeros in? Hmm...so I tried altering the code and removing the remove zeros line and that made the mean RA sum be 0.0066 so even lower....
For GP dataset you get 0.008.
taxa2 <- summarize_taxa(GlobalPatterns, "Phylum") sum(taxa2$meanRA)
Thoughts?
There's no reasons the means should sum to 1.0 in this data. They are means across samples for each Phylum, and in some cases the data is very skewed. Note for instance Proteobacteria, where the range is from 1e-4
to >0.76. Including the zeros "drags downward" the mean even further, which is what you observed when you tried that.
For this dataset it probably only makes sense to consider central values (e.g. mean) among microbiomes that are comparable, for instance the same SampleType
in Global Patterns. I can't see much biological meaning in the mean value of relative abundance across samples that in some cases have very high proteobacteria, while in other cases never have any. The mean in a case like this ends up between 2 or more distributions that are far apart, and while numerically accurate, doesn't represent the observed relative abundance in either.
This is what motivated me to also include the GroupBy
argument. Try it out and see if those values are more meaningful within each microbiome class.
I see what you mean about zeroes, because this changes the calculation for each mean. It's inefficient to try to force the zeroes in, and we already know what the denominator should be in every case, sum(Nsamples)
.
The following lines
# Summarize
summarydt = mdt[, list(meanRA = mean(RelativeAbundance),
sdRA = sd(RelativeAbundance),
minRA = min(RelativeAbundance),
maxRA = max(RelativeAbundance)),
by = c(Rank, GroupBy)]
could be changed to
# Summarize
Nsamples = nsamples(physeq)
summarydt = mdt[, list(meanRA = sum(RelativeAbundance)/Nsamples,
sdRA = sd(RelativeAbundance),
minRA = min(RelativeAbundance),
maxRA = max(RelativeAbundance)),
by = c(Rank, GroupBy)]
and it would address your concerns about zero-entries being missing in the sparse (long-form) table.
For these same reasons sd
should be interpreted with a grain of salt... which is recommended anyway, given the data. It is a sd
of the RAs of non-zero entries only, and therefore representative only when most of the entries are non-zero. On the other hand, when you have a large fraction of zero values, it would still be difficult to interpret directly with zeroes left in because it is unknown if the taxa are actually missing, or present but unobserved at this sequencing depth.
Anyway, try out this change, and see if you now get the same values as QIIME (for what it's worth).
I used the sum to one to figure out that the calculations seem to be off. While the data can be skewed, the total sum of the mean relative abundance for each phylum or each level of choice should be close to one. You are taking the average so it should be a central value for each phylum, and then when you add up all the central values you should get close to one. If the data is skewed for a lot of phyla I could see the value being off but not coming to 0.008 as in the Global Patterns dataset. It just seemed off. I see your point for by sampletype argument, and thanks for adding that.
Great! Yes, when I made the suggested change to summarize_taxa it actually gives the same values as I got before. Cool!
And when I do the suggested change the sum of the meanRA of all phyla sum to 1!
sum(taxa$meanRA) [1] 1
Now, there are still values that appear to be off
As you can see the minRA and maxRA are incorrect. The mean is .877 and the sd is 0.09, which both look accurate from looking at my data. There is no way the maxRA could be 0.76 then. Also, when I look at the data, the lowest RA for a sample for Proteobacteria is 0.66 and the highest is 0.98. Not sure what is off there.
Thanks for the help!
Carly
Hi,
Is this function added to the Phyloseq package? I tried "summarize_taxa" and got "No documentation for ‘summarize_taxa’ in specified packages and libraries: you could try ‘??summarize_taxa’" error message. When I searched for it no results found. Following table is what I would like to have for exporting to a text file for statistical test etc. The first column is taxa rank (Phylum in this case) and each column following is the abundance of a sample (I don't know how to keep the tab separated format). I searched for several days and was not able to find any example. Thank you for your help! Guoyan
summarize_taxa(ot[, 1:5], tax[, 2]) CL3 CC1 SV1 M31Fcsw M11Fcsw ABY1_OD1 11 33 0 0 0 AC1 0 0 0 0 0 Acidobacteria 269202 183132 92863 120 115 Actinobacteria 39601 90280 121703 2540 841 AD3 1299 15577 17 2 1 Armatimonadetes 782 1124 1972 0 1
@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.
Hi Kevin,
Thank you for your quick response!
Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!
Best, Guoyan
Sent from my iPhone
On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:
@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Just add this in your code before you use the function in any particular session type:
phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) { if(length(taxonomic.ranks) > 1) { names(taxonomic.ranks) <- taxonomic.ranks llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata) } else { taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character') sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa) phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE)) } }
On Wed, Sep 14, 2016 at 2:21 PM, guoyanzhao notifications@github.com wrote:
Hi Kevin,
Thank you for your quick response!
Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!
Best, Guoyan
Sent from my iPhone
On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:
@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/418#issuecomment-247010554, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvdeb2C0200A27nYz8IYPliEDWb4mks5qp_TpgaJpZM4DKkDy .
Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
Sorry for taking me so long to get back to you. Here is what worked for me:
glom <- tax_glom(ps0, taxrank = 'Phylum')
dat <- psmelt(glom)
dat$Phylum <- as.character(dat$Phylum)
Phylum_abundance <- aggregate(Abundance~Sample+Phylum, dat, FUN=sum)
library(reshape) Phylum_abundance <- cast(Phylum_abundance, Sample ~ Phylum)
| Sample | pActinobacteria | pArmatimonadetes | pBacteroidetes | pCyanobacteria | p__Firmicutes | | G35354 | 8493 | 0 | 3856 | 0 | 7333 | | G35355 | 8151 | 0 | 15331 | 0 | 12277 | | G35356 | 39271 | 0 | 2251 | 0 | 4517 | | G35370 | 7253 | 0 | 213 | 0 | 9851 |
It would be nice if can do this at a specific level (order, family etc.) for a given phylum (e.g. p__Actinobacteria) . Haven't figure out how to do it. Thanks,Guoyan From: Susan Holmes notifications@github.com To: joey711/phyloseq phyloseq@noreply.github.com Cc: guoyanzhao gy_zhao@yahoo.com; Comment comment@noreply.github.com Sent: Wednesday, September 14, 2016 8:41 AM Subject: Re: [joey711/phyloseq] Summarizing taxa, like in QIIME? (#418)
Just add this in your code before you use the function in any particular session type:
phyloseq_summarize_taxa <- function(psdata, taxonomic.ranks = rank_names(psdata)) { if(length(taxonomic.ranks) > 1) { names(taxonomic.ranks) <- taxonomic.ranks llply(taxonomic.ranks, phyloseq_summarize_taxa, psdata = psdata) } else { taxa <- as(tax_table(psdata)[, taxonomic.ranks], 'character') sum_tax_table <- summarize_taxa(as(otu_table(psdata), 'matrix'), taxa) phyloseq(otu_table(sum_tax_table, taxa_are_rows = TRUE), sample_data(psdata, FALSE)) } }
On Wed, Sep 14, 2016 at 2:21 PM, guoyanzhao notifications@github.com wrote:
Hi Kevin,
Thank you for your quick response!
Is there something like that implemented in Phyloseq? Can the sum of a particular rank of each sample be printed out as a table? I'm new to Phyloseq. Tried very hard to find examples. This thread is the closest I found. Thank you for your help!
Best, Guoyan
Sent from my iPhone
On Sep 13, 2016, at 11:07 PM, Kevin C Lee notifications@github.com wrote:
@guoyangzhao No, it's a function defined by and3k at the beginning of the thread.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/418#issuecomment-247010554, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvdeb2C0200A27nYz8IYPliEDWb4mks5qp_TpgaJpZM4DKkDy .
Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/ — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Is there a way to "Group By" more than one variable in the summarize_taxa and plot_taxa_summary. I found edit when all means add to 1 really helpful. Being able to summarize by more than one groups will really help further analysis.
Thanks!
So, _summarizetaxa function was never implemented in the package in the end? Or there is other issue covering this feature implementation?
Hello and thank you guys for creating this function. Ive been using the summarize function and code from guoyanzhao and It
s worked well for all phyla and as well as looking at the orders from the proteobacteria phylum. However, i can`t retrieve the data for lower level taxa e.g. Class,Order,Family....
The problem lies when I try and run the as.character function and get the error code: Error in $<-.data.frame
(*tmp*
, Class, value = c("Alphaproteobacteria", :
replacement has 2750 rows, data has 1400
I thought this error was being caused by the NAs in the data, so i got rid of them by using the function phyloseq_rm_na_tax. https://rdrr.io/github/vmikk/metagMisc/src/R/physeq_rm_na_tax.R. Though this doesn
t seem to help at all. Any help with this would be much appreciated :)
This is the code that im running: glomclass <- tax_glom(finalps.ass , taxrank = 'Class') cleanglomclass <- phyloseq_rm_na_tax(glomclass) datclass <- psmelt(cleanglomclass) dat$Class <- as.character(datclass$Class) Error in
$<-.data.frame(
tmp`, Class, value = c("Alphaproteobacteria", :
replacement has 2750 rows, data has 1400
Hi!
does phyloseq have a function that can summarize by taxonomic rank, like QIIME (http://qiime.org/scripts/summarize_taxa.html)?
I didn’t find anything, so I made my own functions, see below. (I you think its usefull feel free to add it to phyloseq.)
Keep up the great work!
Thanks, Bela
This does not do much input validation and assumes that taxa_are_rows == TRUE (because that’s how my data always looks like)