joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
587 stars 187 forks source link

Creating 100% stacked bar plots with less abundant taxa in a sub-category #901

Closed brookeweigel closed 6 years ago

brookeweigel commented 6 years ago

Hello! I tried to follow the logic from the following issue (https://github.com/joey711/phyloseq/issues/494)... using tips from @audy and @Nick243. But for some reason, when I try creating a new data frame with the phyla condensed into low abundance (< 1%) taxa, I end up with some samples that do not reach 100% on the bar graph... and I cannot figure out how some of the values went "missing" so that relative abundance does not total 100%. See the two plots compared below, 1) for the condensed phyla bar plot, and 2) for the normal relative abundance bar plot (which reaches 100% on the y axis for all samples). Any help would be very useful! Thanks!!

merge into one phyloseq object

physeq = phyloseq(OTU, TAX, META, phy_tree) physeq

Transform into relative abundances + filter to a mean threshold

physeq2 = filter_taxa(physeq, function(x) mean(x) > 0.1, TRUE) physeq2 physeq3 = transform_sample_counts(physeq2, function(x) x / sum(x) ) physeq3

choose samples (only stomach)

stomach <- subset_samples(physeq3, Sample_type=="Stomach") stomach

create dataframe from phyloseq object for stomach samples... to graph samples without condensed phyla

stom <- psmelt(stomach)

Condense low abundance taxa into an "Other" category for the barplot

Turn all OTUs into phylum counts

glom <- tax_glom(stomach, taxrank = 'phylum') glom # should list # taxa as # phyla data <- psmelt(glom) # create dataframe from phyloseq object data$phylum <- as.character(data$phylum) #convert to character

simple way to rename phyla with < 1% abundance

data$phylum[data$Abundance < 0.01] <- "< 1% abund."

I also tried using this method, based on median abundance threshold... either way, I still end up with sample that do not reach 100% on the bar graph

group dataframe by Phylum, calculate median rel. abundance

medians <- ddply(data, ~phylum, function(x) c(median=median(x$Abundance)))

find Phyla whose rel. abund. is less than 1%

remainder <- medians[medians$median <= 0.01,]$phylum

list of low abundance phyla

remainder

change their name to "Phyla < 1% abund."

data[data$phylum %in% remainder,]$Phylum <- "Phyla < 1% abund."

rename phyla with < 1% relative abundance

data$phylum[data$Abundance < 0.01] <- "Phyla < 1% abund."

plot with condensed phyla into "< 1% abund" category

p <- ggplot(data=data, aes(x=Sample, y=Abundance, fill=phylum)) p + geom_bar(aes(), stat="identity", position="stack") + scale_fill_manual(values = c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue", "royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")) + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

stomach_grouped_low_abund

compare to plot of same data, without condensed phyla

p <- ggplot(data=stom, aes(x=Sample, y=Abundance, fill=phylum)) p + geom_bar(aes(), stat="identity", position="stack") + scale_fill_manual(values = c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue", "royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey", "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue", "royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")) + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

stomach_no_groups

brookeweigel commented 6 years ago

Nevermind... I figured it out! I had some unassigned (blank) taxonomy cells... once I renamed those to "unknown", the gaps went away! Maybe this will be useful to somebody as well...

KMKemp commented 6 years ago

I solved this problem in a slightly different way using:

# Transform to relative abundance
AID_norm <- transform_sample_counts(AID_R, function(x) 100 * x/sum(x))

# Compile taxa by Order (filtering out low abundance taxa)
AID_Orders <- AID_norm  %>%
  tax_glom(taxrank = "Order") %>%                     # agglomerate taxa at order level
  psmelt() %>%                                        # Melt phyloseq object to long format for producing graphics with ggplot2
  filter(Abundance > 1.0)  %>%                        # Filter out orders below 1% in each sample
  arrange(desc(Phylum),desc(Class),desc(Order))

# Sum remaining taxa with a relative abundance < 1% and make a new dataframe
Remainders <- (AID_Orders) %>%
  group_by(SampleID,Mouse,Sex,Diet,Week,Description) %>% 
  summarise(Abundance = (100-sum(Abundance))) %>% 
  as.data.frame()
Remainders$Order<-"Orders < 1%"
Remainders$Phylum<-"_Orders < 1%"
Remainders$Class<-"_Orders < 1%"
Remainders$Kingdom<-"Bacteria"

# Join dataframes
AID_barchart <- full_join(AID_Orders,Remainders)

# reorder based on phylogeny
AID_barchart <- AID_barchart[with(AID_barchart, order(Phylum,Class,Order)),]

# lock in Order level
AID_barchart$Order <- factor(AID_barchart$Order, levels = AID_barchart$Order)

Then I plotted the stacked bar chart with ggplot2

saifsikdar commented 6 years ago

I figured it out. Just in case anybody needs to know. You just need to transform the phyloseq object using the psmelt() function and then you can generate the graph using ggplot() and command position = "fill" argument inside geom_bar() function. For example: ggplot(data = psmelt(phyloseq_object), mapping = aes_string(x = "variable of your data frame",y = "Abundance")) + geom_bar(aes(color=taxon name, fill=taxon name), stat="identity", position="fill")

rahulnccs commented 6 years ago

Even I am facing the same problem. I tried by removing unassigned taxa manually but still, the output is same. @brookeweigel how did you remove the blank taxonomic cells?

Thanks.

Nat211 commented 6 years ago

@brookeweigel I would also like to know which code you used to remove the unassigned taxa, I am having the same issue. Thanks for sharing!

brookeweigel commented 6 years ago

Hello @rahulnccs and @Nat211! It has been a little while since I looked at this, but here is my recent code that worked! It starts with a phyloseq object and ends with a nice looking stacked bar graph, using ggplots in R with the phyloseq package.

load libraries

library(ggplot2) library(phyloseq) library(tidyverse)

transform sample counts

physeq3 = transform_sample_counts(physeq2, function(x) x / sum(x) ) physeq3

Turn all OTUs into class (or phylum or order level) counts

glom <- tax_glom(physeq3, taxrank = 'class') glom # should list # taxa as # phyla data_glom<- psmelt(glom) # create dataframe from phyloseq object data_glom$class <- as.character(data_glom$class) #convert to character

simple way to rename phyla with < 1% abundance

data_glom$class[data_glom$Abundance < 0.01] <- "< 1% abund."

Count # phyla to set color palette

Count = length(unique(data_glom$class)) Count

To change order of bacterial classes, use factor() to change order of levels (for example, maybe you want to put the most abundant taxa first)... Use unique(data_glom$class) to see exact spelling of each class

unique(data_glom$class)

data_glom$class <- factor(data_glom$class, levels = c("Alphaproteobacteria", "Betaproteobacteria", "Deltaproteobacteria", "Gammaproteobacteria", "Flavobacteriia", "Verrucomicrobiae", "Planctomycetia", "Saprospirae", "Sphingobacteriia", "BD1-5", "Cytophagia", "Acidobacteriia", "OM190", "TA18", "Unknown", "< 1% abund."))

plot with condensed phyla into "unknown" category

spatial_plot <- ggplot(data=data_glom, aes(x=Sample, y=Abundance, fill=class)) + facet_grid(~Location, scales = "free") spatial_plot + geom_bar(aes(), stat="identity", position="stack") + scale_fill_manual(values = c("royalblue4", "deepskyblue", "blue", "cyan2", "darkorchid", "gold1", "forestgreen", "firebrick", "mediumspringgreen", "darkorange1", "saddlebrown", "deeppink", "slategray2", "seagreen", "snow4", "black")) + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

I hope this helps everyone!! :)

Slamazon4 commented 4 years ago

Hi @brookeweigel,

This code is great thank you! It was exactly what I'm looking for.

However, I'm still having the issue of the bar graph not reaching 100%.

Summer_squeeze

I've checked the Abundance values, there were a few that were 0 in the dataframe so I removed these rows, but I still seem to be having the same issues.

This is my code:

Squeeze samples only/over time

phy.squeeze = subset_samples(phy.3, Section == "Squeeze") phy.squeeze <- prune_taxa(taxa_sums(phy.squeeze) > 0, phy.squeeze)

Transform sample counts

phy.squeeze = transform_sample_counts(phy.squeeze, function(x) x / sum(x))

Turn all OTUs into class (or phylum or order level) counts

glom <- tax_glom(phy.squeeze, taxrank = "Phylum") glom # should list # taxa as # phyla data_glom <- psmelt(glom) # create dataframe from phyloseq object data_glom$Phylum <- as.character(data_glom$Phylum) #convert to character

Simple way to rename phyla with < 1% abundance

data_glom$Phylum[data_glom$Abundance < 0.01] <- "< 1% abund."

Remove any rows with 0 values for Abundance

Set zeros to NA

data_glom[data_glom==0] <- NA

Delete rows associated with NA

data_glom <- data_glom[complete.cases(data_glom),]

Count # phyla to set color palette

Count = length(unique(data_glom$Phylum)) Count

To change order of bacterial classes, use factor() to change order of levels (for example, maybe you want to put the most abundant taxa first)... Use unique(data_glom$class) to see exact spelling of each class

unique(data_glom$Phylum)

data_glom$Phylum <- factor(data_glom$Phylum, levels = c("Proteobacteria", "Firmicutes", "Tenericutes", "Bacteroidetes", "Actinobacteria", "Fusobacteria", "Patescibacteria", "Planctomycetes", "Verrucomicrobia", "Epsilonbacteraeota", "< 1% abund."))

plot with condensed phyla into "unknown" category

spatial_plot <- ggplot(data=data_glom, aes(x=Scores, y=Abundance, fill=Phylum)) + facet_grid(~Season, scales = "free") spatial_plot + geom_bar(aes(), stat="identity", position="stack") + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

brookeweigel commented 4 years ago

Hi @Slamazon4, it looks like your abundance values are not between 0-1, (they range from 0-20+), so the transform_sample_counts must not have worked properly. I can't find the exact reason for your error, but sometimes with phyloseq I have to clear my workspace environment in R and start over, because maybe you re-ran a previous code that wrote over the transformed sample count data. I'm sorry that I can't offer any better advice, but I think that the transform sample count step is the source of your problem (why the bar graph doesn't reach 100%). I hope you get it to work!

Slamazon4 commented 4 years ago

Thanks for getting back to me so quick @brookeweigel !

I did start fresh and run the code again, but the same thing happened. I checked the abundance values and they are all between 0 and 1 so I don't think that's it unfortunately :(

Do you think there's a chance it might be because the transformation to relative abundance is done prior to creating the dataframe, and for some reason when I agglomerate at Phylum level there are a few ASVs/features that come up with 0 values (which I'm removing) but they were included in the calculation for a 100%?

And perhaps also using facet_grid on the data for the "Season" variable splits the data again so it further distributes the 100% calculation?

I'm not sure if I'm making any sense, haha, but if you have any insight let me know.

PatoUru commented 4 years ago

Hi @brookeweigel! I followed you script step by step but... I'm still having the issue of the bar graph not reaching 100%. I didn't found where I have to change the unassigned (blank) taxonomy cells. It's a nice barplot but with some problem! image

The script that I used was: `> #Turn all OTUs into class (or phylum or order level) counts

glom <- tax_glom(tot2.ra, taxrank = 'Class') glom # should list # taxa as # phyla phyloseq-class experiment-level object otu_table() OTU Table: [ 132 taxa and 14 samples ] sample_data() Sample Data: [ 14 samples by 12 sample variables ] tax_table() Taxonomy Table: [ 132 taxa by 8 taxonomic ranks ] data_glom<- psmelt(glom) # create dataframe from phyloseq object data_glom$Class <- as.character(data_glom$Class) #convert to character

simple way to rename phyla with < 1% abundance

data_glom$Class[data_glom$Abundance < 1] <- "< 1% abund."

Count # phyla to set color palette

Count = length(unique(data_glom$Class)) Count [1] 41

To change order of bacterial classes, use factor() to change order of levels (for example,

maybe you want to put the most abundant taxa first)... Use unique(data_glom$class) to see exact spelling of each class

unique(data_glom$Class) [1] "cClostridia" "cGammaproteobacteria" "cAlphaproteobacteria" "cDesulfotomaculia" "cSynergistia"
[6] "c
Negativicutes" "cBacilli" "cAnaerolineae" "cParcubacteria" "cPlanctomycetes"
[11] "cVerrucomicrobiae" "cThermoanaerobacteria" "cSyntrophomonadia" "cVicinamibacteria" "cThermotogae"
[16] "c
Syntrophia" "cSpirochaetia" "cLeptospirae" "cKiritimatiellae" "cSyntrophorhabdia"
[21] "cFermentibacteria" "cActinobacteria" "cAminicenantia" "cPhycisphaerae" "cBacteroidia"
[26] "c
Subgroup_18" "cChlamydiae" "cBRH-c20a" "cSaccharimonadia" "cLentisphaeria"
[31] "c__Incertae_Sedis" "cDadabacteriia" "cABY1" "cDehalococcoidia" "cChloroflexia"
[36] "cMoorellia" "cuncultured" "cHydrogenedentia" "cPolyangia" "c__MVP-15"
[41] "< 1% abund."

plot with condensed phyla into "unknown" category

spatial_plot <- ggplot(data=data_glom, aes(x=Exp, y=Abundance, fill=Class)) + #facet_grid(~otros, scales = "free")

  • spatial_plot +

  • geom_bar(aes(), stat="identity", position="stack") +
  • scale_fill_manual(values = c("royalblue4", "deepskyblue", "blue", "cyan2", "darkorchid",

  • "gold1", "forestgreen", "firebrick", "mediumspringgreen", "darkorange1",

  • "saddlebrown", "deeppink", "slategray2", "seagreen", "snow4", "black")) +

  • scale_fill_manual(values = c("#70dc85", "#78ac83", "#006602", "#7e5b00", "#5d7fff", "#cb88a6", "#a152da", "#0079b9", "#81d33d",.......................")) +
  • theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5)) spatial_plot2 <- spatial_plot + scale_x_discrete(limits = newSTorder) spatial_plot2`
Slamazon4 commented 4 years ago

@brookeweigel @PatoUru

I figured out what my issue was. I needed to plot 'samples' on the x axis for the script to work. I was trying to plot one of my categorical variables instead.

So instead of this:

spatial_plot <- ggplot(data=data_glom, aes(x=Scores, y=Abundance, fill=Phylum)) + facet_grid(~Season, scales = "free") spatial_plot + geom_bar(aes(), stat="identity", position="stack") + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

I did this:

spatial_plot <- ggplot(data=data_glom, aes(x=samples, y=Abundance, fill=Phylum)) + face_grid(~Season, scales = "free") spatial_plot + geom_bar(aes(), stat="identity", position="stack") + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=5))

If I wanted to plot my variables along the x axis I had to transform my data x2. For example, if I wanted to plot by season:

Transform sample counts phy.squeeze = transform_sample_counts(phy.squeeze, function(x) x / sum(x))

Turn all OTUs into class (or phylum or order level) counts glom.phylum <- tax_glom(phy.squeeze, taxrank = "Phylum")

And then merge samples for whatever variable I want to plot along the x axis glom.season <- merge_samples(glom.phylum, "Season"

Transform counts again glom.season.phylum = transform_sample_counts(glom.season function(x) x / sum(x)) data_glom <- psmelt(glom.season.phylum) # create dataframe from phyloseq object

Now I can plot my desired variable along the x axis season <- ggplot(data=data_glom, aes(x=Season, y=Abundance, fill=Phylum)) + geom_bar(aes(),stat="identity", position="stack") + scale_fill_manual(values = c("royalblue3","darkgoldenrod1", "cyan2", "darkorchid4", "black")) + theme(legend.position="bottom")

gagecoon commented 2 years ago

@PatoUru

I ran the tax_glom line with my raw counts phyloseq object and then converted that output to relative abundance rather than using relative abundance to select for one division of taxonomy. This resolved my issue with having the plot come up short of 1.0, hope this helps.

Hina1112 commented 1 year ago

Hi, i need help. When i plot my otu, my graph plot showed the same group sample apart. How can I arrange them, so that they come next to each other? e.g I have LF1, LF2, LF3 AND HF1,HF2,HF3.. the bar graph comes out like HF1, LF2,LF3,HF2,HF3.

sethyyyyy commented 4 months ago

Thanks for the help! I had a similar problem