Open FADHLyemen opened 5 years ago
By setting x = "Status"
, you are going to be combining samples with the same Status
. I suggest reading the documentation for ggplot
's geom_bar
function and https://r4ds.had.co.nz/data-visualisation.html to understand more what is happening here.
Perhaps you can better explain what you'd like to happen. Is your goal to combine all case samples into a single column? Or do you just want to group the case samples and the control samples? For the grouping while keeping each sample distinct, you could do
library(ggplot2)
plot_bar(physeq11,fill="Rank1") +
facet_wrap(~Status, scales = "free_x")
or just
plot_bar(physeq11,fill="Rank1", facet_grid = . ~ Status)
as shown in https://joey711.github.io/phyloseq/plot_bar-examples.html
Thank you, I am trying to replot the below fig using phyloseq
I used the below line: plot_bar(physeq11,fill="Rank1",x="Status")
Maybe I have to use transform_sample_counts for case and control separate. I look at phyeloseq function and could not figure out how to do it.
It sounds like you want to combine the samples with the same "Status" to get one barplot per Status?
To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the GlobalPatterns
example dataset, first agglomerated to Phylum for speed when calling plot_bar
)
library(phyloseq)
data(GlobalPatterns)
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x) x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x) x / sum(x))
plot_bar(ps2, fill="Phylum")
The merge_samples
function sums the abundances of samples with the sample SampleType
. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType in ps1
will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.
Waw, it works. Many thanks mikemc
Hi Mikemc,
Thanks for the code below. I wa shaving the same issue than FADHL; it totally worked
library(phyloseq) data(GlobalPatterns) ps <- tax_glom(GlobalPatterns, "Phylum") ps0 <- transform_sample_counts(ps, function(x) x / sum(x)) ps1 <- merge_samples(ps0, "SampleType") ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) plot_bar(ps2, fill="Phylum")
But I have one more challenge I could not resolve yet. How could I do if I want to show only the top 5 taxa in the barplot (not the full list)? I appreciate your help.
Kind regrads
There are various ways to do this. I personally would do it as follows, with tidyverse
functions.
library(dplyr)
library(forcats)
library(ggplot2)
df <- psmelt(ps2)
top_phyla <- df %>%
group_by(Sample, Phylum) %>%
summarize(Mean = mean(Abundance)) %>%
arrange(-Mean)
top_phyla
#> # A tibble: 594 x 3
#> # Groups: Sample [9]
#> Sample Phylum Mean
#> <chr> <fct> <dbl>
#> 1 Sediment (estuary) Proteobacteria 0.710
#> 2 Freshwater (creek) Cyanobacteria 0.626
#> 3 Feces Firmicutes 0.548
#> 4 Tongue Proteobacteria 0.445
#> 5 Mock Firmicutes 0.414
#> 6 Ocean Proteobacteria 0.380
#> 7 Freshwater Actinobacteria 0.371
#> 8 Feces Bacteroidetes 0.352
#> 9 Soil Proteobacteria 0.327
#> 10 Skin Proteobacteria 0.320
#> # … with 584 more rows
top5 <- top_phyla$Phylum[1:5]
df0 <- df %>%
mutate(Phylum = fct_other(Phylum, top5))
ggplot(df0, aes(Sample, Abundance, fill = Phylum)) +
geom_col()
Edit: fixed top5_phyla
-> top5
in the call to mutate. Thanks to @MaristelaMorais for catching the original mistake.
It sounds like you want to combine the samples with the same "Status" to get one barplot per Status? To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the
GlobalPatterns
example dataset, first agglomerated to Phylum for speed when callingplot_bar
)library(phyloseq) data(GlobalPatterns) ps <- tax_glom(GlobalPatterns, "Phylum") ps0 <- transform_sample_counts(ps, function(x) x / sum(x)) ps1 <- merge_samples(ps0, "SampleType") ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) plot_bar(ps2, fill="Phylum")
The
merge_samples
function sums the abundances of samples with the sampleSampleType
. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType inps1
will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.
After doing so, my metadata turn into NA for non-numerical values. Any pre data type conversion needed?
Phyloseq's merge_samples()
function does not handle non-numerical metadata unfortunately. You need to create the new metadata table yourself if you want metadata besides the variable that you use for the merging (which is kept in the new sample names)
Hi Mike,
I'm using this issue to plot the most abundant taxa from my data, and it is being very helpful to make it. There is a small error on the following command using tidyverse functions:
top5 <- top_phyla$Phylum[1:5] df0 <- df %>% mutate(Phylum = fct_other(Phylum, top5_phyla)) ggplot(df0, aes(Sample, Abundance, fill = Phylum)) + geom_col()
For mutate function the entrance should be the factor top5 and not top5_phyla, and then it works perfectly.
Thanks for catching my mistake @MaristelaMorais - it is now fixed
Hi mikemc,
I have a quick question for you, given the code above/below
library(phyloseq) data(GlobalPatterns) ps <- tax_glom(GlobalPatterns, "Phylum") ps0 <- transform_sample_counts(ps, function(x) x / sum(x)) ps1 <- merge_samples(ps0, "SampleType") ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) plot_bar(ps2, fill="Phylum")
Can you help me understand a bit more why I need to use transform_sample_counts twice? I had previously calculated relative abundance of the top 5 taxa using the transform_sample_count function once, but when I compare the results to your code (w one modification) I end up with very different values. Per my code, the Y-axis goes from 0-15 but with your code, it goes from 0-0.6. I am slightly confused as to the drastic difference so I just want to make I understand the reasoning behind running the function transform_sample_counts twice.
If you or someone can please help me understand this I would really appreciate it.
#My old script
GenAglomU <- tax_glom(physeqUnburn, "Genus")#Agglomerate by taxa GenTop5U<-names(sort(taxa_sums(GenAglomU), TRUE)[1:5])#get top Gen5pruneU<-prune_taxa(GenTop5U, physeqUnburn) #Prune Gen5RAU <- transform_sample_counts(Gen5pruneU, function(x) x / sum(x))#Rel Abund Gen5TSFU <- merge_samples(Gen5RAU, "TSFdays") #Merge by variable of interest
#Your script modified
ps <- tax_glom(physeqBurn, "Genus") ps0<-names(sort(taxa_sums(ps), TRUE)[1:5]) #get most abundant ones ps1<-prune_taxa(ps0, physeqBurn) ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) ps3 <- merge_samples(ps2, "TSFdays") ps4 <- transform_sample_counts(ps3, function(x) x / sum(x)) ps5 <- psmelt(ps4)
On another note
found this code online that gives me the same result as your code did, which shows me that I definitely did it wrong, but given your code and the modified code below, I just want to make sure I understand what my code is doing (which I thought I did, but it seems like I am a tad lost).**
#Plot barplot of relative abundance per online post-------------------------------------------------------- GenusB <- tax_glom(physeqBurn, taxrank = 'Genus') # agglomerate taxa (GenTsfB = merge_samples(GenusB, "TSFdays")) # merge samples on sample variable of interest RelGenTsfB <- transform_sample_counts(GenTsfB, function(x) x/sum(x)) #get rel abundance RelGenTsf1B <- psmelt(RelGenTsfB) # create dataframe from phyloseq object RelGenTsf1B$Genus <- as.character(RelGenTsf1B$Genus) #convert to character RelGenTsf1B$Genus[RelGenTsf1B$Abundance < 0.03] <- "< 3% abund." #rename genera with < 3% abundance
@pulidofabs
Can you help me understand a bit more why I need to use transform_sample_counts twice?
The merging operation simply sums the abundances. If you want the abundances to sum to 1 prior to plotting, it is necessary to transform the abundances to proportions after the merging operation (hence the 2nd call to transform_sample_counts). So then why did I also transform to proportions before merging samples? I did that because I wanted to end up with an unweighted average of the proportions across samples within a sample group. You can certainly skip this step; in that case you would be taking a weighted average of the proportions across samples, where each sample is weighted by its read depth. Either way is reasonable.
There are various ways to do this. I personally would do it as follows, with
tidyverse
functions.library(dplyr) library(forcats) library(ggplot2) df <- psmelt(ps2) top_phyla <- df %>% group_by(Sample, Phylum) %>% summarize(Mean = mean(Abundance)) %>% arrange(-Mean) top_phyla #> # A tibble: 594 x 3 #> # Groups: Sample [9] #> Sample Phylum Mean #> <chr> <fct> <dbl> #> 1 Sediment (estuary) Proteobacteria 0.710 #> 2 Freshwater (creek) Cyanobacteria 0.626 #> 3 Feces Firmicutes 0.548 #> 4 Tongue Proteobacteria 0.445 #> 5 Mock Firmicutes 0.414 #> 6 Ocean Proteobacteria 0.380 #> 7 Freshwater Actinobacteria 0.371 #> 8 Feces Bacteroidetes 0.352 #> 9 Soil Proteobacteria 0.327 #> 10 Skin Proteobacteria 0.320 #> # … with 584 more rows top5 <- top_phyla$Phylum[1:5] df0 <- df %>% mutate(Phylum = fct_other(Phylum, top5)) ggplot(df0, aes(Sample, Abundance, fill = Phylum)) + geom_col()
Edit: fixed
top5_phyla
->top5
in the call to mutate. Thanks to @MaristelaMorais for catching the original mistake.
Thanks @mikemc
Just in case if anyone getting the error in "summarize(Mean = mean(Abundance))" it's because it need to change to "summarise" see https://github.com/tidyverse/dplyr/issues/505
Hi Mike- Could you please explain how I can add the metadata after using the merge_sample function? I am having the same issue, the grouping variable disappears after using merge_sample function, which I had to use to fix the rel. abundance proportions being greater than 1.
Phyloseq's
merge_samples()
function does not handle non-numerical metadata unfortunately. You need to create the new metadata table yourself if you want metadata besides the variable that you use for the merging (which is kept in the new sample names)
There isn't an easy way to do with phyloseq; you need to manually create the new sample data for non-numeric variables. I created a function merge_samples2()
in speedyseq to overcome this limitation which you might want to try.
Waw, it works. Many thanks mikemc
Hi, Is there an efficient way to rearrange the X axis (not alphabetically) in this bar plot. for example in order of - Control , Airswab, Case
Thanks
Hi @mikemc Thank you for your scripts and insights. I have a small question on similar lines. I have a dataset with 3 groups, each with 3 time points. So as to plot the top 10 for each time point I did the following but it gives me a plot with relative abundance above 100 percent for one group. How is this possible? Is this method I am using wrong? The code used:
PHY_SP2_genus <- tax_glom(ps.infant_t2, taxrank = "Genus") PHY_SP2_genus_abun <- prune_taxa(names(sort(taxa_sums(PHY_SP1_genus), TRUE)[1:10]), PHY_SP1_genus) PHY_SP2_genus_abun mrg_t2 <- merge_samples(PHY_SP2_genus_abun, "Group")
otu_table(mrg_t2) <- otu_table(mrg_t2)[, ]/as.matrix(table(sample_data(ps.infant_t2)$Group))[, 1] p_t2 <- plot_bar(mrg_t2) c_t2 <- p_t2 + geom_bar(aes(fill = Genus), stat = "identity", position = "stack") + scale_fill_brewer(palette = "Set3", "Genus") + ylab("Relative abundance (%)") + xlab("") + theme_bw()
c_t2 + plot_annotation(title = "t2 top 10 relative abundance (%)") & theme(plot.title = element_text(hjust = 0.5))
The plot looks like this:
Any help would be great. Best DP
It sounds like you want to combine the samples with the same "Status" to get one barplot per Status? To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the
GlobalPatterns
example dataset, first agglomerated to Phylum for speed when callingplot_bar
)library(phyloseq) data(GlobalPatterns) ps <- tax_glom(GlobalPatterns, "Phylum") ps0 <- transform_sample_counts(ps, function(x) x / sum(x)) ps1 <- merge_samples(ps0, "SampleType") ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) plot_bar(ps2, fill="Phylum")
The
merge_samples
function sums the abundances of samples with the sampleSampleType
. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType inps1
will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.
Hi all, I have some comments to this: sum(X) is not the sample number it is the sum of all values. In the first case it is the sum the reads within each sample and in the second case it is the sum of the abundances: If it would be the sample number you would calculate the mean and not the proportion.
And I am quite sure that this is mathematically also the wrong approch because for genera/species which are not present in all samples the mean is calculated wrong. See these examples:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">
A | B | C | Mean | -- | -- | -- | -- | -- 0.397 | 0.284 | 0.482 | 0.388 | 0.221 | 0.118 | | 0.170 | 0.382 | 0.598 | 0.518 | 0.499 | 1 | 1 | 1 | 1.057 | Sum
The max relative abundance for samples are 1, but when I plot bar plot for case vs control is greater than one. What do you suggest? physeq11 = transform_sample_counts(physeq1 , function(OTU) OTU/sum(OTU) ) physeq11 = filter_taxa(physeq11, function(x) mean(x) > 1e-2, TRUE) plot_bar(physeq11,fill="Rank1")
plot_bar(physeq11,fill="Rank1",x="Status")
I do not know if I have to recount the RA for case and control separate. I appreciate any thoughts.