joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

relative abundance greater than one #1089

Open FADHLyemen opened 5 years ago

FADHLyemen commented 5 years ago

The max relative abundance for samples are 1, but when I plot bar plot for case vs control is greater than one. What do you suggest? physeq11 = transform_sample_counts(physeq1 , function(OTU) OTU/sum(OTU) ) physeq11 = filter_taxa(physeq11, function(x) mean(x) > 1e-2, TRUE) plot_bar(physeq11,fill="Rank1")

image

plot_bar(physeq11,fill="Rank1",x="Status") image

I do not know if I have to recount the RA for case and control separate. I appreciate any thoughts.

mikemc commented 5 years ago

By setting x = "Status", you are going to be combining samples with the same Status. I suggest reading the documentation for ggplot's geom_bar function and https://r4ds.had.co.nz/data-visualisation.html to understand more what is happening here.

Perhaps you can better explain what you'd like to happen. Is your goal to combine all case samples into a single column? Or do you just want to group the case samples and the control samples? For the grouping while keeping each sample distinct, you could do

library(ggplot2)
plot_bar(physeq11,fill="Rank1") +
  facet_wrap(~Status, scales = "free_x")

or just

plot_bar(physeq11,fill="Rank1", facet_grid = . ~ Status)

as shown in https://joey711.github.io/phyloseq/plot_bar-examples.html

FADHLyemen commented 5 years ago

Thank you, I am trying to replot the below fig using phyloseq image

I used the below line: plot_bar(physeq11,fill="Rank1",x="Status")

Maybe I have to use transform_sample_counts for case and control separate. I look at phyeloseq function and could not figure out how to do it.

mikemc commented 5 years ago

It sounds like you want to combine the samples with the same "Status" to get one barplot per Status? To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the GlobalPatterns example dataset, first agglomerated to Phylum for speed when calling plot_bar)

library(phyloseq)
data(GlobalPatterns)
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x) x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x) x / sum(x))
plot_bar(ps2, fill="Phylum")

The merge_samples function sums the abundances of samples with the sample SampleType. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType in ps1 will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.

FADHLyemen commented 5 years ago

Waw, it works. Many thanks mikemc image

provirasanz commented 5 years ago

Hi Mikemc,

Thanks for the code below. I wa shaving the same issue than FADHL; it totally worked

library(phyloseq) data(GlobalPatterns) ps <- tax_glom(GlobalPatterns, "Phylum") ps0 <- transform_sample_counts(ps, function(x) x / sum(x)) ps1 <- merge_samples(ps0, "SampleType") ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) plot_bar(ps2, fill="Phylum")

But I have one more challenge I could not resolve yet. How could I do if I want to show only the top 5 taxa in the barplot (not the full list)? I appreciate your help.

Kind regrads

mikemc commented 5 years ago

There are various ways to do this. I personally would do it as follows, with tidyverse functions.

library(dplyr)
library(forcats)
library(ggplot2)

df <- psmelt(ps2)

top_phyla <- df %>%
    group_by(Sample, Phylum) %>%
    summarize(Mean = mean(Abundance)) %>%
    arrange(-Mean)
top_phyla
#> # A tibble: 594 x 3
#> # Groups:   Sample [9]
#>    Sample             Phylum          Mean
#>    <chr>              <fct>          <dbl>
#>  1 Sediment (estuary) Proteobacteria 0.710
#>  2 Freshwater (creek) Cyanobacteria  0.626
#>  3 Feces              Firmicutes     0.548
#>  4 Tongue             Proteobacteria 0.445
#>  5 Mock               Firmicutes     0.414
#>  6 Ocean              Proteobacteria 0.380
#>  7 Freshwater         Actinobacteria 0.371
#>  8 Feces              Bacteroidetes  0.352
#>  9 Soil               Proteobacteria 0.327
#> 10 Skin               Proteobacteria 0.320
#> # … with 584 more rows
top5 <- top_phyla$Phylum[1:5]
df0 <- df %>%
    mutate(Phylum = fct_other(Phylum, top5))
ggplot(df0, aes(Sample, Abundance, fill = Phylum)) +
    geom_col()

Edit: fixed top5_phyla -> top5 in the call to mutate. Thanks to @MaristelaMorais for catching the original mistake.

lixiaopi1985 commented 4 years ago

It sounds like you want to combine the samples with the same "Status" to get one barplot per Status? To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the GlobalPatterns example dataset, first agglomerated to Phylum for speed when calling plot_bar)

library(phyloseq)
data(GlobalPatterns)
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x) x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x) x / sum(x))
plot_bar(ps2, fill="Phylum")

The merge_samples function sums the abundances of samples with the sample SampleType. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType in ps1 will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.

After doing so, my metadata turn into NA for non-numerical values. Any pre data type conversion needed?

mikemc commented 4 years ago

Phyloseq's merge_samples() function does not handle non-numerical metadata unfortunately. You need to create the new metadata table yourself if you want metadata besides the variable that you use for the merging (which is kept in the new sample names)

MaristelaMorais commented 4 years ago

Hi Mike,

I'm using this issue to plot the most abundant taxa from my data, and it is being very helpful to make it. There is a small error on the following command using tidyverse functions:

top5 <- top_phyla$Phylum[1:5] df0 <- df %>% mutate(Phylum = fct_other(Phylum, top5_phyla)) ggplot(df0, aes(Sample, Abundance, fill = Phylum)) + geom_col()

For mutate function the entrance should be the factor top5 and not top5_phyla, and then it works perfectly.

mikemc commented 4 years ago

Thanks for catching my mistake @MaristelaMorais - it is now fixed

pulidofabs commented 4 years ago

Hi mikemc,

I have a quick question for you, given the code above/below

library(phyloseq)
data(GlobalPatterns)
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x) x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x) x / sum(x))
plot_bar(ps2, fill="Phylum")

Can you help me understand a bit more why I need to use transform_sample_counts twice? I had previously calculated relative abundance of the top 5 taxa using the transform_sample_count function once, but when I compare the results to your code (w one modification) I end up with very different values. Per my code, the Y-axis goes from 0-15 but with your code, it goes from 0-0.6. I am slightly confused as to the drastic difference so I just want to make I understand the reasoning behind running the function transform_sample_counts twice.

If you or someone can please help me understand this I would really appreciate it.

#My old script

GenAglomU <- tax_glom(physeqUnburn, "Genus")#Agglomerate by taxa GenTop5U<-names(sort(taxa_sums(GenAglomU), TRUE)[1:5])#get top Gen5pruneU<-prune_taxa(GenTop5U, physeqUnburn) #Prune Gen5RAU <- transform_sample_counts(Gen5pruneU, function(x) x / sum(x))#Rel Abund Gen5TSFU <- merge_samples(Gen5RAU, "TSFdays") #Merge by variable of interest

#Your script modified

ps <- tax_glom(physeqBurn, "Genus") ps0<-names(sort(taxa_sums(ps), TRUE)[1:5]) #get most abundant ones ps1<-prune_taxa(ps0, physeqBurn) ps2 <- transform_sample_counts(ps1, function(x) x / sum(x)) ps3 <- merge_samples(ps2, "TSFdays") ps4 <- transform_sample_counts(ps3, function(x) x / sum(x)) ps5 <- psmelt(ps4)

On another note

found this code online that gives me the same result as your code did, which shows me that I definitely did it wrong, but given your code and the modified code below, I just want to make sure I understand what my code is doing (which I thought I did, but it seems like I am a tad lost).**

#Plot barplot of relative abundance per online post-------------------------------------------------------- GenusB <- tax_glom(physeqBurn, taxrank = 'Genus') # agglomerate taxa (GenTsfB = merge_samples(GenusB, "TSFdays")) # merge samples on sample variable of interest RelGenTsfB <- transform_sample_counts(GenTsfB, function(x) x/sum(x)) #get rel abundance RelGenTsf1B <- psmelt(RelGenTsfB) # create dataframe from phyloseq object RelGenTsf1B$Genus <- as.character(RelGenTsf1B$Genus) #convert to character RelGenTsf1B$Genus[RelGenTsf1B$Abundance < 0.03] <- "< 3% abund." #rename genera with < 3% abundance

mikemc commented 4 years ago

@pulidofabs

Can you help me understand a bit more why I need to use transform_sample_counts twice?

The merging operation simply sums the abundances. If you want the abundances to sum to 1 prior to plotting, it is necessary to transform the abundances to proportions after the merging operation (hence the 2nd call to transform_sample_counts). So then why did I also transform to proportions before merging samples? I did that because I wanted to end up with an unweighted average of the proportions across samples within a sample group. You can certainly skip this step; in that case you would be taking a weighted average of the proportions across samples, where each sample is weighted by its read depth. Either way is reasonable.

kusandeep commented 3 years ago

There are various ways to do this. I personally would do it as follows, with tidyverse functions.

library(dplyr)
library(forcats)
library(ggplot2)

df <- psmelt(ps2)

top_phyla <- df %>%
    group_by(Sample, Phylum) %>%
    summarize(Mean = mean(Abundance)) %>%
    arrange(-Mean)
top_phyla
#> # A tibble: 594 x 3
#> # Groups:   Sample [9]
#>    Sample             Phylum          Mean
#>    <chr>              <fct>          <dbl>
#>  1 Sediment (estuary) Proteobacteria 0.710
#>  2 Freshwater (creek) Cyanobacteria  0.626
#>  3 Feces              Firmicutes     0.548
#>  4 Tongue             Proteobacteria 0.445
#>  5 Mock               Firmicutes     0.414
#>  6 Ocean              Proteobacteria 0.380
#>  7 Freshwater         Actinobacteria 0.371
#>  8 Feces              Bacteroidetes  0.352
#>  9 Soil               Proteobacteria 0.327
#> 10 Skin               Proteobacteria 0.320
#> # … with 584 more rows
top5 <- top_phyla$Phylum[1:5]
df0 <- df %>%
    mutate(Phylum = fct_other(Phylum, top5))
ggplot(df0, aes(Sample, Abundance, fill = Phylum)) +
    geom_col()

Edit: fixed top5_phyla -> top5 in the call to mutate. Thanks to @MaristelaMorais for catching the original mistake.

Thanks @mikemc

Just in case if anyone getting the error in "summarize(Mean = mean(Abundance))" it's because it need to change to "summarise" see https://github.com/tidyverse/dplyr/issues/505

freixas84 commented 3 years ago

Hi Mike- Could you please explain how I can add the metadata after using the merge_sample function? I am having the same issue, the grouping variable disappears after using merge_sample function, which I had to use to fix the rel. abundance proportions being greater than 1.

Phyloseq's merge_samples() function does not handle non-numerical metadata unfortunately. You need to create the new metadata table yourself if you want metadata besides the variable that you use for the merging (which is kept in the new sample names)

mikemc commented 3 years ago

There isn't an easy way to do with phyloseq; you need to manually create the new sample data for non-numeric variables. I created a function merge_samples2() in speedyseq to overcome this limitation which you might want to try.

kusandeep commented 3 years ago

Waw, it works. Many thanks mikemc image

Hi, Is there an efficient way to rearrange the X axis (not alphabetically) in this bar plot. for example in order of - Control , Airswab, Case

Thanks

pdhrati02 commented 3 years ago

Hi @mikemc Thank you for your scripts and insights. I have a small question on similar lines. I have a dataset with 3 groups, each with 3 time points. So as to plot the top 10 for each time point I did the following but it gives me a plot with relative abundance above 100 percent for one group. How is this possible? Is this method I am using wrong? The code used:

PHY_SP2_genus <- tax_glom(ps.infant_t2, taxrank = "Genus") PHY_SP2_genus_abun <- prune_taxa(names(sort(taxa_sums(PHY_SP1_genus), TRUE)[1:10]), PHY_SP1_genus) PHY_SP2_genus_abun mrg_t2 <- merge_samples(PHY_SP2_genus_abun, "Group")

Normalize by number of samples in each sample type

otu_table(mrg_t2) <- otu_table(mrg_t2)[, ]/as.matrix(table(sample_data(ps.infant_t2)$Group))[, 1] p_t2 <- plot_bar(mrg_t2) c_t2 <- p_t2 + geom_bar(aes(fill = Genus), stat = "identity", position = "stack") + scale_fill_brewer(palette = "Set3", "Genus") + ylab("Relative abundance (%)") + xlab("") + theme_bw()

c_t2 + plot_annotation(title = "t2 top 10 relative abundance (%)") & theme(plot.title = element_text(hjust = 0.5))

The plot looks like this: image

Any help would be great. Best DP

termithorbor commented 2 years ago

It sounds like you want to combine the samples with the same "Status" to get one barplot per Status? To recreate the plot, you'd need to tell us how what was done to combine the samples. What makes the most sense is probably to set the abundance of a taxon in the group to its mean proportion in the group. One way to do this is (shown with the GlobalPatterns example dataset, first agglomerated to Phylum for speed when calling plot_bar)

library(phyloseq)
data(GlobalPatterns)
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x) x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x) x / sum(x))
plot_bar(ps2, fill="Phylum")

The merge_samples function sums the abundances of samples with the sample SampleType. I first convert to proportions to avoid weighing samples by their read depth. Then, we need to convert to proportions again, since the total abundance of each SampleType in ps1 will equal the number of samples that were merged. Note, this masks the variation between samples and should go with some visualization of the variation within groups as well.

Hi all, I have some comments to this: sum(X) is not the sample number it is the sum of all values. In the first case it is the sum the reads within each sample and in the second case it is the sum of the abundances: If it would be the sample number you would calculate the mean and not the proportion.

And I am quite sure that this is mathematically also the wrong approch because for genera/species which are not present in all samples the mean is calculated wrong. See these examples:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

A | B | C | Mean |   -- | -- | -- | -- | -- 0.397 | 0.284 | 0.482 | 0.388 |   0.221 | 0.118 |   | 0.170 |   0.382 | 0.598 | 0.518 | 0.499 |   1 | 1 | 1 | 1.057 | Sum

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

A | B | C | Mean |   -- | -- | -- | -- | -- 0.397 | 0.284 | 0.482 | 0.388 |   0.221 | 0.118 | 0 | 0.113 |   0.382 | 0.598 | 0.518 | 0.499 |   1 | 1 | 1 | 1.000 | Sum

The basic problem ist that in the first example the mean is (0.221+0.118)/2 and in the second one which is to my understanding the right approach the mean is basically (0.221+0.118)/3.