joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
582 stars 187 forks source link

How to plot a relative abundance plot with the average percentage for abundance of the taxa? #1129

Open shrutimicrobiota opened 5 years ago

shrutimicrobiota commented 5 years ago

Hi guys, I am new to writing R scripts, therefore need a little help. I have a barplot showing the relative abundance of the topmost dominant bacterial phyla. I would like to know the average percentage of each phylum in my dataset. How could do that? Any help will be appreciated! Thanks :-)

ehickman0817 commented 5 years ago

I too would like help with this. I have read through issues #418 and #1005 but haven't been able to find what I am looking for. I can get the code in #418 to work and get this result:

summarize_taxa(filtered_data1, "Phylum")
            Phylum       meanRA         sdRA        minRA        maxRA
 1:     Firmicutes 1.136554e-03 1.757837e-02 5.226026e-06 9.748357e-01
 2: Actinobacteria 1.518721e-03 1.992568e-02 5.226026e-06 8.308060e-01
 3: Proteobacteria 1.195875e-03 2.002892e-02 5.226026e-06 7.089428e-01
 4:  Bacteroidetes 4.965731e-04 5.476752e-03 5.226026e-06 1.875852e-01
 5:   Fusobacteria 2.998574e-04 1.354886e-03 5.226026e-06 1.648079e-02
 6:  Acidobacteria 2.976221e-05 2.314770e-05 7.467312e-06 6.770538e-05
 7:    Chloroflexi 3.831400e-05 4.994198e-05 1.028066e-05 1.391487e-04
 8:            TM7 1.537751e-04 7.062572e-04 5.226026e-06 8.478775e-03
 9:       [Thermi] 2.054587e-04 5.285998e-04 8.233705e-06 1.876049e-03
10:            SR1 2.539156e-05 2.293604e-05 8.735074e-06 8.319468e-05
11:   Spirochaetes 3.354825e-05 3.055969e-05 7.767533e-06 1.215260e-04
12:  Synergistetes 2.068694e-05 1.425025e-05 9.025760e-06 4.418653e-05

But what I am looking for is to know what the average % abundance is of all samples together -- the number that would contribute to a bar graph such as this: image Where it appears, for example, that Actinobacteria makes up about 45% in the female samples.

Thanks!

mikemc commented 5 years ago

@ehickman0817 See my suggestion here; that seems to be what you are after.

mikemc commented 5 years ago

Another way to do it is, if you're familiar with dplyr, is

library(dplyr)
tb <- psmelt(physeq) %>%
  filter(!is.na(Phylum)) %>%
  group_by(Sex, Sample, Phylum) %>%
  # Add abundances within each phylum and sample
  summarize_at("Abundance", sum) %>%
  # Get phylum proportions within each sample
  summarize_at("Proportion", ~ . / sum(.)) %>%
  # Get the mean proportion each phylum across samples within the `Sex` (Male/Female) variable
  summarize_at("Proportion", mean)

I think you would get the same proportions, though you should check.

ehickman0817 commented 5 years ago

Thanks you so much! That's actually the post I used to help generate my graphs but I didn't think to use it for this. In case anyone else needs this, here is what I did:

ps3 <- tax_glom(filtered_data1, "Phylum")
ps3.top <- merge_less_than_top(ps3, top=4)
ps4 <- merge_samples(ps3.top, "Sex")
summarize_taxa(ps4)
           Phylum      meanRA         sdRA       minRA       maxRA
1: Actinobacteria 0.497401373 0.0924000714 0.432064656 0.562738090
2: Proteobacteria 0.118322140 0.0159394939 0.107051216 0.129593065
3:     Firmicutes 0.367329848 0.0872312474 0.305648042 0.429011655
4:  Bacteroidetes 0.013948907 0.0103055865 0.006661757 0.021236057
5:          Other 0.002997731 0.0004650833 0.002668868 0.003326595

My only remaining question is when I merge samples by sex, then what is the summarize mean indicating? Mean in males or females or both overall? When I perform the analysis only on ps3.top, I get this, which is slightly different:


summarize_taxa(ps3.top, "Phylum")
           Phylum     meanRA        sdRA        minRA      maxRA
1: Actinobacteria 0.50583192 0.248091642 9.166681e-03 0.96852494
2: Proteobacteria 0.11686783 0.206015368 1.212441e-03 0.90571009
3:     Firmicutes 0.35937091 0.242093337 2.936509e-02 0.98749769
4:  Bacteroidetes 0.01513327 0.038105555 1.512013e-05 0.23025235
5:          Other 0.00319475 0.007250804 4.506433e-05 0.03935353
ehickman0817 commented 5 years ago

Ah never mind, I figured it out. Instead of the code I wrote above, when I use this, I get what I want:

ps4 <- subset_samples(ps3.top, Sex=="Female")
ummarize_taxa(ps4, "Phylum")
           Phylum      meanRA        sdRA        minRA      maxRA
1: Actinobacteria 0.432064656 0.277851329 9.166681e-03 0.93190020
2: Proteobacteria 0.129593065 0.244960747 1.583222e-03 0.90571009
3:     Firmicutes 0.429011655 0.295017610 3.206690e-02 0.98749769
4:  Bacteroidetes 0.006661757 0.009330800 3.091286e-05 0.03310275
5:          Other 0.002771517 0.006881246 4.506433e-05 0.03174633
samd1993 commented 4 years ago

I actually like the result when you merge by "Sex" because the min has to be the relative abundance of one treatment and the max has to be the other treatment. Anyone know a way to differentiate the two when running summarize taxa?

ricrocha82 commented 4 years ago

Hi folks,

I was just working on this code to get a table with % abundance per group and I realized that it calculates the % based on all samples. So, I changed it to get the % based on the abundance in each group. Please, have a look and let me know if the code works?? Also, I included a arrange in the function if you want to sort the table by Abundance. Thanks

summarize_taxa = function(physeq, Rank, GroupBy = NULL, arrange = FALSE){
  Rank <- Rank[1]
  if(!Rank %in% rank_names(physeq)){
    message("The argument to `Rank` was:\n", Rank,
            "\nBut it was not found among taxonomic ranks:\n",
            paste0(rank_names(physeq), collapse = ", "), "\n",
            "Please check the list shown above and try again.")
  }
  if(!is.null(GroupBy)){
    GroupBy <- GroupBy[1]
    if(!GroupBy %in% sample_variables(physeq)){
      message("The argument to `GroupBy` was:\n", GroupBy,
              "\nBut it was not found among sample variables:\n",
              paste0(sample_variables(physeq), collapse = ", "), "\n",
              "Please check the list shown above and try again.")
    }
  }
  # Start with psmelt
  ps.melt <- physeq %>%
    tax_glom(taxrank = Rank) %>%                     # agglomerate at 'Rank' level
    psmelt() %>%                                         # Melt to long format
    arrange(Rank)                                     # arrange by 'Rank'
  if(!is.null(GroupBy)){
    columns <- c(GroupBy,Rank)
    # Add the variable indicated in `GroupBy`, if provided.
    summ.df <- ps.melt %>% 
      # group by Rank and Group
      group_by_at(vars(one_of(columns))) %>% 
      # summarise and create a column with the relative abundance
      summarise(Abundance = sum(Abundance)) %>%
      mutate(freq = paste0(round(100 * Abundance / sum(Abundance), 2), "%"))
  }else{# only with 'Rank'
    summ.df <- ps.melt %>%
      group_by_at(Rank) %>%
      summarise(Abundance = sum(Abundance)) %>%
      mutate(freq = paste0(round(100 * Abundance / sum(Abundance), 2), '%'))
  }
  if(arrange == FALSE){
    return(summ.df)
    }else{# sort data with the most abundant in the first line
    summ.df.ar <- summ.df %>% arrange(desc(Abundance))
    return(summ.df.ar)
  }
}
sven9r commented 3 years ago

Another way to do it is, if you're familiar with dplyr, is

library(dplyr)
tb <- psmelt(physeq) %>%
  filter(!is.na(Phylum)) %>%
  group_by(Sex, Sample, Phylum) %>%
  # Add abundances within each phylum and sample
  summarize_at("Abundance", sum) %>%
  # Get phylum proportions within each sample
  summarize_at("Proportion", ~ . / sum(.)) %>%
  # Get the mean proportion each phylum across samples within the `Sex` (Male/Female) variable
  summarize_at("Proportion", mean)

I think you would get the same proportions, though you should check.

Found the thread for my project I'm working on right now. This doesn't work for me at all. After the second " summarize_at" the operation fails. The other solutions wont help me neither.

shakthees commented 2 years ago

image

I have created my relative abundance plot. however, i feel that it is too messy and would like to speperate them by the variables (ie. "CF70", "NUS" etc). i subset the data and created the plot again but i keep getting all the genus in the legend instead of just the ones present in that particular catagory. i tried looking for an answer online to rectify it but couldnt really find an answer to my problem. could someone help me? i have pasted my codes below. in addition, i would also like to combine all the "unclassified" genus" into one just labelled "unclassified". thank you so much in advance! genus plot to check

Genus: D0

ps.sub.D0 <- subset_samples(ps, diet %in% c("D0"))

ps.rel.D0 = transform_sample_counts(ps.sub.D0, function(x) x/sum(x)*100)

agglomerate taxa

glom <- tax_glom(ps.rel.D0, taxrank = 'Genus', NArm = FALSE) ps.melt <- psmelt(glom)

change to character for easy-adjusted level

ps.melt$Genus <- as.character(ps.melt$Genus) ps.melt <- ps.melt %>% group_by(expid,diet, Genus) %>% mutate(median=median(Abundance))

to get the same rows together

ps.melt_sum <- ps.melt %>% group_by(expid,diet, Genus) %>% summarise(Abundance=mean(Abundance))

ggplot(ps.melt_sum, aes(x=expid, y = Abundance, fill = Genus)) + geom_bar(stat = "identity", aes(fill=Genus)) + labs(x="", y="Relative Abundance %") + facet_wrap(~diet, scales= "free_x", nrow=1) + theme_classic() + theme(legend.position = "right", legend.text = element_text(size = 7), strip.background = element_blank(), axis.text.x=element_blank())