joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

Can not create100% stacked bar plots with less abundant taxa #1004

Closed emankhalaf closed 5 years ago

emankhalaf commented 6 years ago

Hello, I would like to create a 100% stacked bar plot for taxa collapsed to the genus level. I did it by using R to calculate the relative abundance at genus level, then picking up the top 20 taxa and extract genera with rel-ab >1% , then move to excel and copy these values as % and group the rest in others column. Actually, it gives me 100% bar graphs, but in publications they are using stacked bar chart generated in R. I found two solutions from github; first (#494) generated the chart but not displayed as 100% barchart. The second (#901) gave me weird graph, only one column and genera are not displayed as a fill to the column, plus when displaying the dataframe content, I found it empty, just have columns' headings??

May any please help me to solve this problem. Best

MSMortensen commented 6 years ago

Hi,

I suggest you take a look at the preDA function in the DAtest package to group the low adundant genera. After that you should be able to either use plot_bar() from phyloseq or melt your data and use ggplot2 directly.

Regards, Martin

emankhalaf commented 6 years ago

I checked the package, it is mainly for differential abundance estimation. But, my question is simply how to generate 100% stacked bars of relative abundance % at a specific taxonomic level. My trial resulted in stacked bars but not 100%, so I did it manually on excel but still not the same quality of figures as in published articles.

MSMortensen commented 6 years ago

This kind of errors are generally caused by small annoying parsing errors somewhere in the code, so if you can share your code and data I can take a look. A link to the graphs you want the output to look like would also be a help.

emankhalaf commented 6 years ago

I used two different methods with different outputs. The first one using the following code:

Transform to relative abundance

silk_workingr <- transform_sample_counts(silk_working, function(x) x/sum(x))
# agglomerate taxa
glom <- tax_glom(silk_workingr, taxrank = 'Genus')

create dataframe from phyloseq object

dat <- psmelt(glom)

convert Genus to a character vector from a factor because R

dat$Genus <- as.character(dat$Genus)
library(plyr)

group dataframe by genus, calculate median rel. abundance

medians <- ddply(dat, ~Genus, function(x) c(median=median(x$Abundance)))

find genera whose rel. abund. is less than 1%

remainder <- medians[medians$median <= 0.01,]$Genus

change their name to "Remainder"

dat[dat$Genus %in% remainder,]$Genus <- 'Remainder'

boxplot

ggplot(dat,
       aes(x=Genus,
           y=Abundance)) + geom_boxplot() + coord_flip()
library(RColorBrewer)

set color palette to accommodate the number of genera

colourCount = length(unique(dat$Genus))
getPalette = colorRampPalette(brewer.pal(9, "Set2"))
p <- ggplot(data=dat, aes(x=FieldBlockReplicate, y=Abundance, fill=Genus))
p + geom_bar(aes(), stat="identity", position="stack") + scale_fill_manual(values=getPalette(colourCount)) + theme(legend.position="right") + guides(fill=guide_legend(nrow=8))

The second method as follow:

extract the taxonomic rank e.g.Genus

silk_genus = tax_glom(silk_working, "Genus") 
tax_table(silk_genus)

calculate relative abundances

silk_gen_ab = transform_sample_counts(silk_genus, function(x) x/sum(x))
silk_gen_ab

divide by the number of samples to get general relative abundance

taxa_sums(silk_gen_ab)/42 

sumf

sample_sums(silk_gen_ab) 

Get the 20 most abundant taxa

top20 = names(sort(taxa_sums(silk_genus), TRUE)[1:20])

silk_genus_top20 = prune_taxa(top20, silk_genus)

I selected top20 because I know my microbiome is dominated by few taxa and the rest are rare / very low relative abundance.

Then, I moved to excel picking up genera with relative abundance > 1% and create an extra column for others where the taxa rel abundance <1% were grouped

The wired thing is that the first method resulted in 5 genera and the second method resulted in 11 taxa??? I would be thankful if you shared your e.mail with me, so I could send you the files. Mine is eimanpharmacist@gmail.com Thank you very much Eman

MSMortensen commented 6 years ago

Hi,

You get diffent numbers of genera that is equal to or higher than 1% because the first method relies on median and the second uses mean.

I suggest the functions below, one uses a percentage threshold and the other a number of taxa, to do the filtering (directly on the phyloseq object):

Merge low abundance taxa

By percentage threshold

merge_low_abundance <- function(pobject, threshold=0.001){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Anything not with top x

merge_less_than_top <- function(pobject, top=10){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),] otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

anabelvonjackowski commented 5 years ago

I am having the same issue as @emankhalaf. @MSMortensen how would I use your code and implement it into the plot_bar function?

MSMortensen commented 5 years ago

@avonjackowski The output of both the functions I have shown are a phyloseq object. Using the plot_bar function directly on that output should work without any problem.

It should look something like this:

ps.gen <- tax_glom(psdata, "Genus") ps.gen.top10 <- merge_less_than_top(ps.gen, top=10)

plot_bar(ps.gen.top.10, fill = "Genus")

anabelvonjackowski commented 5 years ago

@MSMortensen ok that makes a lot of sense - thank you!

However if I try to run:

merge_less_than_top <- function(all_data, top=10){
  transformed <- transform_sample_counts(all_data, function(x) x/sum(x))
  otu.table <- as.data.frame(otu_table(transformed))
  otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),]
  otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),])
  merged <- merge_taxa(transformed, otu.list, 1)
  for (i in 1:dim(tax_table(merged))[1]){
    if (is.na(tax_table(merged)[i,2])){
      taxa_names(merged)[i] <- "Other"
      tax_table(merged)[i,1:7] <- "Other"}
  }
  return(merged)
}

ps.gen <- tax_glom(all_data, "Class")
ps.gen.top10 <- merge_less_than_top(ps.gen, top=10)
plot_bar(ps.gen.top10, fill = "Class")

I get the error: Error in [<-(*tmp*, i, 1:7, value = "Other") : subscript out of bounds. What am I missing now?

MSMortensen commented 5 years ago

@avonjackowski How many taxonomical levels do you have in your tax_table? This seems to be because you do not have less than 7 levels in your tax table. You should edit the number of columns in this line: tax_table(merged)[i,1:7] <- "Other"}

anabelvonjackowski commented 5 years ago

@MSMortensen ok frustration makes one overlook things like that :( . I have 6 levels. I computed it and it worked really well. The one last thing I have to ask is, why some stacked bars get pulled to 1 or 100%, whereas others are not? Image

My code for the plot_bar function is as follows:

plot_bar(ps.gen.top10, fill = "Class", "Depth") +
  geom_bar(stat="identity") +
  facet_grid(~Location) +
  # facet_grid(~Substrate) +
  coord_flip() +
  ylab("Percentage of Sequences") + ylim(0, 1) +
  geom_bar(aes(fill = Class), stat = "identity", position = "stack")

THANK YOU SOOOO MUCH FOR ALL YOUR HELP!

MSMortensen commented 5 years ago

@avonjackowski I don't know why one of the bars doesn't reach 1. I suggest that you do not use ylim and your two geom_bar arguments does not add anything as plot_bar sets those.

anabelvonjackowski commented 5 years ago

@MSMortensen Alright did that. The reason why some bars do not go all the way to 1, is because even though I have "Sample" names, I try and plot the "Location"by"Depth". More than one sample can be at a given "Location". Thus I would need a mean of the relative abundance per "Location". Would I have to insert another piece of code before plotting to achieve this?

MSMortensen commented 5 years ago

@avonjackowski No that is the way it should be done. I cannot really tell you why it does not add completely.

ChrisTrivedi commented 4 years ago

Hi,

You get diffent numbers of genera that is equal to or higher than 1% because the first method relies on median and the second uses mean.

I suggest the functions below, one uses a percentage threshold and the other a number of taxa, to do the filtering (directly on the phyloseq object):

Merge low abundance taxa

By percentage threshold

merge_low_abundance <- function(pobject, threshold=0.001){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Anything not with top x

merge_less_than_top <- function(pobject, top=10){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),] otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Hi @MSMortensen

Thank you for the nice code here. I'm attempting to use it to merge any ASVs that are <1%. My output is from DAD2 if that makes a difference.

When I try to use: ps.gen <- tax_glom(Air_snow_16S_ps_NEW, "Genus", bad_empty=c(NA, "", " ", "\t")) ps.gen.1percent <- merge_low_abundance(ps.gen)

I get the error: Error in merge_taxa.indices.internal(x, eqtaxa, archetype) : invalid archetype provided.

Google and another Phyloseq thread led me to believe this might be due to NAs in the tax_table, however, it makes it past the tax_glom script (as long as I use the bad_empty=c(NA, "", " ", "\t" flag), and when I view the tax_table I'm not seeing any NAs. Any ideas what this might be about? Unfortunately, my knowledge is not strong enough to be able to problem-solve this myself.

MSMortensen commented 4 years ago

Hi Chris, I quick guess is that the function expects 7 columns in the tax table, but as you have merged to genus level, you will only have 6. You can edit the line: tax_table(merged)[i,1:7] <- "Other" to tax_table(merged)[i,1:6] <- "Other" But, since my comments the function has been implemented in the DAtest package (https://github.com/Russel88/DAtest) in the function preDA. I think this implementation is a bit more robust and easier to work with, so my recommendation would be to use that function (it is also easier to cite that way). Best, Martin

ChrisTrivedi commented 4 years ago

Hi Chris, I quick guess is that the function expects 7 columns in the tax table, but as you have merged to genus level, you will only have 6. You can edit the line: tax_table(merged)[i,1:7] <- "Other" to tax_table(merged)[i,1:6] <- "Other" But, since my comments the function has been implemented in the DAtest package (https://github.com/Russel88/DAtest) in the function preDA. I think this implementation is a bit more robust and easier to work with, so my recommendation would be to use that function (it is also easier to cite that way). Best, Martin

@MSMortensen Thank you for the prompt response!

Good catch, but based on your comment above to another user I had already changed that:

merge_low_abundance <- function(Air_snow_16S_ps_NEW_clean, threshold=0.05){
  transformed <- transform_sample_counts(Air_snow_16S_ps_NEW_clean, function(x) x/sum(x))
  otu.table <- as.data.frame(otu_table(transformed))
  otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,])
  merged <- merge_taxa(transformed, otu.list, 1)
  for (i in 1:dim(tax_table(merged))[1]){
    if (is.na(tax_table(merged)[i,2])){
      taxa_names(merged)[i] <- "Under 1%"
      tax_table(merged)[i,1:6] <- "Under 1%"}
  }
  return(merged)
}

I didn't realize this was now implemented into DAtest. I will go ahead and give it a shot. Thanks for your help. Cheers!