emankhalaf commented 6 years ago

Hello, I would like to create a 100% stacked bar plot for taxa collapsed to the genus level. I did it by using R to calculate the relative abundance at genus level, then picking up the top 20 taxa and extract genera with rel-ab >1% , then move to excel and copy these values as % and group the rest in others column. Actually, it gives me 100% bar graphs, but in publications they are using stacked bar chart generated in R. I found two solutions from github; first (#494) generated the chart but not displayed as 100% barchart. The second (#901) gave me weird graph, only one column and genera are not displayed as a fill to the column, plus when displaying the dataframe content, I found it empty, just have columns' headings??

May any please help me to solve this problem. Best

MSMortensen commented 6 years ago

Hi,

I suggest you take a look at the preDA function in the DAtest package to group the low adundant genera. After that you should be able to either use plot_bar() from phyloseq or melt your data and use ggplot2 directly.

Regards, Martin

emankhalaf commented 6 years ago

I checked the package, it is mainly for differential abundance estimation. But, my question is simply how to generate 100% stacked bars of relative abundance % at a specific taxonomic level. My trial resulted in stacked bars but not 100%, so I did it manually on excel but still not the same quality of figures as in published articles.

MSMortensen commented 6 years ago

This kind of errors are generally caused by small annoying parsing errors somewhere in the code, so if you can share your code and data I can take a look. A link to the graphs you want the output to look like would also be a help.

emankhalaf commented 6 years ago

I used two different methods with different outputs. The first one using the following code:

Transform to relative abundance

silk_workingr <- transform_sample_counts(silk_working, function(x) x/sum(x))

# agglomerate taxa
glom <- tax_glom(silk_workingr, taxrank = 'Genus')

create dataframe from phyloseq object

dat <- psmelt(glom)

convert Genus to a character vector from a factor because R

dat$Genus <- as.character(dat$Genus)

library(plyr)

group dataframe by genus, calculate median rel. abundance

medians <- ddply(dat, ~Genus, function(x) c(median=median(x$Abundance)))

find genera whose rel. abund. is less than 1%

remainder <- medians[medians$median <= 0.01,]$Genus

change their name to "Remainder"

dat[dat$Genus %in% remainder,]$Genus <- 'Remainder'

boxplot

ggplot(dat,
       aes(x=Genus,
           y=Abundance)) + geom_boxplot() + coord_flip()

library(RColorBrewer)

set color palette to accommodate the number of genera

colourCount = length(unique(dat$Genus))
getPalette = colorRampPalette(brewer.pal(9, "Set2"))

p <- ggplot(data=dat, aes(x=FieldBlockReplicate, y=Abundance, fill=Genus))
p + geom_bar(aes(), stat="identity", position="stack") + scale_fill_manual(values=getPalette(colourCount)) + theme(legend.position="right") + guides(fill=guide_legend(nrow=8))

The second method as follow:

extract the taxonomic rank e.g.Genus

silk_genus = tax_glom(silk_working, "Genus") 
tax_table(silk_genus)

calculate relative abundances

silk_gen_ab = transform_sample_counts(silk_genus, function(x) x/sum(x))
silk_gen_ab

divide by the number of samples to get general relative abundance

taxa_sums(silk_gen_ab)/42

sumf

sample_sums(silk_gen_ab)

Get the 20 most abundant taxa

top20 = names(sort(taxa_sums(silk_genus), TRUE)[1:20])

silk_genus_top20 = prune_taxa(top20, silk_genus)

I selected top20 because I know my microbiome is dominated by few taxa and the rest are rare / very low relative abundance.

Then, I moved to excel picking up genera with relative abundance > 1% and create an extra column for others where the taxa rel abundance <1% were grouped

The wired thing is that the first method resulted in 5 genera and the second method resulted in 11 taxa??? I would be thankful if you shared your e.mail with me, so I could send you the files. Mine is eimanpharmacist@gmail.com Thank you very much Eman

MSMortensen commented 6 years ago

Hi,

You get diffent numbers of genera that is equal to or higher than 1% because the first method relies on median and the second uses mean.

I suggest the functions below, one uses a percentage threshold and the other a number of taxa, to do the filtering (directly on the phyloseq object):

Merge low abundance taxa

By percentage threshold

merge_low_abundance <- function(pobject, threshold=0.001){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Anything not with top x

merge_less_than_top <- function(pobject, top=10){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),] otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

anabelvonjackowski commented 5 years ago

I am having the same issue as @emankhalaf. @MSMortensen how would I use your code and implement it into the plot_bar function?

MSMortensen commented 5 years ago

@avonjackowski The output of both the functions I have shown are a phyloseq object. Using the plot_bar function directly on that output should work without any problem.

It should look something like this:

ps.gen <- tax_glom(psdata, "Genus") ps.gen.top10 <- merge_less_than_top(ps.gen, top=10)

plot_bar(ps.gen.top.10, fill = "Genus")

anabelvonjackowski commented 5 years ago

@MSMortensen ok that makes a lot of sense - thank you!

However if I try to run:

merge_less_than_top <- function(all_data, top=10){
  transformed <- transform_sample_counts(all_data, function(x) x/sum(x))
  otu.table <- as.data.frame(otu_table(transformed))
  otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),]
  otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),])
  merged <- merge_taxa(transformed, otu.list, 1)
  for (i in 1:dim(tax_table(merged))[1]){
    if (is.na(tax_table(merged)[i,2])){
      taxa_names(merged)[i] <- "Other"
      tax_table(merged)[i,1:7] <- "Other"}
  }
  return(merged)
}

ps.gen <- tax_glom(all_data, "Class")
ps.gen.top10 <- merge_less_than_top(ps.gen, top=10)
plot_bar(ps.gen.top10, fill = "Class")

I get the error: Error in [<-(*tmp*, i, 1:7, value = "Other") : subscript out of bounds. What am I missing now?

MSMortensen commented 5 years ago

@avonjackowski How many taxonomical levels do you have in your tax_table? This seems to be because you do not have less than 7 levels in your tax table. You should edit the number of columns in this line: tax_table(merged)[i,1:7] <- "Other"}

anabelvonjackowski commented 5 years ago

@MSMortensen ok frustration makes one overlook things like that :( . I have 6 levels. I computed it and it worked really well. The one last thing I have to ask is, why some stacked bars get pulled to 1 or 100%, whereas others are not?

My code for the plot_bar function is as follows:

plot_bar(ps.gen.top10, fill = "Class", "Depth") +
  geom_bar(stat="identity") +
  facet_grid(~Location) +
  # facet_grid(~Substrate) +
  coord_flip() +
  ylab("Percentage of Sequences") + ylim(0, 1) +
  geom_bar(aes(fill = Class), stat = "identity", position = "stack")

THANK YOU SOOOO MUCH FOR ALL YOUR HELP!

MSMortensen commented 5 years ago

@avonjackowski I don't know why one of the bars doesn't reach 1. I suggest that you do not use ylim and your two geom_bar arguments does not add anything as plot_bar sets those.

anabelvonjackowski commented 5 years ago

@MSMortensen Alright did that. The reason why some bars do not go all the way to 1, is because even though I have "Sample" names, I try and plot the "Location"by"Depth". More than one sample can be at a given "Location". Thus I would need a mean of the relative abundance per "Location". Would I have to insert another piece of code before plotting to achieve this?

MSMortensen commented 5 years ago

@avonjackowski No that is the way it should be done. I cannot really tell you why it does not add completely.

ChrisTrivedi commented 4 years ago

Hi,

You get diffent numbers of genera that is equal to or higher than 1% because the first method relies on median and the second uses mean.

I suggest the functions below, one uses a percentage threshold and the other a number of taxa, to do the filtering (directly on the phyloseq object):

Merge low abundance taxa

By percentage threshold

merge_low_abundance <- function(pobject, threshold=0.001){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Anything not with top x

merge_less_than_top <- function(pobject, top=10){ transformed <- transform_sample_counts(pobject, function(x) x/sum(x)) otu.table <- as.data.frame(otu_table(transformed)) otu.sort <- otu.table[order(rowMeans(otu.table), decreasing = TRUE),] otu.list <- row.names(otu.sort[(top+1):nrow(otu.sort),]) merged <- merge_taxa(transformed, otu.list, 1) for (i in 1:dim(tax_table(merged))[1]){ if (is.na(tax_table(merged)[i,2])){ taxa_names(merged)[i] <- "Other" tax_table(merged)[i,1:7] <- "Other"} } return(merged) }

Hi @MSMortensen

Thank you for the nice code here. I'm attempting to use it to merge any ASVs that are <1%. My output is from DAD2 if that makes a difference.

When I try to use: ps.gen <- tax_glom(Air_snow_16S_ps_NEW, "Genus", bad_empty=c(NA, "", " ", "\t")) ps.gen.1percent <- merge_low_abundance(ps.gen)

I get the error: Error in merge_taxa.indices.internal(x, eqtaxa, archetype) : invalid archetype provided.

Google and another Phyloseq thread led me to believe this might be due to NAs in the tax_table, however, it makes it past the tax_glom script (as long as I use the bad_empty=c(NA, "", " ", "\t" flag), and when I view the tax_table I'm not seeing any NAs. Any ideas what this might be about? Unfortunately, my knowledge is not strong enough to be able to problem-solve this myself.

MSMortensen commented 4 years ago

Hi Chris, I quick guess is that the function expects 7 columns in the tax table, but as you have merged to genus level, you will only have 6. You can edit the line: tax_table(merged)[i,1:7] <- "Other" to tax_table(merged)[i,1:6] <- "Other" But, since my comments the function has been implemented in the DAtest package (https://github.com/Russel88/DAtest) in the function preDA. I think this implementation is a bit more robust and easier to work with, so my recommendation would be to use that function (it is also easier to cite that way). Best, Martin

ChrisTrivedi commented 4 years ago

Hi Chris, I quick guess is that the function expects 7 columns in the tax table, but as you have merged to genus level, you will only have 6. You can edit the line: tax_table(merged)[i,1:7] <- "Other" to tax_table(merged)[i,1:6] <- "Other" But, since my comments the function has been implemented in the DAtest package (https://github.com/Russel88/DAtest) in the function preDA. I think this implementation is a bit more robust and easier to work with, so my recommendation would be to use that function (it is also easier to cite that way). Best, Martin

@MSMortensen Thank you for the prompt response!

Good catch, but based on your comment above to another user I had already changed that:

merge_low_abundance <- function(Air_snow_16S_ps_NEW_clean, threshold=0.05){
  transformed <- transform_sample_counts(Air_snow_16S_ps_NEW_clean, function(x) x/sum(x))
  otu.table <- as.data.frame(otu_table(transformed))
  otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,])
  merged <- merge_taxa(transformed, otu.list, 1)
  for (i in 1:dim(tax_table(merged))[1]){
    if (is.na(tax_table(merged)[i,2])){
      taxa_names(merged)[i] <- "Under 1%"
      tax_table(merged)[i,1:6] <- "Under 1%"}
  }
  return(merged)
}

I didn't realize this was now implemented into DAtest. I will go ahead and give it a shot. Thanks for your help. Cheers!

joey711 / phyloseq

Can not create100% stacked bar plots with less abundant taxa #1004

Transform to relative abundance

create dataframe from phyloseq object

convert Genus to a character vector from a factor because R

group dataframe by genus, calculate median rel. abundance

find genera whose rel. abund. is less than 1%

change their name to "Remainder"

boxplot

set color palette to accommodate the number of genera

extract the taxonomic rank e.g.Genus

calculate relative abundances

divide by the number of samples to get general relative abundance

sumf

Get the 20 most abundant taxa

Merge low abundance taxa

By percentage threshold

Anything not with top x

Merge low abundance taxa

By percentage threshold

Anything not with top x