Genus labels incorrection with plot_heatmap (ggplot2 bug)

joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:

http://joey711.github.io/phyloseq/

584 stars 187 forks source link

Genus labels incorrection with plot_heatmap (ggplot2 bug) #192

Closed dragonflywingz closed 11 years ago

dragonflywingz commented 11 years ago

Hello, I noticed when I add genus labels to my heat map they are labelled incorrectly. I do have some OTUs where the genus is "NA". When I label the heat map with Orders it is fine. I am guessing because and all of my OTUs are assigned at the order level.

It would be best if the unknown Genus names could be labelled "NA"

Here is my code:

test <- plot_heatmap (denoised_fil, sample.label = "Sample", species.label="Genus")
test + theme(axis.text.x = element_text(size=11, angle=90, hjust=1), axis.text.y=element_text(size=11))

I would like to be able to control the order of the samples and taxa if possible. Do you know of a way to do this?

Thanks, Amber

joey711 commented 11 years ago

Amber, can you provide data, or post code reproducing this problem with included example data? For instance, the GlobalPatterns dataset definitely has some missing genera classifications (NA), and so you should be able to reproduce the problem you are describing with that dataset without sending/posting the data you had stored as denoised_fil in your example code above.

Without the complete code it will be more difficult for me to reproduce and then fix your issue.

Thanks for the feedback! Alway appreciated!

joey

joey711 commented 11 years ago

Second point,

If you read the documentation for plot_heatmap it explains how the dimensions of the heatmap are organized. They are arranged according to the results of an ordination, which you choose, but you appear to have used the defaults, which were nonmetric multidimensional scaling of the bray-curtis distance. For a good explanation about this, see the article describing the "NeatMap" package for R. I borrowed/adapted their approach for this function.

The main point to remember is that color is a difficult aesthetic for the human brain to interpret on a continuous scale, especially for large complicated datasets; and because of this (and also the sheer size/complexity of many of the datasets where this is used) a heatmap is only really useful when the dimensions are organized according to natural clustering in the data. That is, leveraging structure that is already in your data. The extra step by plot_heatmap and the NeatMap package is to use ordination methods to do this organizing instead of the traditional approach, which is hierarchical clustering. In some cases hclust still works well, but it suffers from a few drawbacks that are explained well in the NeatMap article.

The point being, most "manually" organized dimensions for a heatmap will make the data it represents less-interpretable. With that in mind, if you still really want to re-organize the dimensions according to some variable, I do have some example code laying around that someone else asked for a while back and I'd be happy to dig it up and post it here.

Hope that helps! and let me know if you want to see the example. It takes part of the output from plot_heatmap and then uses ggplot2 commands to remake/reorganize the panel graphic according to other variables in the data.

joey

dragonflywingz commented 11 years ago

Hi Joey,

Ok I put together a reproducible example with G.P. dataset.

library("phyloseq")
data("GlobalPatterns")
gpac <-subset_taxa(GlobalPatterns, Phylum == "Crenarchaeota")
gpac50 <-names(sort(taxa_sums(gpac),TRUE)[1:50])
gpac50<-prune_taxa(gpac50, gpac)
ntaxa(gpac50)
gpac50_hm<-plot_heatmap(gpac50)
gpac50_hm+ theme(axis.text.x = element_text(size=11, angle=90, hjust=1), axis.text.y=element_text(size=11))
tax_table(gpac50)
#for example 245697 has no assigned genus - 16th from the bottom

gpac50_hm1<-plot_heatmap(gpac50, taxa.label = "Genus")
gpac50_hm1+ theme(axis.text.x = element_text(size=11, angle=90, hjust=1), axis.text.y=element_text(size=11))

#now 245697 it is labelled Candidatus Nitrososphaera

gpac50_hm2 <-plot_heatmap (gpac50, "MDS","unifrac","SampleType",low = "#000033", high = "#FF3300")
gpac50_hm2 + theme(axis.text.x = element_text(size=11, angle=90, hjust=1), axis.text.y=element_text(size=11))

#245697 is at the very bottom

gpac50_hm3 <-plot_heatmap (gpac50, "MDS","unifrac","SampleType","Genus",low = "#000033", high = "#FF3300")
gpac50_hm3 + theme(axis.text.x = element_text(size=11, angle=90, hjust=1), axis.text.y=element_text(size=11))

#labelled Candidatus Nitrososphaera

With respect to your second note, I did not realize plot_heatmap applies default ordinations when there are no specified parameters. I wanted to explore my data in a heat map without applying any ordination at first. I would like the taxa labelled because it more informative than OTU numbers. I don't know how to label the OTUs with the vanilla R heatmap. I am hoping to present my data using the Neatmap method, and I agree with you that when organized based on an ordination it is much easier to visualize trends in the data.

It would be nice to label the taxa on the lowest available rank, but for now I will stick to "Order", since that is the lowest level available to all my core microbiome OTUs.

Thanks for the quick response! Amber

joey711 commented 11 years ago

Amber,

Thanks for the valuable feedback. This was not always the behavior of plot_heatmap, which is part of the reason I asked for a reproducible example. It looks ggplot2 has changed to recycle the labels when they are NA. It used to leave them blank, which is what I would have expected/wanted, and what plot_heatmap used to do.

Here is a "guts open" example of making a heatmap style tile plot. This should give you enough to get started with your own. The trick is the psmelt function, which I recently opened up to users just for these sorts of DIY plotting tasks:

library("phyloseq")
library("ggplot2")
data("GlobalPatterns")
gpac <-subset_taxa(GlobalPatterns, Phylum == "Crenarchaeota")
gpac50 <-names(sort(taxa_sums(gpac),TRUE)[1:50])
gpac50<-prune_taxa(gpac50, gpac)
gpm = psmelt(gpac50)
gpm$OTU = as(gpm$OTU, "character")
p = ggplot(gpm, aes(Sample, OTU, fill=Abundance)) + geom_tile()
p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0))
# Define a vector of y labels
ylabvec = as(tax_table(gpac50)[, "Genus"], "character")
names(ylabvec) <- taxa_names(gpac50) 
p = p + scale_y_discrete(labels=ylabvec)
p

Let me know if you have any questions. You can see where the error is, and you can replace the missing value NA with a character "NA" to force the labels to be real. This example is just demonstrating the recycling bug in ggplot2.

Thanks! I will leave this issue open until I've fixed it. I needed to migrate plot_heatmap to use psmelt anyway.

joey

joey711 commented 11 years ago

Amber,

Sorry I didn't throw this in from the beginning. Here is the "fixed" version, an example of how I will implement it in the code. All NA values for axis labels will be replaced with "", which is a valid but empty character element. Graphically all those NA values will turn to blanks, but no strange re-ordering/recycling will occur. Note that you can run this example with any of your own data, naming your data physeq. You can also redefine the labels for ylabvec however you want. You can re-order the heatmap axes by re-ordering the data.frame, in this case called gpm. This is a simplified version of how I will implement the revised plot_heatmap function in the package. The other advantage is that you will be able to extract and manipulate the output's data to make other ggplot graphics if you want, for instance in the example below, take a look at p$data.

#
# load packages and data
#
library("phyloseq")
library("ggplot2")
data("GlobalPatterns")
gpac <-subset_taxa(GlobalPatterns, Phylum == "Crenarchaeota")
gpac50 <-names(sort(taxa_sums(gpac),TRUE)[1:50])
gpac50<-prune_taxa(gpac50, gpac)
physeq = gpac50
#
# Solution: make a properly labeled heatmap
# the order is arbitrary, and you can change it
#
# Melt approach.
gpm = psmelt(physeq)
gpm$OTU = as(gpm$OTU, "character")
p = ggplot(gpm, aes(Sample, OTU, fill=Abundance)) + geom_tile()
p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0))
# Define a vector of y labels
ylabvec = as(tax_table(physeq)[, "Genus"], "character")
names(ylabvec) <- taxa_names(physeq) 
# Replace NA values with ""
ylabvec[is.na(ylabvec)] <- ""
# Replace the labels on the plot
p = p + scale_y_discrete(labels=ylabvec)
p

dragonflywingz commented 11 years ago

Joey,

Thanks for the code! This is so helpful very helpful :)

Amber

joey711 commented 11 years ago

Sure thing! I'm working on the revised version of plot_heatmap for the package right now. Since this is a bug I will attempt to push the fix into the release and devel versions on Bioconductor as well.

I'll keep this issue open until I've merged a fix, and I'll mention it here, too.

Thanks for the feedback, Amber. I'm sure others have run into this problem as well.

joey

dragonflywingz commented 11 years ago

Hello again, Just another quick question with respect to plot_heatmap: How is abundance calculated for plot_heatmap? When I create the psmelt heatmap (above method) the abundance levels are much different from the plot_heatmap. I need to be able to report actual abundances or the formula used to calculated the 'relative abundance' on the plot_heatmap graphic. Amber

joey711 commented 11 years ago

The color scale is shown on the plots and should be accurate. Let me know if you think that it isn't. If you want to actually report abundance values, you should not use a heatmap. Instead, plot_bar is a very good and flexible tool for this purpose. Heatmaps are a semi-quantitative tool (let's be honest). And here's why: the abundance values must be mapped to this strange, human-interpreted color space that we're calling our "color scale". It's impossible to "see" the values precisely. As in, possible for the computer to calculate, but literally impossible for viewers of your graphic to interpret precisely. The struggle then is to select a scale/transformation that emphasizes the patterns in the data that you are attempting to convey.

The default color scale is using what you can think of as a logarithmic transformation of the abundance values. This is actually explained in the documentation for plot_heatmap. See the argument called trans. The default argument to trans is log_trans(4). The log_trans function is from the scales package, so see ?scales::log_trans for more details. You can provide other transformations from the scales package, but in all cases these should be changing the way the color scale is setup, but not changing the values themselves... For instance, an abundance value of 0 should still look like whatever color is at the bottom of the scale, and an abundance value of 4000 should look like wherever that would be on the scale, keeping in mind that it might be a log-ish scale. You can play around with the scales and determine which one you think best represents the patterns in your data. Note that extreme (or just bad) choices for transformation can hide or misrepresent patterns in the data, effectively a kind of bias. Avoid those choices if you're worried about the apparent accuracy, but keep in mind that the colors in the generic/vanilla psmelt heatmap example above will not and should not be the same as the plot_heatmap default, even with the same low and high values, because the abundances are not being mapped to the color scale in the same way.

Does that make sense?

Thanks for the feedback. It is giving me more ideas for documentation to add in the phyloseq demo and the phyloseq tutorials. I can see why some of this would not be obvious, or even get confusing.

~Joey

joey711 commented 11 years ago

The update has been merged into the master. This issue is now fixed in the github-devel version of phyloseq.

montoyah commented 8 years ago

Hi Joey,

I've a similar problem but with bar_plot. The unresolved genera are reported as NA, and they're not showing up in the legend but the colored bars are in the plot. Was the solution for the heatmap issue reported in this thread also implemented for bar_plots?. If it was, how can I use it to convert the NAs into character so I can replace them with scale_fill_manual(), for example?

Thanks in advance.

NeginValizadegan commented 5 years ago

Hi all,

I made a heatmap based on phylum on the Y axis and species on the X axis. However, my taxa labels are so long (for some reason) that the labels are not visible at all.