joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Abundance-based heatmaps using ggplot2 #93

Closed joey711 closed 12 years ago

joey711 commented 12 years ago

This issue was inspired by a discussion with Kyle Bittinger (https://github.com/kylebittinger) about some of the additional things users might want to be able to do while exploring their data.

Meaningful, appropriate noise reduction and normalization are the key issues here. It would be useful to think of some flexible options.

I think the following would work: http://had.co.nz/ggplot2/stat_bin2d.html

However, our case isn't even a binning problem; it is just mapping position to a continuous color scale based on a transformed abundance value.
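A minimal sketch of that idea, assuming a small made-up count matrix (taxa in rows, samples in columns) rather than real phyloseq data. The object names and the log(1 + x) transform are illustrative choices only, not anything decided in this issue:

```r
library(ggplot2)

# Hypothetical toy abundance table: 5 taxa x 4 samples (not real data).
set.seed(1)
counts <- matrix(rpois(20, lambda = 50), nrow = 5,
                 dimnames = list(paste0("OTU", 1:5), paste0("S", 1:4)))

# Convert the wide matrix to long format: one row per (taxon, sample) pair.
long <- as.data.frame(as.table(counts), responseName = "abundance")
names(long)[1:2] <- c("taxon", "sample")

# Transform abundance before mapping it to colour (assumed choice: log1p).
long$log_abundance <- log1p(long$abundance)

# Each tile's position comes from (sample, taxon); its fill from the
# transformed abundance on a continuous colour scale.
p <- ggplot(long, aes(x = sample, y = taxon, fill = log_abundance)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
print(p)
```

No binning is involved: every (sample, taxon) cell already has exactly one value, so geom_tile is enough.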

joey711 commented 12 years ago

This might be a more direct, efficient approach...

http://had.co.nz/ggplot2/geom_rect.html

joey711 commented 12 years ago

Here's a cool direct example using NBA players data:

http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/

Data Import

FlowingData used last season’s NBA basketball statistics provided by databasebasketball.com, and the csv-file with the data can be downloaded directly from its website.

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")

The players are ordered by points scored, and the Name variable converted to a factor that ensures proper sorting of the plot.

nba$Name <- with(nba, reorder(Name, PTS))

Whilst FlowingData uses the heatmap function in the stats package, which requires the plotted values to be in matrix format, ggplot2 operates on data frames. For ease of processing, the data frame is converted from wide format to long format.

The game statistics have very different ranges, so to make them comparable all the individual statistics are rescaled.

library(ggplot2)
library(reshape2)  # melt()
library(plyr)      # ddply()
library(scales)    # rescale()
nba.m <- melt(nba, id.vars = "Name")
nba.m <- ddply(nba.m, .(variable), transform, rescale = rescale(value))

(p <- ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale),
  colour = "white") + scale_fill_gradient(low = "white",  high = "steelblue"))

base_size <- 9
p + theme_grey(base_size = base_size) + labs(x = "",
    y = "") + scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
    axis.ticks = element_blank(), axis.text.x = element_text(size = base_size *
        0.8, angle = 330, hjust = 0, colour = "grey50"))

Rescaling Update

In preparing the data for the above plot all the variables were rescaled so that they were between 0 and 1.

Jim rightly pointed out in the comments (and I did not initially get it) that the heatmap function uses a different scaling method and therefore the plots are not identical. Below is an updated version of the heatmap which looks much more similar to the original.

nba.s <- ddply(nba.m, .(variable), transform,
    rescale = scale(value))
last_plot() %+% nba.s
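
To make the difference between the two scaling methods concrete, here is a small base-R sketch. The five values are made up for illustration and are not taken from the NBA data:

```r
# Hypothetical values for one statistic across five players.
value <- c(10, 20, 30, 40, 50)

# Min-max rescaling to the interval [0, 1], as used for the first plot.
minmax <- (value - min(value)) / (max(value) - min(value))

# scale() centers on the mean and divides by the standard deviation,
# so the result has mean 0 and standard deviation 1.
zscore <- as.numeric(scale(value))

print(minmax)
print(round(zscore, 3))
```

The min-max version is bounded by 0 and 1, whereas the scale() version is centered on zero and unbounded, so the same colour gradient spreads differently over the two.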

joey711 commented 12 years ago

Also, see otu_heatmap from

https://github.com/kylebittinger/qiimer/blob/master/R/otu_table.R

joey711 commented 12 years ago

Hey, instead of re-inventing the wheel, I just learned of the following package:

http://cran.r-project.org/web/packages/NeatMap/index.html

Perhaps we should find ways to adapt phyloseq to this...

http://www.biomedcentral.com/1471-2105/11/45

joey711 commented 12 years ago

This plot function has been added in version 1.1.10, via the following commit:

8dbc9f18c70606bebde4f75b46fb060629961868

This is now closed. An additional issue should be created to suggest alternative functions, updates, or bugfixes.

Also, see the wiki (in progress) describing the function: https://github.com/joey711/phyloseq/wiki/plot_heatmap
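
A hedged usage sketch of plot_heatmap, assuming a current phyloseq release (the arguments may differ from version 1.1.10) and the GlobalPatterns example dataset that ships with the package. The cutoff of 50 taxa and the choice of NMDS with Bray-Curtis distance are arbitrary choices for illustration:

```r
library(phyloseq)

# GlobalPatterns is an example dataset shipped with phyloseq.
data("GlobalPatterns")

# Keep only the 50 most abundant taxa so the heatmap stays readable
# (the cutoff of 50 is an arbitrary choice for this sketch).
top50 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:50]
gpt <- prune_taxa(top50, GlobalPatterns)

# Axes are ordered by an ordination; NMDS on Bray-Curtis is assumed here.
p <- plot_heatmap(gpt, method = "NMDS", distance = "bray",
                  sample.label = "SampleType")
print(p)
```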