joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

compute gap statistic on (user-specified) ordination axes #99

Closed joey711 closed 11 years ago

joey711 commented 12 years ago

Should optionally create a plot summarizing the results.

This will be borrowed

An implementation of the gap statistic algorithm from Tibshirani, Walther, and Hastie's "Estimating the number of clusters in a data set via the gap statistic". A description of the algorithm can be found here

I wrote to the author of the version on GitHub that I found (and adapted for a private example), in the form of a license issue on his (Edwin Chen's) gap statistic development page.

As soon as this is clear, it should be possible to write an adaptation for ordination results, and formally assessing numbers of clusters.

joey711 commented 12 years ago

This was cleared by the author, but, there are several alternative implementations already available in common R packages, most notably the clusGap function in the cluster package. cluster has many reverse dependencies, and has Martin Maechler as its lead author, so is probably reliable.

Need to test that it works well for this purpose, can be adapted to make the nice ggplot2 output like with Edwin Chen's implementation.

If so, make a nice wrapper for it to use with phyloseq data.

joey711 commented 12 years ago

(footnote: Might have also considered lga::gap, but the lga package is officially listed as ORPHANED on CRAN at the moment. Not a good idea to include it as a dependency...

joey711 commented 12 years ago

This is straight-forward enough it makes sense just to show some example code. Maybe even adapt the ggplot2 plotting used by Edwin Chen. Post the example on the wiki and in a vignette, and consider it done. Unclear why a formal wrapper is necessary.

joey711 commented 12 years ago

Here is an example:

I will try to work this into a fancier Rmd/html example and post the link. The following code gives you plots that are useful and shows a relatively straightforward result when two clusters is a pretty good answer.

library("phyloseq")
# Load data
data(enterotype)
# ordination
ent.dca <- ordinate(enterotype)
# Plot
plot_ordination(enterotype, ent.dca, color="Enterotype")
plot_ordination(enterotype, ent.dca, color="SeqTech")

# Gap Statistic: How many clusters are there?
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
x <- scores(ent.dca)
# gskmn <- clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn <- clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 500)
plot(gskmn, main = "Gap statistic for the 'Enterotypes' data")
mtext("k = 2 is best ... but  k = 3  pretty close")

The clusGap function does a pretty good job implementing this. Unclear why a formal wrapper is needed, especially considering that you already support the scores function, extended from vegan-package, to expose the coordinates of an ordination.

joey711 commented 12 years ago

Actually, the scores extensions are hidden as internal functions to avoid conflicts with vegan, now and in the future. You could name a different function that wraps scores, but is a convoluted answer. Alternatively, could include a simple wrapper of the above code (already written). Although, there are many other reasons a user might want the coordinates. Maybe at least show an example in which plot_ordination returns the plot object, say p1, and show how to get the coordinates (plus covariates) from p1$data.

joey711 commented 11 years ago

An updated version of this gap statistic example is shown on the gap statistic tutorial page and in the more elaborate phyloseq demo.

Will add another dependency to phyloseq to include it, but as stated earlier, the "cluster" package is a good one.

joey711 commented 11 years ago

Closed by pull 217