Closed joey711 closed 11 years ago
This was cleared by the author, but, there are several alternative implementations already available in common R packages, most notably the clusGap function in the cluster package. cluster
has many reverse dependencies, and has Martin Maechler as its lead author, so is probably reliable.
Need to test that it works well for this purpose, can be adapted to make the nice ggplot2 output like with Edwin Chen's implementation.
If so, make a nice wrapper for it to use with phyloseq data.
This is straight-forward enough it makes sense just to show some example code. Maybe even adapt the ggplot2 plotting used by Edwin Chen. Post the example on the wiki and in a vignette, and consider it done. Unclear why a formal wrapper is necessary.
I will try to work this into a fancier Rmd/html example and post the link. The following code gives you plots that are useful and shows a relatively straightforward result when two clusters is a pretty good answer.
library("phyloseq")
# Load data
data(enterotype)
# ordination
ent.dca <- ordinate(enterotype)
# Plot
plot_ordination(enterotype, ent.dca, color="Enterotype")
plot_ordination(enterotype, ent.dca, color="SeqTech")
# Gap Statistic: How many clusters are there?
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
x <- scores(ent.dca)
# gskmn <- clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn <- clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 500)
plot(gskmn, main = "Gap statistic for the 'Enterotypes' data")
mtext("k = 2 is best ... but k = 3 pretty close")
The clusGap
function does a pretty good job implementing this. Unclear why a formal wrapper is needed, especially considering that you already support the scores
function, extended from vegan-package, to expose the coordinates of an ordination.
Actually, the scores
extensions are hidden as internal functions to avoid conflicts with vegan, now and in the future. You could name a different function that wraps scores
, but is a convoluted answer. Alternatively, could include a simple wrapper of the above code (already written). Although, there are many other reasons a user might want the coordinates. Maybe at least show an example in which plot_ordination
returns the plot object, say p1
, and show how to get the coordinates (plus covariates) from p1$data
.
An updated version of this gap statistic example is shown on the gap statistic tutorial page and in the more elaborate phyloseq demo.
Will add another dependency to phyloseq to include it, but as stated earlier, the "cluster" package is a good one.
Closed by pull 217
Should optionally create a plot summarizing the results.
This will be borrowed
An implementation of the gap statistic algorithm from Tibshirani, Walther, and Hastie's "Estimating the number of clusters in a data set via the gap statistic". A description of the algorithm can be found here
I wrote to the author of the version on GitHub that I found (and adapted for a private example), in the form of a license issue on his (Edwin Chen's) gap statistic development page.
As soon as this is clear, it should be possible to write an adaptation for ordination results, and formally assessing numbers of clusters.