open-connectome-classes / StatConn-Spring-2015-Info

introductory material

Number of Clusters #74

Open rgrohit opened 9 years ago

rgrohit commented 9 years ago

We started our algorithm with the number of clusters as an input. Are there methods/algorithms to figure out how many clusters are in a graph?

kristinmg commented 9 years ago

I think hierarchical clustering can be useful if the number of clusters is not known. At each step the clustering algorithm selects the clusters to merge or split by optimizing a certain criterion on the data set. A stopping condition can be imposed on the algorithm to select the best clustering with respect to some quality measure on the current cluster set.

This article describes various methods for clustering graphs, if you are interested: http://dollar.biz.uiowa.edu/~street/graphClustering.pdf
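To make the stopping-condition idea concrete, here is a minimal sketch of bottom-up (agglomerative) clustering that stops merging once the two closest clusters are farther apart than a threshold. The function name `single_linkage`, the `max_merge_distance` parameter, and the toy points are assumptions made for illustration. It works on points in the plane rather than on a graph, and single linkage is only one of many possible merge criteria.

```python
import math

def single_linkage(points, max_merge_distance):
    """Merge the two closest clusters until they are too far apart.

    The number of clusters is not an input: it falls out of the
    stopping condition (max_merge_distance, a hypothetical parameter).
    """
    clusters = [[p] for p in points]  # start with one cluster per point

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_merge_distance:
            break  # stopping condition: remaining clusters are well separated
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
clusters = single_linkage(points, max_merge_distance=1.0)
print(len(clusters))  # → 3
```

The threshold takes the place of K here, so the difficulty does not go away: a quality measure or a look at the merge distances is still needed to choose it.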

mblohr commented 9 years ago

Here is an implementation of one approach to estimating the K in K-means for Gaussian-distributed data: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/
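For readers who do not want to follow the link, the sketch below shows one well-known estimator of this kind, the gap statistic. I am not claiming it reproduces the linked post's code. All function names, the farthest-first initialisation, the toy data, and the parameter values are assumptions chosen to keep the example short and free of dependencies.

```python
import math
import random

def kmeans_dispersion(points, k, rng, iters=50):
    """Run Lloyd's algorithm; return the within-cluster sum of squared distances."""
    # Farthest-first initialisation: a simplification assumed for this sketch.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(p)
        new_centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

def gap_statistic(points, max_k, rng, n_refs=10):
    """Return (chosen k, list of gap values for k = 1..max_k)."""
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    gaps, errs = [], []
    for k in range(1, max_k + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        # Reference data: uniform samples over the bounding box of the data.
        ref_logs = []
        for _ in range(n_refs):
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Choose the smallest k with gap(k) >= gap(k+1) - s(k+1).
    for k in range(1, max_k):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return max_k, gaps

# Toy data: three Gaussian blobs (centres and spread are made up).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (0.0, 5.0), (5.0, -3.0)]
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in blob_centers for _ in range(15)]
chosen_k, gaps = gap_statistic(points, max_k=5, rng=rng)
print(chosen_k, [round(g, 2) for g in gaps])
```

The gap compares how much the within-cluster dispersion drops on the real data against how much it drops on structureless reference data. The selection rule can be sensitive to the reference distribution and to the quality of the K-means fits, so it is worth looking at the whole gap curve rather than only the chosen value.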

maxcollard commented 9 years ago

One method that a member of my lab group used is, in essence, to check how consistent the cluster labels are across multiple random subsamples of the data. Intuitively, there is a tradeoff between how closely your model fits the data (the fit always improves as you increase K) and how consistent your fit is when you use only a subset of your data (as K increases, the cluster labels and centroid locations can differ wildly between subsamples). By appropriately weighting these two considerations, you can arrive at a K that is optimal in some sense.
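A rough sketch of the consistency half of this idea, under my own assumptions rather than as the lab's actual procedure: cluster two random subsamples, then measure how often pairs of shared points are grouped the same way in both (the Rand index). The function names, the 80% subsample fraction, the farthest-first initialisation, and the toy data are all hypothetical choices.

```python
import math
import random
from itertools import combinations

def kmeans_labels(points, k, rng, iters=50):
    """Lloyd's algorithm with farthest-first initialisation; returns a label per point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def stability(points, k, rng, n_pairs=10, frac=0.8):
    """Average agreement between clusterings of two overlapping random subsamples."""
    n, m = len(points), int(frac * len(points))
    scores = []
    for _ in range(n_pairs):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans_labels([points[i] for i in idx_a], k, rng)))
        lab_b = dict(zip(idx_b, kmeans_labels([points[i] for i in idx_b], k, rng)))
        shared = sorted(set(idx_a) & set(idx_b))  # points present in both subsamples
        scores.append(rand_index([lab_a[i] for i in shared],
                                 [lab_b[i] for i in shared]))
    return sum(scores) / len(scores)

# Toy data: three well-separated Gaussian blobs (made-up centres and spread).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (0.0, 5.0), (5.0, -3.0)]
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in blob_centers for _ in range(15)]
scores = {k: stability(points, k, rng) for k in range(2, 7)}
for k, s in scores.items():
    print(k, round(s, 3))
```

The scan starts at K = 2 because K = 1 is trivially perfectly stable. That is exactly why stability has to be weighed against goodness of fit rather than maximized on its own.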

ajulian3 commented 9 years ago

More specifically, what about using the KNN algorithm to mathematically optimize where the clusters are? I remember learning something in Data Mining about setting the number of clusters (k=10) to get the best performance from the algorithm. Does anyone have further information about this?

jovo commented 9 years ago

@greg - remind me to mention in class various approaches and caveats to choosing K.

On Thu, Feb 12, 2015 at 1:28 PM, ajulian3 notifications@github.com wrote:

More specifically, what about using the KNN algorithm to mathematically optimize where the clusters are? I remember learning something in Data Mining about setting the number of clusters (k=10) to get the best performance from the algorithm. Does anyone have further information about this?
