open-connectome-classes / StatConn-Spring-2015-Info

introductory material
18 stars 4 forks source link

When is kmeans useful? #68

Closed kristinmg closed 9 years ago

kristinmg commented 9 years ago

Is kmeans a good choice of clustering methods for any type of models? For example, for last homework I created a 10x10 adjacency matrix containing three clusters with no connections between clusters, but kmeans still sometimes failed when I ran it multiple times.

jtmatterer commented 9 years ago

Clustering using k-means is quite useful in lots of applications, although there are other options (see chapter 14, specifically section 14.3 in Elements of Statistical Learning). The reason for the failure does not lie with k-means, but in clustering the rows(columns) of an adjacency matrix. All that a clustering algorithm does is partition the data into clusters. Whether or not these were the clusters we're interested in requires additional analysis.

We can see this in the homework example. Clustering the adjacency matrix of a graph does not produce (in general) sets of vertices, which are in some way more connected to themselves than to the rest of the graph, but rather sets of vertices that share similar connectivity properties. This is an interesting characteristic of a graph and in some cases may be what we're interested in. As an example, consider a complete bipartite graph. Performing clustering on the adjacency matrix with 2 clusters should produce as clusters the two partite sets, each of which is an independent set! So, while the clustering failed to find the type of clusters of vertices we talked about in class (like the double star graph Joshua wrote on the board yesterday), it found interesting structure nonetheless.

gkiar commented 9 years ago

One thing to add to what @kurtosis312 wrote is that kmeans partitions the data based on the optimal centroids it can find in the data. If we are trying to cluster the graph based on connectivity, then this may almost certainly be the wrong method to use. Many such connectivity based clustering algorithms also exist, however (see wiki linked @ the bottom for a description of a few)

http://en.wikipedia.org/wiki/Cluster_analysis