open-connectome-classes / StatConn-Spring-2015-Info

introductory material

final number of clusters unknown? #93

Open edunnwe1 opened 9 years ago

edunnwe1 commented 9 years ago

For k-means, you have to specify the number of clusters before running the clustering algorithm, which can be a limitation if you are not sure a priori how many clusters there should be. There are clustering algorithms that don't rely on this (e.g. superparamagnetic clustering), but it seems to me that the statistical decision-theoretic framework relies on knowing the number of clusters beforehand, because otherwise I don't know how you'd define your action space. Is there any way to work around this limitation within the statistical decision-theoretic framework, supposing we're applying it to a clustering problem?

SandyaS72 commented 9 years ago

This might not be the kind of answer you were looking for, but for k-means at least, I've done it where you run the algorithm several times with different numbers of clusters and measure some metric each time, such as the ratio of the (within-cluster sum of squares) to the (between-cluster sum of squares). As you increase the number of clusters, usually one of two things happens:

1) The metric becomes more and more "optimal" (defined differently depending on what metric you use; for the example I gave, you want it to be smaller) up to a point, after which it actually starts getting worse. You want to find the number of clusters that makes it most optimal.

2) Increasing the number of clusters continually makes the metric more optimal, but by less and less each time. You probably want to set some threshold: if the metric doesn't change by that much when adding a cluster, you stop.

Short answer: basically, you can infer the number of clusters as well if you want (a rough sketch of the procedure is below).
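A minimal sketch of that elbow-style procedure, assuming scikit-learn's KMeans, the within/between sum-of-squares ratio as the metric, and placeholder data X (the range of k and the tolerance are also just illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def wss_bss_ratio(X, labels, centers):
    """Within-cluster SS divided by between-cluster SS (smaller is better)."""
    overall_mean = X.mean(axis=0)
    wss = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(len(centers)))
    bss = sum((labels == k).sum() * ((centers[k] - overall_mean) ** 2).sum()
              for k in range(len(centers)))
    return wss / bss

X = np.random.rand(200, 5)                      # stand-in for real data
ks = list(range(2, 11))
ratios = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    ratios.append(wss_bss_ratio(X, km.labels_, km.cluster_centers_))

# Case (1): the ratio bottoms out and then worsens -- take the minimum.
k_case1 = ks[int(np.argmin(ratios))]

# Case (2): the ratio keeps improving but by less and less -- stop once the
# improvement from adding one more cluster falls below a threshold.
tol = 0.05
k_case2 = ks[-1]
for i in range(1, len(ratios)):
    if ratios[i - 1] - ratios[i] < tol:
        k_case2 = ks[i - 1]
        break
```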

maxcollard commented 9 years ago

Mathematically, there are a few ways to manipulate the action space to get around not knowing the number of clusters a priori. Assuming we're working with n vertices:

1) Let A = {1, 2, ...}^n. The choice of k is implicit: for a particular a \in A, \hat{k} = \max_i (a_i), the largest cluster label assigned to anything.

2) Let A = { (k, X) : k \in {1, 2, ...}, X \in {1, 2, ..., k}^n } \subset {1, 2, ...} \times {1, 2, ...}^n. This is a slightly smaller space than (1), but I can't think of any benefit to the specificity.
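To make encoding (1) concrete, a tiny made-up illustration (just the "read \hat{k} off the largest label" idea; the label vector here is hypothetical):

```python
import numpy as np

# Encoding (1): an action is just a label vector a in {1, 2, ...}^n;
# the number of clusters is implicit as the largest label actually used.
a = np.array([1, 1, 3, 2, 3, 1])   # made-up cluster labels for n = 6 vertices
k_hat = a.max()                    # implicit \hat{k} = 3
```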

In practice, determining the number of clusters is hard. Sandya presents one method above; another method is to randomly subsample your data and see how robust the cluster labels are for each k (basically a drop-out analysis): this tells you essentially how much you are "overfitting" as you increase k. The general procedure is the same: you create some heuristic "loss function" with input k, and look for its minimum.
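A rough sketch of that subsampling/stability check, assuming scikit-learn's KMeans and the adjusted Rand index to score agreement between fits on different subsamples (the data X and all parameters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_pairs=10, frac=0.8, seed=0):
    """Mean agreement (ARI) between k-means fits on pairs of random subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        idx1 = rng.choice(n, size=int(frac * n), replace=False)
        idx2 = rng.choice(n, size=int(frac * n), replace=False)
        km1 = KMeans(n_clusters=k, n_init=10).fit(X[idx1])
        km2 = KMeans(n_clusters=k, n_init=10).fit(X[idx2])
        # Compare the two clusterings via the labels they induce on the full data.
        scores.append(adjusted_rand_score(km1.predict(X), km2.predict(X)))
    return float(np.mean(scores))

X = np.random.rand(300, 5)              # stand-in for real data
for k in range(2, 8):
    print(k, stability_score(X, k))     # low stability at large k hints at overfitting
```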

edunnwe1 commented 9 years ago

I guess this is in part the question Nam addressed in his presentation of his research?