harthur / clustering

K-means and hierarchical clustering
http://harthur.github.com/clustering/demos/colors
MIT License

different runs of k-means clustering result in different outputs #9

Open ghost opened 9 years ago

ghost commented 9 years ago
var colors = [
   [97],
   [1],
   [53],
   [79],
   [3],
   [351],
   [16]
];

var clusters = clusterfck.kmeans(colors, 3);

Result A: [1, 3, 16], [53, 79, 97], [351]
Result B: [1, 3, 16, 53], [79, 97], [351]

bbroeksema commented 9 years ago

That's normal: k-means places the initial seeds (cluster centers) randomly, so each run starts from a different set of seed locations and can produce a (slightly) different outcome. For a nice introduction to k-means and clustering, see http://web.cs.sunyit.edu/~mike/cs542/Jain50YearsBeyondKmeans.pdf
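To illustrate the effect, here is a minimal 1-D sketch of Lloyd's algorithm (not clusterfck's actual implementation), using plain numbers instead of one-element arrays for brevity. Two different seed choices on the data from this issue converge to the two different partitions reported above:

```javascript
// Minimal 1-D sketch of Lloyd's algorithm (not clusterfck's actual code):
// assign each point to its nearest centroid, recompute means, repeat.
function kmeans1d(points, centroids) {
  let clusters = [];
  for (let iter = 0; iter < 100; iter++) {
    clusters = centroids.map(() => []);
    for (const p of points) {
      let best = 0;
      for (let i = 1; i < centroids.length; i++) {
        if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) best = i;
      }
      clusters[best].push(p);
    }
    // recompute each centroid as the mean of its cluster
    const next = clusters.map((c, i) =>
      c.length ? c.reduce((a, b) => a + b, 0) / c.length : centroids[i]);
    if (next.every((m, i) => m === centroids[i])) break; // converged
    centroids = next;
  }
  return clusters;
}

const points = [97, 1, 53, 79, 3, 351, 16];

// Two different choices of initial seeds converge to two different partitions:
const runA = kmeans1d(points, [1, 53, 351]);  // groups {1,3,16}, {53,79,97}, {351}
const runB = kmeans1d(points, [53, 79, 351]); // groups {1,3,16,53}, {79,97}, {351}
```

Both partitions are stable fixed points of the algorithm; which one you land in depends entirely on the random seeds.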

ghost commented 9 years ago

Thanks for the literature. However, this behaviour should be explicitly documented somewhere, because the default k-means implementations in other tools (e.g., R, Weka) can handle such cases.

Ouwen commented 9 years ago

How do R and Weka handle it? Do they use the same random seed for each run?

bbroeksema commented 9 years ago

In R you can pass "centers", which is either the number of clusters (which results in the same non-deterministic behavior) or the actual initial, distinct cluster centers (in which case, I believe, though I haven't actually checked, it behaves deterministically). I don't know about Weka.

user24 commented 9 years ago

You could modify the kmeans function so that instead of calling this.centroids = this.randomCentroids(...), it accepts the centroids as an argument. That would allow different runs to produce the same results.
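A sketch of that change in a simplified 1-D setting (the function bodies and the `randomCentroids` helper are illustrative, not copied from clusterfck's internals): the optional third argument supplies the seeds, and random seeding is only the fallback.

```javascript
// Illustrative sketch, not clusterfck's actual code: accept optional initial
// centroids, fall back to random seeds when none are given.
function randomCentroids(points, k) {
  // Fisher-Yates shuffle a copy, then take the first k points as seeds
  const copy = points.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, k);
}

function kmeans(points, k, centroids) {
  centroids = centroids ? centroids.slice() : randomCentroids(points, k);
  let clusters = [];
  for (let iter = 0; iter < 100; iter++) {
    clusters = centroids.map(() => []);
    for (const p of points) {
      let best = 0;
      for (let i = 1; i < centroids.length; i++) {
        if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) best = i;
      }
      clusters[best].push(p);
    }
    const next = clusters.map((c, i) =>
      c.length ? c.reduce((a, b) => a + b, 0) / c.length : centroids[i]);
    if (next.every((m, i) => m === centroids[i])) break; // converged
    centroids = next;
  }
  return clusters;
}

// Passing the same seeds makes the result reproducible across runs:
const points = [97, 1, 53, 79, 3, 351, 16];
const run1 = kmeans(points, 3, [1, 53, 351]);
const run2 = kmeans(points, 3, [1, 53, 351]);
// run1 and run2 are identical
```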

tayden commented 8 years ago

K-means is often run multiple times, with an error measure computed for each run as the mean squared distance from each point to the centroid of the cluster it belongs to. You can then keep the clustering result that minimizes this error.
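That restart strategy can be sketched like this, again in a simplified 1-D setting (the `kmeansFn` parameter stands in for any k-means routine that returns clusters; the helper names are illustrative):

```javascript
// Within-cluster sum of squared distances: lower means tighter clusters.
function sse(clusters) {
  let total = 0;
  for (const cluster of clusters) {
    if (!cluster.length) continue;
    const mean = cluster.reduce((a, b) => a + b, 0) / cluster.length;
    for (const p of cluster) total += (p - mean) * (p - mean);
  }
  return total;
}

// Run k-means several times and keep the clustering with the lowest error.
function bestOf(runs, points, k, kmeansFn) {
  let best = null;
  let bestErr = Infinity;
  for (let i = 0; i < runs; i++) {
    const clusters = kmeansFn(points, k);
    const err = sse(clusters);
    if (err < bestErr) {
      best = clusters;
      bestErr = err;
    }
  }
  return best;
}
```

With the library itself this would look like `bestOf(10, colors, 3, clusterfck.kmeans)`, assuming the 1-D distance in `sse` is generalized to the dimensionality of the data points.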