Open ghost opened 9 years ago
That's normal, kmeans places the initial seeds (cluster centers) randomly. So each run will have a different initial set of seed locations, and as such (slightly) different outcomes. See for a nice introduction to k-means and clustering: http://web.cs.sunyit.edu/~mike/cs542/Jain50YearsBeyondKmeans.pdf
Thanks for the literature. However, this behaviour should be explicitly mentioned somewhere, because in other tools (i.e., R, Weka) the default k-means implementation can handle such cases.
How does R and Weka handle it? Do they use the same random seed for each run?
In R you can pass "centers" which is either the number of clusters (which will result in similar undeterministic behavior) or actual initial, distinct, cluster centers (in case, I believe but not actually checked, it will behave deterministic). I don't know about weka.
You could modify the kmeans function so instead of saying this.centroids = this.randomCentroids(...) you could pass the centroids in as an argument. That should allow different runs to produce the same results.
Often K Means is run multiple times and there is an error measurement calculated as the mean square distance of each point to the cluster centroid to which it belongs. You can then use the clustering result that minimizes this error as your centroids.
Result A: [1, 3, 16], [53, 79, 97], [351] Result B: [1, 3, 16, 53], [79, 97], [351]