Open amueller opened 6 years ago
I'm currently working on KMeans with him. I made a benchmark showing that but I'm having doubts because the timings seems way too long for the sizes I used. I'm making more benchmarks. I'll let you know if I get a confirmation.
edit: confirmed. I'm making the plots.
Here are some benchmarks on k-means++ vs k_means. I kind of have to mitigate what I said. The initialization starts to take longer than the clustering only when the number of clusters get close to the number of samples. I'm not sure if it's common that n_sample ~ 10 * n_clusters.
in the first benchmark, the points are sample uniformly at random within [-1,1].
In the second benchmark, the points are generated using make_blobs.
I think (correct me if I'm wrong) that for most use case the ratio init / clustering is reasonably low.
Another thing I find interesting is the comparison between 'full' and 'elkan' algorithms. I though elkan would be faster.
here n_components is n_clusters? Also, can you add some more dimensions? 50 is still pretty low-dimensional. How about 500 and 5000?
Hm I recently posted a reference to a paper that did benchmarks on when which algorithm seems to work best: #10924 maybe see if your results are consistent?
Discussed with @ogrisel yesterday that it would be nice to subsample the data before applying kmeans++ if the data is big, as otherwise the initialization takes longer than the clustering. This needs some benchmarks but could drastically speed up the overall algorithm.