scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Make kmeans++ subsample data for initialization on large n_samples or n_clusters #11493

Open amueller opened 6 years ago

amueller commented 6 years ago

Discussed with @ogrisel yesterday that it would be nice to subsample the data before applying kmeans++ if the data is big, as otherwise the initialization takes longer than the clustering. This needs some benchmarks but could drastically speed up the overall algorithm.
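A minimal sketch of the proposed idea, done from user code rather than inside the estimator: run k-means++ on a random subsample only, then hand the resulting centers to `KMeans` as the explicit `init`. The subsample size of 2 000 is a hypothetical choice that would need benchmarking.

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, centers=50, random_state=0)
n_clusters = 50
subsample_size = 2_000  # hypothetical; the right size would need benchmarks

rng = np.random.default_rng(0)
idx = rng.choice(X.shape[0], size=subsample_size, replace=False)

# k-means++ only sees the subsample, so the init cost scales with
# subsample_size * n_clusters instead of n_samples * n_clusters.
centers, _ = kmeans_plusplus(X[idx], n_clusters=n_clusters, random_state=0)

# The Lloyd iterations still run on the full data.
km = KMeans(n_clusters=n_clusters, init=centers, n_init=1).fit(X)
print(km.cluster_centers_.shape)
```

Passing precomputed centers via `init` with `n_init=1` avoids re-running the (expensive) seeding, which is essentially what this issue proposes to automate internally for large `n_samples` or `n_clusters`.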

jeremiedbb commented 6 years ago

I'm currently working on KMeans with him. I made a benchmark showing that, but I'm having doubts because the timings seem way too long for the sizes I used. I'm running more benchmarks and will let you know if I get a confirmation.

edit: confirmed. I'm making the plots.

jeremiedbb commented 6 years ago

Here are some benchmarks on the k-means++ initialization vs the k-means iterations. I have to temper what I said somewhat: the initialization only starts to take longer than the clustering when the number of clusters gets close to the number of samples. I'm not sure how common it is to have n_samples ~ 10 * n_clusters.

In the first benchmark, the points are sampled uniformly at random within [-1, 1]. [benchmark plot]

In the second benchmark, the points are generated using make_blobs. [benchmark plot]

I think (correct me if I'm wrong) that for most use cases the init/clustering time ratio is reasonably low.
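The init/clustering ratio discussed above can be measured directly by timing k-means++ on its own and then timing `KMeans` started from those centers. This is only a rough sketch (uniform points, as in the first benchmark; sizes chosen arbitrarily), not the benchmark script used for the plots.

```python
import time
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10_000, 50))  # uniform points, as in the first benchmark
n_clusters = 100

# Time the k-means++ seeding alone.
t0 = time.perf_counter()
centers, _ = kmeans_plusplus(X, n_clusters=n_clusters, random_state=0)
t_init = time.perf_counter() - t0

# Time the Lloyd iterations starting from those centers.
t0 = time.perf_counter()
KMeans(n_clusters=n_clusters, init=centers, n_init=1).fit(X)
t_cluster = time.perf_counter() - t0

print(f"init/clustering time ratio: {t_init / t_cluster:.2f}")
```

Sweeping `n_clusters` toward `n_samples` with a harness like this is what would show the crossover point where initialization dominates.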

Another thing I find interesting is the comparison between the 'full' and 'elkan' algorithms. I thought elkan would be faster.

amueller commented 6 years ago

Here, n_components means n_clusters? Also, can you add some more dimensions? 50 is still pretty low-dimensional; how about 500 and 5000?

amueller commented 6 years ago

Hm, I recently posted a reference to a paper that benchmarked when each algorithm seems to work best: #10924. Maybe see if your results are consistent?