ghamerly / fast-kmeans

Code to speed up k-means clustering. Originally at BaylorCS/baylorml.
http://cs.baylor.edu/~hamerly/software/kmeans.php
MIT License

multithreading center init #4

Open ghamerly opened 6 years ago

ghamerly commented 6 years ago

Migrated from https://github.com/BaylorCS/baylorml/issues/2 (@BenjaminHorn)

I have a fairly big dataset (100M x 10), and as I calculated, it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some tests I realised that only one core does the work, and that most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads so that it runs in parallel, roughly like this:

    double sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
    {
        double sum_distribution = 0.0;
        // here comes the loop over the points in [from, to)
        return sum_distribution;
    }

But those parallel-running functions would all have to read from the same dist2 array and from x. Maybe this is why a cluster loop takes 5-6 s, and it can't be run in parallel to speed things up. Before I start to dig into the topic I just wanted to ask your opinion.
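Something like the following is what I have in mind (just a rough sketch; the chunking and thread count are arbitrary, and it assumes the loop body can be factored into a sumDistribution like the one above). Since each thread would only write the dist2 entries in its own range and only read x, I guess no locking would be needed, just combining the partial sums at the end:

    #include <algorithm>
    #include <thread>
    #include <utility>
    #include <vector>

    double parallelSumDistribution(int n, Dataset const &x,
                                   std::pair<double, int> *dist2,
                                   int numThreads)
    {
        std::vector<std::thread> threads;
        std::vector<double> partial(numThreads, 0.0);
        int chunk = (n + numThreads - 1) / numThreads;

        for (int t = 0; t < numThreads; ++t) {
            int from = t * chunk;
            int to = std::min(n, from + chunk);
            // each thread handles the disjoint range [from, to)
            threads.emplace_back([&partial, &x, dist2, t, from, to]() {
                partial[t] = sumDistribution(from, to, x, dist2);
            });
        }
        for (std::thread &th : threads) {
            th.join();
        }

        double total = 0.0;
        for (int t = 0; t < numThreads; ++t) {
            total += partial[t];
        }
        return total;
    }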

Some other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?

            if (dist2[i].first > max_dist) {
                max_dist = dist2[i].first;
            }

As far as I can see, max_dist isn't used anywhere.

ghamerly commented 6 years ago

The k-means++ initialization is inherently serial, which is why you see only one core doing the work. At least, each major stage of the algorithm requires the previous stages to complete (each iteration of https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L183).

However, the code inside that loop can be parallelized, especially the loop you mention at https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187. It would not take too much work to do this. I probably won't have time in the next few weeks to do it, but if you want to send a pull request I'd be happy to look at it and incorporate it.

One thing I have tried before, which works well but is not in this code, is to apply the triangle inequality to the k-means++ initialization. It avoids distance computations on subsequent iterations and makes the initialization much faster for large k. I may have that code lying around somewhere.
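Roughly, the idea is this (just a sketch with assumed data structures, not the actual code I had): keep track of which chosen center is closest to each point (the existing code already stores that index in dist2), and when a new center is chosen, first compute its distance to the previously chosen centers. The triangle inequality then lets you skip every point whose current closest center is far from the new center:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Stand-in for the repo's distance code: squared Euclidean distance
    // between two d-dimensional points stored as contiguous arrays.
    static double distSquared(const double *a, const double *b, int d)
    {
        double sum = 0.0;
        for (int j = 0; j < d; ++j) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    // Update dist2 after center newIdx has been chosen, skipping points that
    // provably cannot be closer to the new center. dist2[i] holds the squared
    // distance to the closest chosen center and that center's index.
    void updateWithPruning(int n, int d, const double *data,
                           const std::vector<const double *> &centers,
                           int newIdx, std::pair<double, int> *dist2)
    {
        // Squared distances from the new center to every earlier center:
        // at most k-1 extra distance computations per iteration, which is cheap.
        std::vector<double> centerToCenter(newIdx);
        for (int j = 0; j < newIdx; ++j) {
            centerToCenter[j] = distSquared(centers[j], centers[newIdx], d);
        }

        for (int i = 0; i < n; ++i) {
            // Triangle inequality: if d(c_old, c_new) >= 2 * d(x_i, c_old),
            // then d(x_i, c_new) >= d(x_i, c_old), so the new center cannot
            // be closer. In squared terms: d^2(c_old, c_new) >= 4 * d^2(x_i, c_old).
            if (centerToCenter[dist2[i].second] >= 4.0 * dist2[i].first) {
                continue;
            }
            double dsq = distSquared(&data[(std::size_t)i * d], centers[newIdx], d);
            if (dsq < dist2[i].first) {
                dist2[i] = std::make_pair(dsq, newIdx);
            }
        }
    }

For large k, most points end up taking the early-out in later iterations, which is where the speedup comes from.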

Alternatively, there are other methods based on k-means++ that supposedly work faster in parallel; see http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

As for the max_dist variable, I think you're right that it's unused. It's probably left over from some older profiling code.

ghamerly commented 6 years ago

From https://github.com/BaylorCS/baylorml/issues/2#issuecomment-221264049 (@kno10)

It may well be feasible to use random initialization. In my experiments, k-means++ only marginally improved runtime; it may, however, yield better clusters. This depends a lot on the data, and the 8 hours of initialization time may not be worth the improvement.

A simpler, yet effective, strategy for huge data is to initialize using k-means on a sample (if I'm not mistaken, k-means|| is similar to running k-means++ on a distributed sample?). That is: randomly draw 1 million points (or even just 100k), run k-means++ on them, and do just 3-5 Lloyd iterations. Take the resulting centers as initial centers for your complete data set. Finding these centers should take maybe 10 minutes, and they will often be better initial centers than plain k-means++ on the full data; they may already be very good estimates.
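A rough sketch of that recipe (the two helpers are hypothetical stand-ins, e.g. for this repository's init_centers_kmeanspp_v2 and a few Lloyd iterations; this is not code from either project):

    #include <cstddef>
    #include <random>
    #include <vector>

    // Hypothetical stand-ins for routines that exist in some form in most
    // k-means code bases; names and signatures are assumptions.
    std::vector<std::vector<double>> kmeansppInit(
        const std::vector<std::vector<double>> &points, int k);
    void lloydIterations(const std::vector<std::vector<double>> &points,
                         std::vector<std::vector<double>> &centers,
                         int iterations);

    // Draw sampleSize points uniformly at random, run k-means++ plus a few
    // Lloyd iterations on the sample only, and return the resulting centers,
    // which can then be used to initialize k-means on the full data set.
    std::vector<std::vector<double>> sampleBasedInit(
        const std::vector<std::vector<double>> &data, int k,
        std::size_t sampleSize, int iterations, unsigned seed)
    {
        std::mt19937 gen(seed);
        std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);

        std::vector<std::vector<double>> sample;
        sample.reserve(sampleSize);
        for (std::size_t i = 0; i < sampleSize; ++i) {
            sample.push_back(data[pick(gen)]); // sampling with replacement, for brevity
        }

        std::vector<std::vector<double>> centers = kmeansppInit(sample, k);
        lloydIterations(sample, centers, iterations); // e.g. 3-5 iterations
        return centers;
    }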

Below are some numbers from ELKI (not baylorml), on a data set of 38 million 2d coordinates with k=100, run on a single core and not on a dedicated benchmarking host. For more reliable conclusions, this would need to be run on many data sets, with different values of k, many more seeds, and on dedicated systems.

SortMeans with random initialization, seed 0: 420 s, 41 iterations, varsum 4.471356921340852E8
SortMeans with random initialization, seed 1: 1000 s, 102 iterations, varsum 3.494018158732119E8
SortMeans with random initialization, seed 2: 570 s, 47 iterations, varsum 3.690071545123777E8
SortMeans with k-means++ initialization, seed 0: 810 s, 77 iterations, varsum 3.7151307775374913E8
SortMeans with k-means++ initialization, seed 1: 774 s, 69 iterations, varsum 3.8855969067854965E8
SortMeans with k-means++ initialization, seed 2: 492 s, 39 iterations, varsum 3.374139162253591E8

So the main benefit of k-means++ is that it is less likely to get stuck in a bad local minimum. If we now initialize via k-means++ on a sample, we should get good results with, hopefully, fewer iterations.

SortMeans initialized with 5it,km++,1%, seed 0: 624 s, 55 iter, varsum 3.538176164640555E8
SortMeans initialized with 5it,km++,1%, seed 1: 850 s, 85 iter, varsum 4.0328181944224614E8
SortMeans initialized with 5it,km++,1%, seed 2: 593 s, 56 iter, varsum 3.590163352246089E8

And, unfortunately, reality is much more complex... The performance is not bad, but also not substantially better. We would need much larger experiments to even be able to tell what is "usually" better. Overall, a lucky or bad random seed seems to be the largest factor.

The parameters used for the "sample k-means" combination in ELKI:

-time
-algorithm clustering.kmeans.KMeansSort -kmeans.k 100
-kmeans.initialization SampleKMeansInitialization -kmeans.seed 0
-kmeans.algorithm KMeansSort
-kmeans.initialization KMeansPlusPlusInitialMeans -kmeans.seed 0
-kmeans.maxiter 5 -kmeans.samplesize 0.01
-evaluator clustering.LogClusterSizes -resulthandler DiscardResultHandler

If you want to experiment with different seeds, I have contributed a seed command in 27d970511f46321012674498d99538c37f6db651. To reproduce this sampling-based initialization in this C++ implementation, you would need to further extend the driver, e.g. with a function that uses precomputed centers for initialization.