k-Means is very slow, scales badly with k

kno10 commented 8 years ago

This variant of k-means (which is overly complicated) scales very badly.

Some observations:

at the beginning of each iteration, the mean is recomputed (in if (!fixed_centers)). Afterwards, it is incrementally updated, but that data is never used. All use of mus in the second loop can be deleted.
in each iteration, each vector will be retrieved three times, and freed three times, where one should be enough ( get_feature_vector).

But that still does not explain my runtime measurements (using 3.2.0 - I only now realized that the version preinstalled with Linux distributions is much older - albeit there does not appear to be any relevant difference): k=10 5.91 seconds, k=100 137.009 seconds, k=1000 2409.55 seconds.

Where usually, the performance would increase by ~7.5 when increasing k by 10 (because of fewer iterations I guess), it increases by ~20 in Shogun with k, making it the slowest k-means I benchmarked except for flexclust (pure R), sklearn 0.14 (fixed since, 100x faster now), and scipy.cluster.vq. In these, the interpreter is probably to blame. For the record, the best K-means-variant I benchmarked on this data is 0.80, 3.76, 7.35; the best Lloyd implementation is 2.15, 42.16, 202.19... so Shogun is 12x slower than necessary, and better algorithms are over 300x faster.

Above observation (duplicate computation of centers) would explain a factor of 2 or 3 maybe. That would be enough to make k=10 competitive, but there is a superlinear cost with increasing k that I cannot discover in Shogun source code.

Note that #1263 suggests to drop legacy code like this completely.

kno10 commented 8 years ago

Update: I found another issue with this k-means. It is also initialized very badly. Every point is randomly assigned to a cluster, which will cause cluster centers to be very close to each other. The more common random initialization is to simply choose k random instances, and use these as centers. The k-means++ initialization does not appear to be available using the command line interface.

kno10 commented 8 years ago

Ouch. It also chooses n random instances during each iteration, instead of using each one time? That only means it randomly ignores about 1/3rd of instances in each iteration: const int32_t Pat=CMath::random(0, XSize-1);

vigsterkr commented 8 years ago

@kno10 yeah indeed when i started some benchmarking of shogun, i've started with k-means and currently i do agree that's it's in a rather horrible state. :(

kno10 commented 8 years ago

For k-means, I've made a pull request #2988 with some of these changes, but I have not tried to compile it, nor tested or benchmarked it. But this may save you some work.

I don't know if anybody in Shogun has much interest in k-means, so #1263 may have a point.

Hierarchical clustering also benchmarked rather slow. It runs out of memory at 50000 and took 32 seconds for 20000 instances. The better implementations all do 20000 in 3 seconds and scale to 500000 instances without problems.

iglesias commented 8 years ago

Hi @kno10, thanks a lot for your analysis and the comments.

Regarding the first observation in your first comment, why do you say that mus is never used in the second loop (the for loop that starts after the block if (!fixed_centers) { ... } ends)?

kno10 commented 8 years ago

mus is only modified, but never read. All distances are computed to the copy in rhs_mus. The algorithm terminates when the centers did not change anymore (then mus has not changed either in the second loop). At the beginning of the next iteration, mus is set to zero again in line 53. Thus, the only case when these updated values are used is if max_iter is hit. Then the updated values from that last iteration are returned - but that is not any better than returning the previous centers to which the points were last assigned (neither is a consistent solution, because it has not converged).

iglesias commented 8 years ago

I think (perhaps I am getting it wrong) that mus is both read and written. Note that in lines 110 and 124 the operator used is -=. It is not probably not the most readable way of doing this :-)

Does that make sense to you?

kno10 commented 8 years ago

Its modifed, but not used afterwards. There are no "side effects" beyond updating mus, are there?

See: it does all this effort to update the means. And then in the next iteration, it is reset to 0, and recomputed from scratch. Does that make sense to you?

iglesias commented 8 years ago

It does, because fixed_centers can be true. However, I do agree with you that all the work done to update mus is a waste in case fixed_centers is false -- which may be the most common use case.

I think we should break up this method into two methods, and use either of them depending on the value of fixed_centers.

It is great that you pointed this out! Thanks a lot.

kno10 commented 8 years ago

fixed_centers makes no sense, because if the centers are fixed, iterating is nonsense - it's not k-means anymore. fixed_centers should be handled by the NN classifier instead, every point is assigned the nearest center as label.

iglesias commented 8 years ago

@kno10, can you share the code you used for the benchmark results you mentioned above? A gist or similar would work fine.

karlnapf commented 8 years ago

Thanks a lot @kno10 for investigating this. I totally am for dropping such disasters, as they are doing more bad than good.

In fact, I am thinking about a GSoC project that is just about going through our "simple" ML algorithms, benchmarking them a bit, identifying problems (as you did here), and then either dropping or re-writing.

karlnapf commented 8 years ago

@kno10 I just had a look. If you are interested, you could attempt a re-write that solves the above problems. This might be a nice entrance task for the mentioned GSoC project as well

kno10 commented 8 years ago

Sorry, my focus is on ELKI. I used Shogun solely for comparison.

karlnapf commented 8 years ago

No worries. Thanks! I used your report in a GSoC project proposal here: https://github.com/shogun-toolbox/shogun/wiki/GSoC_2016_project_fundamental_usual_suspects

Hopefully we resolve a few of those during the summer!

Saurabh7 commented 8 years ago

hi guys, ran some timers on handy clustering datasets, results (seconds) : The shogun(new) is after updating some of the changes discussed above

remove random sampling of instances
remove unnecessary updating of centers (which would mean we have to handle fixed centers separately )

instances = ~5000

k	10	100	1000
scikit	0.09	0.58	3.565
mlpack	0.08	0.493	3.00
shogun	0.18	1.61	8.19
shogun (new)	0.09	0.99	4.96

instances = ~20,000

k	10	100
scikit	2.65	32.53
mlpack	1.14	127.68
shogun	3.07	428.86
shogun (new)	1.66	276.89

What do you guys think? I had already attempted a broader restructuring of KMeans class in #2648 #2815 but we were stuck because apparently there were not enough unit tests.

karlnapf commented 8 years ago

Good to see you back here @Saurabh7

Thanks a lot for this example. These results are quite extreme. Can you post the scripts that generated them somewhere? (Your own public repo for example). I would like to have a look at them.

Do the methods produce the same results?

We now have to find out what makes the other implementations so fast (and copy the design). Then we can attempt to fix things.

The Shogun (new) version that is based on the discussions above seems already much faster, but still way beyond the others -- so maybe rather than re-structuring we can have a look at the other libs and attempt a re-implementation from scratch....after all KMeans is a dead simple algorithm.

One reason for speed differences might be multicore and vectorised linear algebra, which are relatively straight forward to put in our too.

When re-implementing/re-structuring, we should make use of linalg. @lambday can probably help here as well as he ran benchmarks before.

Saurabh7 commented 8 years ago

Thanks @karlnapf , I have posted the scripts at https://github.com/Saurabh7/shogun-benchmarks/tree/master/kmeans . You can reproduce quickly with the python scripts i guess. I haven't looked at the quality of the results yet.

As you said lot of things could be changed: using new SGVector, SGMatrix operators , linalg operations, etc. after which we can compare the results again.

One problem I find with changing the code by re-implementing/structuring is that parts of the current implementation (some of which are oddly different from std implementations in other libraries) are not documented/tested. So maybe we would have to address that first too.

karlnapf commented 8 years ago

Thanks! Some comments:

Sanity check: you build the shogun library with compiler optimizations enabled right? That is DBUILD_TYPE=Release ? and the script as well?

I agree with all the other points. Using std anywhere is not really necessary.

I agree with the documenting/testing. However, if the current code design is slow and needs to be replaced anyways, there is not really a need for this. Maybe as a first step, you could have a look into the sklearn implementation to check what they are doing from a high level perspective (same data-structures and operations?). This should give an answer to the question why it is so fast compared to other libs.

Saurabh7 commented 8 years ago

@karlnapf Yes I did build with DBUILD_TYPE=Release. Updating with a few more changes:

Random initialization : Now chooses k random instances as cluster centers rather than randomly assigning clusters. Then assign nearest points to cluster centers.
Used linalg to scale the mus matrix. (Used a hack here as scaling each row of matrix by different scalar is not possible)

instances = ~20,000

k	10	100
scikit	2.65	32.53
shogun	3.07	428.86
shogun_1	2.02	276.89
shogun_2	5.00	49.51

The increase for k=10 in the latest shogun_2 must be because of the new constant cost of computing distances during initialization.

Now some comparisons with scikit I came across:

Computing distances: For euclidean distances their version is much faster: dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)) . If only one argument changes (as in case of kmeans) they use precomputed values of dot(x, x) using x_squared_norms.

But in shogun we use simple loops for (int32_t i=0; i<alen; i++) result+=CMath::sq(avec[i] - bvec[i]);

To mimic the performance of computing distances in kmeans with varying k i wrote some quick scripts https://github.com/Saurabh7/shogun-benchmarks/tree/master/distance, results(seconds) :

instances = ~ 20,000

k	10	100	1000
scikit	1.94	20.49	226.03
shogun	8.35	77.47	860

Convergence: They use a concept of inertia which is basically within-cluster sum of squares (WCSS) . When the change in it is less that some tolerance tol convergence is declared. In shogun change in each and every point's cluster is checked till there is no change. Maybe this could lead to unnecessary iterations even when the WCSS is not changing a lot.

karlnapf commented 8 years ago

Very good investigation @Saurabh7 I think we should focus on making our distance computation faster then since that seems to be a major game changer. AFAK our Gaussian kernel pre-computes the squared norms of the diagonal of the kernel matrix. Maybe we can change our CDistance class to do the same here.

But let's go very systematically:

Use a precomputed distance and benchmark against a precomputed distance in sklearn. This gives us an estimate on how much the distance computation makes a difference. There is the CCustomDistance in Shogun, dont know how to do this for sklearn, but you could use a hack. Based on the outcome of this, we can identify more potential bottlenecks.
use precomputed x_squared norms in Shogun's Euclidean distance. The CGaussianKernel has inspiration for this.
use linalg dot products to compute the distance rather than the loop over elements
Sometimes computed on multiple data. Look out for loops over distance calls. This should be done via a single call that uses dot products between matrices and then using linalg. We might need to add a method distance_multiple(SGVector indices_a, SGVector indicesb) to the Distance class for this.
Finally, we could parallelise on multi-cores. But let's keep that out for now

The other points, can you comment on each?

scaling every row/column in a matrix by a number should be added to linalg. @lambday can you have a look at this?
random init as you described sounds good. How does scikit do it?
the `shogun_2`` looks much better. But I did not get this "constant cost" reason.
Convergence. I agree, the WCSS is maybe a better idea. But let us only do this after the distance thing is solved.

kno10 commented 8 years ago

@karlnapf:

Precomputing distances does not work, because k-means does not use pairwise distances.
Precomputing squared norms does not help that much for dense data - and k-means is designed for dense data. Using dot products is great if you can exploit sparsity only; for dense data the memory bandwidth is usually the limit and it does not pay off to save the extra sub instructions.
Stopping early (with a WCSS threshold, e.g. tol) should not be the default. It causes non-intuitive results where a point is not assigned to the closest center, or the cluster center is not the mean of the assigned points (if both were the case, it would be converged). It can be good to approximate k-means if you need maximum speed, and for probabilistic approaches such as Mini-Batch-Kmeans it is the best that you can do. But I would default to running until convergence (and stop when no point is reassigned anymore), and keep this optional. @Saurabh7 for fairness, set tol=0 in sklearn and maxiter=10000 (a sufficiently large value) so sklearn also produces a converged result.

karlnapf commented 8 years ago

Thanks for your comments @kno10

You are right. But precomputing the distance allows to benchmark the algorithm itself, without the time spent computing distances. Therefore, it is instructive to hunt for bottlenecks. Especially if we cannot gain a lot via point 2.) we have to find another handle on speed.
True. Memory managment is key. @Saurabh7 maybe you can profile the memory usage to see what is going on, I saw a few "copy" commands on matrices when I had a look earlier. However, precomputing the squared norms helped in the Gaussian kernel case. Doesn't hurt for sure. And we have to start somewhere.
Agreed. @Saurabh7 maybe this can be made an option later (and turned on by default for the minibatch version)

Setting the sklearn tolerance is a good idea to compare

kno10 commented 8 years ago

Wrt. 1: What do you want to precompute? k-means does not use pairwise distances. And the means move around. So how would you precompute anything useful?

karlnapf commented 8 years ago

Ah crap, you are of course correct. Too many things going on at once. Precomputing is not useful at all. Thanks for the hint!

@Saurabh7 how are things going?

Saurabh7 commented 8 years ago

@karlnapf Regarding your previous points

yes scaling every row by a different scalar is not possible in linalg
Random init , yes scikit uses the same concept.
In the new initialization we have to compute distances to centers which was not present in previous version, that is what i meant by a new constant cost. I will have a look at this deeper.
Adding tol for convergence : yes we could make it optional or default to zero.

Regarding squared norms: This could again be a case by case basis, depending on the data as discussed: might help or might not help much, could use up more resources, etc. This could be made optional too. I will try to find a way to compare this.

Next I am trying to use the precomputed squared norms in CEuclideanDistance as you suggested also making changes with new SGVector methods and linalg methods. getting some mixed results. Have some academic commitments so going to be slow for a bit, will update with details soon ^^

karlnapf commented 8 years ago

Great, let us know when you had a chance to continue

Saurabh7 commented 8 years ago

@karlnapf some udpates:

Used SGVector and linalg::dot to compute distances between vectors which led to decreased performance for some reason. Had a chat with @lambday in irc regarding linalg stuff. Anyway, to make this conclusive we decided this might need benchmarking itself with something like vectors starting with 2 dimensions going to say 100k dims logarithmically
used CMath::dot , passing vectors as pointers (shuld be possible with linalg::dot soon with a new wrapper for pointers) along with precomputed x_squared_norms. This gave slightly better performance in some cases:

k	10	100
scik	3.1	32.5
shog	3.6	39.6

As for high level implementation only differences left as I see are the way they handle convergence and some minor things like how to handle empty clusters, etc.

karlnapf commented 8 years ago

Thanks. Interesting that we are still slower! Shogun being slower should not be the case if the Implementation is similar
The linalg dot is interesting -- it should also not be the case ...
For convergence, did you set the sklearn criterion to mimic Shoguns one as we discussed above?
Also, could you put your codes into a notebook and post the link here, so that we can see what is going on in a single file? Like #2991
Finally @kno10 is really right. The implementation is overly complicated. At some point, we should make it simpler to read

Saurabh7 commented 8 years ago

Yes I did set the tolerance to zero. The current implementation is definitely too complicated and is almost 10x slower without this changes. The unnecessary things can be easily removed though. not all implementation details are same, they handle some things with cython too, but we can definitely improve with time. I can keep trying in parallel. Posting the ipy notebook soon.

Saurabh7 commented 8 years ago

@karlnapf alright i think I figured most of the differences out

I was actually wrongly calculating the lhs_squared_norms inside the iterations , fixed that moved it outside. Also removed some complicated linalg:: stuf i was using on matrices, using just linalg::dot on vectors now and it looks good
pre-calculating rhs_squared_norms now too. This is the one I have to do inside iterations after rhs (center matrix) has been updated for the iteration.
also tried something like SGVector<float64_t> CEuclideanDistance::multiple_distance(int32_t idx_a) and computing using matrices, as discussed in your one point, this worked good for bigger k but for smaller k it was not worth. So did not use this.

Results for both libraries after this changes are always very close now for different k ;) Using all this I think I understand why the performance diference was always so big for this data.

Since randomly initialized and results vary while being close, average over 10 runs:

instances = ~20,000

k	10	100	1000
scikit	3.47	27.45	50.7
shogun	2.79	27.01	54.9

karlnapf commented 8 years ago

Nice! Ok, so it seems sklearn does somehow better for large k, where for small k the Python overhead makes it slower. This must be some systematic thing as well. However, this is great work and we should get it merged in the next step. Can you send a PR? The other thing you might want to check is using openmp to parallelise the distance computation among multiple cores. BUT I would do that after your current changes are merged.

karlnapf commented 8 years ago

This is now solved

shogun-toolbox / shogun

k-Means is very slow, scales badly with k #2987