Closed kno10 closed 8 years ago
Update: I found another issue with this k-means. It is also initialized very badly. Every point is randomly assigned to a cluster, which will cause cluster centers to be very close to each other. The more common random initialization is to simply choose k random instances, and use these as centers. The k-means++ initialization does not appear to be available using the command line interface.
Ouch. It also chooses n
random instances during each iteration, instead of using each one time? That only means it randomly ignores about 1/3rd of instances in each iteration:
const int32_t Pat=CMath::random(0, XSize-1);
@kno10 yeah indeed when i started some benchmarking of shogun, i've started with k-means and currently i do agree that's it's in a rather horrible state. :(
For k-means, I've made a pull request #2988 with some of these changes, but I have not tried to compile it, nor tested or benchmarked it. But this may save you some work.
I don't know if anybody in Shogun has much interest in k-means, so #1263 may have a point.
Hierarchical clustering also benchmarked rather slow. It runs out of memory at 50000 and took 32 seconds for 20000 instances. The better implementations all do 20000 in 3 seconds and scale to 500000 instances without problems.
Hi @kno10, thanks a lot for your analysis and the comments.
Regarding the first observation in your first comment, why do you say that mus
is never used in the second loop (the for
loop that starts after the block if (!fixed_centers) { ... }
ends)?
mus
is only modified, but never read. All distances are computed to the copy in rhs_mus. The algorithm terminates when the centers did not change anymore (then mus has not changed either in the second loop). At the beginning of the next iteration, mus
is set to zero again in line 53.
Thus, the only case when these updated values are used is if max_iter is hit. Then the updated values from that last iteration are returned - but that is not any better than returning the previous centers to which the points were last assigned (neither is a consistent solution, because it has not converged).
I think (perhaps I am getting it wrong) that mus
is both read and written. Note that in lines 110 and 124 the operator used is -=
. It is not probably not the most readable way of doing this :-)
Does that make sense to you?
Its modifed, but not used afterwards. There are no "side effects" beyond updating mus
, are there?
See: it does all this effort to update the means. And then in the next iteration, it is reset to 0, and recomputed from scratch. Does that make sense to you?
It does, because fixed_centers
can be true. However, I do agree with you that all the work done to update mus is a waste in case fixed_centers
is false -- which may be the most common use case.
I think we should break up this method into two methods, and use either of them depending on the value of fixed_centers
.
It is great that you pointed this out! Thanks a lot.
fixed_centers
makes no sense, because if the centers are fixed, iterating is nonsense - it's not k-means anymore. fixed_centers
should be handled by the NN classifier instead, every point is assigned the nearest center as label.
@kno10, can you share the code you used for the benchmark results you mentioned above? A gist or similar would work fine.
Thanks a lot @kno10 for investigating this. I totally am for dropping such disasters, as they are doing more bad than good.
In fact, I am thinking about a GSoC project that is just about going through our "simple" ML algorithms, benchmarking them a bit, identifying problems (as you did here), and then either dropping or re-writing.
@kno10 I just had a look. If you are interested, you could attempt a re-write that solves the above problems. This might be a nice entrance task for the mentioned GSoC project as well
Sorry, my focus is on ELKI. I used Shogun solely for comparison.
No worries. Thanks! I used your report in a GSoC project proposal here: https://github.com/shogun-toolbox/shogun/wiki/GSoC_2016_project_fundamental_usual_suspects
Hopefully we resolve a few of those during the summer!
hi guys, ran some timers on handy clustering datasets, results (seconds) : The shogun(new) is after updating some of the changes discussed above
instances = ~5000
k | 10 | 100 | 1000 |
---|---|---|---|
scikit | 0.09 | 0.58 | 3.565 |
mlpack | 0.08 | 0.493 | 3.00 |
shogun | 0.18 | 1.61 | 8.19 |
shogun (new) | 0.09 | 0.99 | 4.96 |
instances = ~20,000
k | 10 | 100 |
---|---|---|
scikit | 2.65 | 32.53 |
mlpack | 1.14 | 127.68 |
shogun | 3.07 | 428.86 |
shogun (new) | 1.66 | 276.89 |
What do you guys think? I had already attempted a broader restructuring of KMeans class in #2648 #2815 but we were stuck because apparently there were not enough unit tests.
Good to see you back here @Saurabh7
Thanks a lot for this example. These results are quite extreme. Can you post the scripts that generated them somewhere? (Your own public repo for example). I would like to have a look at them.
Do the methods produce the same results?
We now have to find out what makes the other implementations so fast (and copy the design). Then we can attempt to fix things.
The Shogun (new) version that is based on the discussions above seems already much faster, but still way beyond the others -- so maybe rather than re-structuring we can have a look at the other libs and attempt a re-implementation from scratch....after all KMeans is a dead simple algorithm.
One reason for speed differences might be multicore and vectorised linear algebra, which are relatively straight forward to put in our too.
When re-implementing/re-structuring, we should make use of linalg
. @lambday can probably help here as well as he ran benchmarks before.
Thanks @karlnapf , I have posted the scripts at https://github.com/Saurabh7/shogun-benchmarks/tree/master/kmeans . You can reproduce quickly with the python scripts i guess. I haven't looked at the quality of the results yet.
As you said lot of things could be changed: using new SGVector,
SGMatrix
operators , linalg
operations, etc. after which we can compare the results again.
One problem I find with changing the code by re-implementing/structuring is that parts of the current implementation (some of which are oddly different from std implementations in other libraries) are not documented/tested. So maybe we would have to address that first too.
Thanks! Some comments:
Sanity check: you build the shogun library with compiler optimizations enabled right? That is DBUILD_TYPE=Release ? and the script as well?
I agree with all the other points. Using std anywhere is not really necessary.
I agree with the documenting/testing. However, if the current code design is slow and needs to be replaced anyways, there is not really a need for this. Maybe as a first step, you could have a look into the sklearn implementation to check what they are doing from a high level perspective (same data-structures and operations?). This should give an answer to the question why it is so fast compared to other libs.
@karlnapf Yes I did build with DBUILD_TYPE=Release. Updating with a few more changes:
mus
matrix. (Used a hack here as scaling each row of matrix by different scalar is not possible) instances = ~20,000
k | 10 | 100 |
---|---|---|
scikit | 2.65 | 32.53 |
shogun | 3.07 | 428.86 |
shogun_1 | 2.02 | 276.89 |
shogun_2 | 5.00 | 49.51 |
The increase for k=10 in the latest shogun_2 must be because of the new constant cost of computing distances during initialization.
Now some comparisons with scikit I came across:
dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))
. If only one argument changes (as in case of kmeans) they use precomputed values of dot(x, x)
using x_squared_norms
. But in shogun we use simple loops for (int32_t i=0; i<alen; i++) result+=CMath::sq(avec[i] - bvec[i]);
To mimic the performance of computing distances in kmeans with varying k
i wrote some quick scripts https://github.com/Saurabh7/shogun-benchmarks/tree/master/distance, results(seconds) :
instances = ~ 20,000
k | 10 | 100 | 1000 |
---|---|---|---|
scikit | 1.94 | 20.49 | 226.03 |
shogun | 8.35 | 77.47 | 860 |
tol
convergence is declared.
In shogun change in each and every point's cluster is checked till there is no change.
Maybe this could lead to unnecessary iterations even when the WCSS is not changing a lot.Very good investigation @Saurabh7 I think we should focus on making our distance computation faster then since that seems to be a major game changer. AFAK our Gaussian kernel pre-computes the squared norms of the diagonal of the kernel matrix. Maybe we can change our CDistance class to do the same here.
But let's go very systematically:
CCustomDistance
in Shogun, dont know how to do this for sklearn, but you could use a hack. Based on the outcome of this, we can identify more potential bottlenecks.CGaussianKernel
has inspiration for this.linalg
dot products to compute the distance rather than the loop over elementsdistance
calls. This should be done via a single call that uses dot products between matrices and then using linalg. We might need to add a method distance_multiple(SGVector indices_a, SGVector indicesb)
to the Distance class for this.The other points, can you comment on each?
@karlnapf:
sub
instructions.tol
) should not be the default. It causes non-intuitive results where a point is not assigned to the closest center, or the cluster center is not the mean of the assigned points (if both were the case, it would be converged). It can be good to approximate k-means if you need maximum speed, and for probabilistic approaches such as Mini-Batch-Kmeans it is the best that you can do. But I would default to running until convergence (and stop when no point is reassigned anymore), and keep this optional. @Saurabh7 for fairness, set tol=0
in sklearn and maxiter=10000
(a sufficiently large value) so sklearn also produces a converged result.Thanks for your comments @kno10
Setting the sklearn tolerance is a good idea to compare
Wrt. 1: What do you want to precompute? k-means does not use pairwise distances. And the means move around. So how would you precompute anything useful?
Ah crap, you are of course correct. Too many things going on at once. Precomputing is not useful at all. Thanks for the hint!
@Saurabh7 how are things going?
@karlnapf Regarding your previous points
Regarding squared norms: This could again be a case by case basis, depending on the data as discussed: might help or might not help much, could use up more resources, etc. This could be made optional too. I will try to find a way to compare this.
Next I am trying to use the precomputed squared norms in CEuclideanDistance
as you suggested also making changes with new SGVector
methods and linalg
methods. getting some mixed results.
Have some academic commitments so going to be slow for a bit, will update with details soon ^^
Great, let us know when you had a chance to continue
@karlnapf some udpates:
SGVector
and linalg::dot
to compute distances between vectors which led to decreased performance for some reason. Had a chat with @lambday in irc regarding linalg stuff. Anyway, to make this conclusive we decided this might need benchmarking itself with something like vectors starting with 2 dimensions going to say 100k dims logarithmically CMath::dot
, passing vectors as pointers (shuld be possible with linalg::dot
soon with a new wrapper for pointers) along with precomputed x_squared_norms
. This gave slightly better performance in some cases:k | 10 | 100 |
---|---|---|
scik | 3.1 | 32.5 |
shog | 3.6 | 39.6 |
Yes I did set the tolerance to zero. The current implementation is definitely too complicated and is almost 10x slower without this changes. The unnecessary things can be easily removed though. not all implementation details are same, they handle some things with cython too, but we can definitely improve with time. I can keep trying in parallel. Posting the ipy notebook soon.
@karlnapf alright i think I figured most of the differences out
lhs_squared_norms
inside the iterations , fixed that moved it outside. Also removed some complicated linalg::
stuf i was using on matrices, using just linalg::dot
on vectors now and it looks goodrhs_squared_norms
now too. This is the one I have to do inside iterations after rhs
(center matrix) has been updated for the iteration.SGVector<float64_t> CEuclideanDistance::multiple_distance(int32_t idx_a)
and computing using matrices, as discussed in your one point, this worked good for bigger k
but for smaller k
it was not worth. So did not use this.Results for both libraries after this changes are always very close now for different k ;) Using all this I think I understand why the performance diference was always so big for this data.
Since randomly initialized and results vary while being close, average over 10 runs:
instances = ~20,000
k | 10 | 100 | 1000 |
---|---|---|---|
scikit | 3.47 | 27.45 | 50.7 |
shogun | 2.79 | 27.01 | 54.9 |
Nice! Ok, so it seems sklearn does somehow better for large k, where for small k the Python overhead makes it slower. This must be some systematic thing as well. However, this is great work and we should get it merged in the next step. Can you send a PR? The other thing you might want to check is using openmp to parallelise the distance computation among multiple cores. BUT I would do that after your current changes are merged.
This is now solved
This variant of k-means (which is overly complicated) scales very badly.
Some observations:
if (!fixed_centers)
). Afterwards, it is incrementally updated, but that data is never used. All use ofmus
in the second loop can be deleted.get_feature_vector
).But that still does not explain my runtime measurements (using 3.2.0 - I only now realized that the version preinstalled with Linux distributions is much older - albeit there does not appear to be any relevant difference): k=10 5.91 seconds, k=100 137.009 seconds, k=1000 2409.55 seconds.
Where usually, the performance would increase by ~7.5 when increasing k by 10 (because of fewer iterations I guess), it increases by ~20 in Shogun with k, making it the slowest k-means I benchmarked except for flexclust (pure R), sklearn 0.14 (fixed since, 100x faster now), and scipy.cluster.vq. In these, the interpreter is probably to blame. For the record, the best K-means-variant I benchmarked on this data is 0.80, 3.76, 7.35; the best Lloyd implementation is 2.15, 42.16, 202.19... so Shogun is 12x slower than necessary, and better algorithms are over 300x faster.
Above observation (duplicate computation of centers) would explain a factor of 2 or 3 maybe. That would be enough to make k=10 competitive, but there is a superlinear cost with increasing k that I cannot discover in Shogun source code.
Note that #1263 suggests to drop legacy code like this completely.