mlpack / benchmarks

Machine Learning Benchmark Scripts

Include ELKI in the benchmarks #9

Closed. kno10 closed this issue 6 years ago.

kno10 commented 8 years ago

ELKI has much faster clustering implementations than Weka. In particular, it includes some fast k-means variants. It would be interesting to see how it competes in your benchmarks.

It's Java, but it can easily be launched from the command line. The GUI will assist in choosing the command line parameters.

With -time it will output some timing information, and -resulthandler DiscardResultHandler will prevent writing the result to stdout. A basic KMeans command line looks like this:

java -jar elki-bundle-0.7.1.jar cli \
-dbc.in waveform.csv \
-algorithm clustering.kmeans.KMeansCompare -kmeans.k 2 \
-time -resulthandler DiscardResultHandler

and the output then is:

de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 86 ms
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.initialization: de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.initialization.RandomlyChosenInitialMeans@6debcae2
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 667233.4919999999
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 10000
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 266349.30295619805
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 20001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263882.95153705264
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 30001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263825.90708396287
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 40001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263814.94259269274
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 50001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263812.85525821644
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 60001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263811.528195029
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 70001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263810.6052650743
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 80001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.variance-sum: 263810.4244548899
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.distance-computations: 90001
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.iterations: 8
de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.KMeansCompare.runtime: 83 ms

Of course it will not scale as well as a C++ application, but it will be much more competitive than Weka. I would appreciate it if you included it in your benchmarks. As you can see, it took about as long to cluster as it took to read the input file...

ELKI has several k-means variants. Sort, Compare, Elkan and Hamerly are the faster versions. Lloyd is usually not competitive with these.

rcurtin commented 8 years ago

Hey Erich,

Good to hear from you. (We met at, I think, ICML some years ago. Or some conference. I don't remember...)

The faster k-means algorithms are definitely better choices the vast majority of the time. These would be easy to include, but I think that it would be a lot easier if you could write a Python script that could do the benchmarking. Then the benchmarking system could collect the results easily. Are you willing to do this? I think you are the ELKI expert so it would probably be easiest if you wrote the benchmarking scripts. :)

You could base an implementation on the Weka implementation: https://github.com/zoq/benchmarks/blob/master/methods/weka/kmeans.py

I think it would be great to include more libraries in the benchmarking system.

kno10 commented 8 years ago

I believe it was at SDM in Philadelphia where we met. When I stumbled across MLpack again recently I remembered.

For integrating ELKI, the command line as shown above should do; the MiniGUI will help with other algorithms (the dropdown for the algorithms will list all the variants we currently have).

I had a look at your benchmark system. (I have my own benchmarking system here, but no results published from it so far - it's a rather unreadable mess of Makefiles that only runs missing experiments, along with launcher scripts.) So far I have not yet included MLpack, but I have sklearn, scipy (without sklearn), R, Julia, Octave, ELKI, Weka, SPMF, Smile, JSAT, Spark, Mahout, HubMiner, Apache Commons, and some low-level code. But I only test three algorithms currently, on just two data sets.

I didn't find my way around your benchmark easily, and cloning the data sets just didn't do anything here. I managed to get the waveform.csv above via the web site; maybe it's Bitbucket. I'm pretty sure it's easier for you to just put that command line into your benchmarking system (and maybe a regexp to capture that last "runtime: [0-9]+ ms" line) than for me to try to get it all working here. Otherwise, I could only submit untested code, and I'm still using Python 2, not 3.
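To sketch what I mean anyway (untested, written off the top of my head; the jar name and algorithm class are simply the ones from my example above, and I am guessing at what your runner wants back):

import re
import subprocess

# Untested sketch of an ELKI k-means wrapper for the benchmarking system.
# The jar and algorithm class are just taken from the example above.
def run_elki_kmeans(dataset, k, jar="elki-bundle-0.7.1.jar",
                    algorithm="clustering.kmeans.KMeansCompare"):
    cmd = ["java", "-jar", jar, "cli",
           "-dbc.in", dataset,
           "-algorithm", algorithm, "-kmeans.k", str(k),
           "-time", "-resulthandler", "DiscardResultHandler"]
    # ELKI writes the -time statistics to its log; capture both streams.
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode("utf-8")
    # Grab the last "<algorithm>.runtime: NN ms" line.
    times = re.findall(r"\.runtime: (\d+) ms", out)
    if not times:
        raise RuntimeError("no runtime line found in ELKI output")
    return int(times[-1]) / 1000.0  # seconds

if __name__ == "__main__":
    print(run_elki_kmeans("waveform.csv", 2))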

rcurtin commented 8 years ago

Ah, yeah, it was SDM, I remember now.

I think Bitbucket is very slow sometimes; I've had problems cloning the datasets repository before. Sometimes it takes a very long time...

I'll try to help when I have time, but I'm pretty overloaded with other things, so it may be quite a while indeed before I'm able to do this.

kno10 commented 8 years ago

Btw, I've just been adding MLpack to my benchmarks, using the precompiled binaries from Debian. The kd-tree Boruvka is nice; the regular Boruvka seems to be much slower than Prim's algorithm would be (on a dense matrix, Prim is O(n^2), but Boruvka is O(n^2 log n), if I am not mistaken). Elkan k-means as well as the dual-tree k-means are very slow / fail for me. There must be something wrong with Elkan's - 8000 sec as opposed to 53s for Hamerly's clearly spells "bug" to me (Hamerly's as well as my implementation of Elkan take 10 sec).

rcurtin commented 8 years ago

On Tue, Mar 22, 2016 at 09:52:54AM -0700, Erich Schubert wrote:

Btw, I've just been adding MLpack to my benchmarks, using the precompiled binaries from Debian. The kd-tree Boruvka is nice; the regular Boruvka seems to be much slower than Prim's algorithm would be (on a dense matrix, Prim is O(n^2), but Boruvka is O(n^2 log n), if I am not mistaken).

Yes, the "naive" algorithm implemented there is not meant to be fast.

Elkan k-means as well as the dual-tree k-means are very slow / fail for me. There must be something wrong with Elkan's.

The dual-tree k-means algorithm is for very specific cases when N is quite large and k is also quite large. If you have a dataset with N in the hundreds of thousands or more and k in the hundreds or thousands and the dimensionality is not too huge (like maybe 50 or less) then it will be competitive. For small datasets, not at all.

Can you tell me more about how the Elkan algorithm is failing? Elkan's algorithm uses a lot of RAM but I don't think there are any particular problems in mlpack's implementation.

Ryan Curtin | "Kill them, Machine... kill them all." ryan@ratml.org | - Dino Velvet

kno10 commented 8 years ago

N is 110250, and k is 100. Is that large enough? d is 8, so indexes work well. Elkan worked, but was way too slow, 800x slower than expected. Maybe it hadn't failed, but I had killed it in the previous attempt. My best guess is that they do all 10000 iterations (the iteration limit I have set).

rcurtin commented 8 years ago

On Tue, Mar 22, 2016 at 11:36:28AM -0700, Erich Schubert wrote:

N is 110250, and k is 100. Is that large enough? d is 8, so indexes work well. Elkan worked, but was way too slow, 800x slower than expected. Maybe it hadn't failed, but I had killed it in the previous attempt. My best guess is that they do all 10000 iterations (the iteration limit I have set).

Benchmarking k-means is difficult. Are you checking that every centroid is converging to the same final location? If your runs encounter empty clusters at some point, every library is going to handle this differently. For instance if you run mlpack without the --allow_empty_clusters option, it will try to reinitialize the centroids when they are empty.

Also worth considering is that if you are benchmarking k-means with initial centroids that result in empty clusters, then your "effective" k value is actually smaller than it would otherwise be.
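As a quick sanity check, something like this would show the "effective" k. (Just a sketch; it assumes you can get one cluster label per point out of whatever tool you are benchmarking, either as a plain label file or as a label appended as the last column of each row.)

import collections
import sys

# Sketch: count how many clusters are actually non-empty, given a file with
# one cluster label per point (or the label in the last column of each row).
def effective_k(labels_file, k):
    counts = collections.Counter()
    with open(labels_file) as f:
        for line in f:
            line = line.strip()
            if line:
                counts[line.split(",")[-1]] += 1
    print("non-empty clusters: %d of %d" % (len(counts), k))
    for label, size in sorted(counts.items(), key=lambda x: -x[1]):
        print("  cluster %s: %d points" % (label, size))

if __name__ == "__main__":
    effective_k(sys.argv[1], int(sys.argv[2]))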

Elkan's algorithm is going to get very slow with large datasets since it maintains 2kN bounds. If other implementations of Elkan's algorithm with the same inputs (and that did the same number of iterations) are 800x faster than mlpack's, then that's worth looking into, but I'm still not sure that we have an issue here.

That k and N may be too small for the dual-tree algorithm to really work well. (In addition to k and N and d, the runtime will also depend on the distribution of the centroids at each iteration and also the distribution of the points.) At that dataset size I would expect the dual-tree algorithm to be within the same order of magnitude of runtime as the Pelleg-Moore algorithm.

Ryan Curtin | "Give a man a gun and he thinks he's Superman. ryan@ratml.org | Give him two and he thinks he's God." - Pang

kno10 commented 8 years ago

It's similar with k=1000, too. dualtree-covertree fails with a sig11 segmentation fault. I have been using -e to allow empty clusters, as this appears to be the default in most other implementations.

First run has not completed yet, so the "average" is over 1 sample:

out/MLpack-2.0.1/aloi-hsb8-km-k100-pelleg:   avg:   10.3
out/MLpack-2.0.1/aloi-hsb8-km-k100-hamerly:  avg:   22.4
out/MLpack-2.0.1/aloi-hsb8-km-k100-lloyd:    avg:   53.63
out/MLpack-2.0.1/aloi-hsb8-km-k100-dualtree: avg: 4555.73
out/MLpack-2.0.1/aloi-hsb8-km-k100-elkan:    avg: 8689.5
(dualtree-covertree segfaulted.)
out/MLpack-2.0.1/aloi-hsb8-km-k1000-pelleg:  avg:   33.84
out/MLpack-2.0.1/aloi-hsb8-km-k1000-lloyd:   avg:  325.63
(elkan is not yet finished with the first run, at 12+ hours...)

From these results, elkan, dualtree and dualtree-covertree are defective.

For comparison, some other results (average of 11 runs):

out/ELKI-0.7.1/aloi-hsb8-km-k100-hamerly:  avg:  8.05636
out/ELKI-0.7.1/aloi-hsb8-km-k100-elkan:    avg: 10.5909
out/ELKI-0.7.1/aloi-hsb8-km-k100-macqueen: avg: 13.1809
out/ELKI-0.7.1/aloi-hsb8-km-k100-lloyd:    avg: 24.5155

Your benchmark only includes the default k-means for MLpack right now, I assume? It would be good to include all variants; also to check that they are all working correctly.

rcurtin commented 8 years ago

On Thu, Mar 24, 2016 at 01:58:26AM -0700, Erich Schubert wrote:

It's similar with k=1000, too. dualtree-covertree fails with a sig11 segmentation fault. I have been using -e to allow empty clusters, as this appears to be the default in most other implementations.

If I can get a copy of the dataset and initial centroids you are using, I can take a look into the cover tree failure.

Note that using -e does not guarantee that the number of iterations will be the same as with other toolkits; that is something that you really need to check. Different tolerance and termination conditions could mean that ELKI is only doing half as many iterations as mlpack.

Consider this: some implementations, when encountering empty clusters, might set that centroid to [DBL_MAX DBL_MAX DBL_MAX ... ]; some others might blacklist it, and some others still might set it to [0 0 0 0 ...] (or might not even set it at all), and the behavior may be different depending on the dataset. I don't remember what mlpack does here and I'm not certain of what ELKI does either. But what I do remember is that I spent a lot of time looking into this some years ago and my conclusion is that you must choose initial centroids that converge with no empty clusters.

Also, is it a useful benchmark if you say you set k to 1000, but 995 clusters are empty? Then effectively k is 5 and you have a misleading benchmark.

First run has not completed yet, so the "average" is over 1 sample:

out/MLpack-2.0.1/aloi-hsb8-km-k100-pelleg:   avg:   10.3
out/MLpack-2.0.1/aloi-hsb8-km-k100-hamerly:  avg:   22.4
out/MLpack-2.0.1/aloi-hsb8-km-k100-lloyd:    avg:   53.63
out/MLpack-2.0.1/aloi-hsb8-km-k100-dualtree: avg: 4555.73
out/MLpack-2.0.1/aloi-hsb8-km-k100-elkan:    avg: 8689.5
(dualtree-covertree segfaulted.)
out/MLpack-2.0.1/aloi-hsb8-km-k1000-pelleg:  avg:   33.84
out/MLpack-2.0.1/aloi-hsb8-km-k1000-lloyd:   avg:  325.63
(elkan is not yet finished with the first run, at 12+ hours...)

From these results, elkan, dualtree and dualtree-covertree are defective.

If the Pelleg-Moore algorithm only took 10 seconds and the dual-tree algorithm took 4500 seconds, then it would indeed seem like something is wrong. But I am not yet convinced that this is not an issue with initial centroids and convergence conditions.

For comparison, some other results (average of 11 runs):

out/ELKI-0.7.1/aloi-hsb8-km-k100-hamerly:  avg:  8.05636
out/ELKI-0.7.1/aloi-hsb8-km-k100-elkan:    avg: 10.5909
out/ELKI-0.7.1/aloi-hsb8-km-k100-macqueen: avg: 13.1809
out/ELKI-0.7.1/aloi-hsb8-km-k100-lloyd:    avg: 24.5155

Your benchmark only includes the default k-means for MLpack right now, I assume?

Yes, the benchmarks right now are only for the naive case. I have not had time to redo them for every case.

It would be good to include all variants; also to check that they are all working correctly.

Each of the mlpack implementations is rigorously tested. Barring some unusual edge case, I would be surprised if there is a bug here.

If you send me the datasets, I will attempt to reproduce your results.

Ryan Curtin | "Do they hurt?" ryan@ratml.org | - Jessica 6

kno10 commented 8 years ago

The data set is here (you need to remove the labels for your parser): http://elki.dbs.ifi.lmu.de/datasets/multiview/aloi/aloi-hsb-2x2x2.csv.gz The command line used is: mlpack_kmeans -i data/aloi-hsb-2x2x2-nolabels.csv -c 100 --seed 1 -m 100000 -e -a elkan and the MLpack version used is 2.0.1.

Yes, tools react differently to empty clusters. I wouldn't be surprised if many go NaN and kill that centroid. But runtimes across different toolkits are fairly consistent, except for such outliers (but e.g. Lloyd and Hamerly in MLpack are consistent, too). For fair benchmarking, I also try to disable any threshold-based convergence; all implementations are expected to stop only when no points are reassigned (and I'm fairly certain none of the "fast" ones I have in the benchmark stops early). Because not all implementations allow easy choosing of initial centers, it is hard to configure them all with the exact same starting conditions.

ELKI should leave a center where it is if no points were assigned, and stop only when no reassignments happen anymore. Sometimes the center will be "resurrected", unless it was initialized very badly. A nice effect of this is that it ensures clean convergence: you can probably get stuck in an endless loop if you resurrect empty clusters at a different location.
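The policy I mean is roughly this (a sketch of the idea, not ELKI's actual code):

import numpy as np

# Sketch of the convergence/empty-cluster policy described above: iterate
# until no point changes its assignment, and leave an empty cluster's center
# where it is (so it may be "resurrected" in a later iteration).
# The brute-force distance matrix is purely illustrative.
def lloyd_until_stable(X, centers, max_iter=100000):
    assign = np.full(len(X), -1)
    for it in range(max_iter):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            return centers, assign, it  # converged: nothing was reassigned
        assign = new_assign
        # Recompute the means; an empty cluster keeps its previous center.
        for j in range(len(centers)):
            members = X[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign, max_iter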

Here are the cluster sizes of ELKI's Elkan:

741 822 1075 1320 685 1527 443 833 3803 892
371 2021 633 1434 834 530 1000 982 1787 773
757 618 618 192 591 1060 246 3185 1929 467
1127 174 340 2679 2038 1825 370 1655 756 366
656 1662 463 3934 1136 1271 1118 2481 2176 705
386 452 1207 1251 2288 1417 479 599 2354 1574
751 3254 113 1170 244 3801 371 190 739 1097
1244 496 427 1521 884 4006 532 519 165 450
585 197 1968 785 1350 424 1193 759 141 471
540 1636 1112 457 872 226 1322 1425 160 1545

i.e. they have not degenerated; there are no empty clusters.

rcurtin commented 8 years ago

On Thu, Mar 24, 2016 at 08:04:42AM -0700, Erich Schubert wrote:

The data set is here (you need to remove the labels for your parser): http://elki.dbs.ifi.lmu.de/datasets/multiview/aloi/aloi-hsb-2x2x2.csv.gz The command line used is: mlpack_kmeans -i data/aloi-hsb-2x2x2-nolabels.csv -c 100 --seed 1 -m 100000 -e -a elkan and the MLpack version used is 2.0.1.

Thanks for the information and the dataset. Yes, it appears you have found a bug in mlpack, but I think this arises because you are running with too many empty clusters:

$ bin/mlpack_kmeans -i ~/datasets/aloi-hsb-2x2x2-nolabels.csv -c 100 -m 1 -e -a hamerly -v
[WARN ] --output_file, --in_place, and --centroid_file are not set; no results will be saved.
[INFO ] Loading '/home/ryan/datasets/aloi-hsb-2x2x2-nolabels.csv' as CSV data. Size is 8 x 110250.
[INFO ] Hamerly prunes: 0.
[INFO ] Cluster 0 is empty.
[INFO ] Cluster 2 is empty.
[INFO ] Cluster 4 is empty.
[INFO ] Cluster 8 is empty.
[INFO ] Cluster 13 is empty.
[INFO ] Cluster 14 is empty.
[INFO ] Cluster 18 is empty.
[INFO ] Cluster 19 is empty.
[INFO ] Cluster 20 is empty.
[INFO ] Cluster 21 is empty.
[INFO ] Cluster 22 is empty.
[INFO ] Cluster 25 is empty.
[INFO ] Cluster 26 is empty.
[INFO ] Cluster 28 is empty.
[INFO ] Cluster 31 is empty.
[INFO ] Cluster 35 is empty.
[INFO ] Cluster 36 is empty.
[INFO ] Cluster 37 is empty.
[INFO ] Cluster 38 is empty.
[INFO ] Cluster 39 is empty.
[INFO ] Cluster 40 is empty.
[INFO ] Cluster 44 is empty.
[INFO ] Cluster 45 is empty.
[INFO ] Cluster 47 is empty.
[INFO ] Cluster 49 is empty.
[INFO ] Cluster 50 is empty.
[INFO ] Cluster 53 is empty.
[INFO ] Cluster 55 is empty.
[INFO ] Cluster 58 is empty.
[INFO ] Cluster 59 is empty.
[INFO ] Cluster 61 is empty.
[INFO ] Cluster 63 is empty.
[INFO ] Cluster 64 is empty.
[INFO ] Cluster 65 is empty.
[INFO ] Cluster 66 is empty.
[INFO ] Cluster 67 is empty.
[INFO ] Cluster 73 is empty.
[INFO ] Cluster 75 is empty.
[INFO ] Cluster 80 is empty.
[INFO ] Cluster 82 is empty.
[INFO ] Cluster 83 is empty.
[INFO ] Cluster 84 is empty.
[INFO ] Cluster 85 is empty.
[INFO ] Cluster 87 is empty.
[INFO ] Cluster 88 is empty.
[INFO ] Cluster 89 is empty.
[INFO ] Cluster 91 is empty.
[INFO ] Cluster 94 is empty.
[INFO ] Cluster 95 is empty.
[INFO ] Cluster 96 is empty.
[INFO ] Cluster 97 is empty.
[INFO ] KMeans::Cluster(): iteration 1, residual inf.
...

So really this trial is not benchmarking with 100 centroids, it's benchmarking with about half that. If ELKI is not encountering empty centroids, it's because the initialization strategy is different, and thus the two implementations can't be directly compared like this. The only comparison you can make is "how do these libraries perform with their defaults?", and that is an entirely different question.

Bad initialization gives bad outputs. I will look into why the Elkan and dual-tree implementations do not recover from the inf residual, and I will update this ticket when I find (and fix) the issue.

But runtimes across different toolkits are fairly consistent

Are you sure?

$ for i in `seq 1 10`; do bin/mlpack_kmeans -i ~/datasets/aloi-hsb-2x2x2-nolabels.csv -c 100 -m 100000 -e -a pelleg-moore -v | grep 'clustering\|converged'; done
[INFO ] KMeans::Cluster(): converged after 168 iterations.
[INFO ] clustering: 3.318723s
[INFO ] KMeans::Cluster(): converged after 363 iterations.
[INFO ] clustering: 6.954681s
[INFO ] KMeans::Cluster(): converged after 249 iterations.
[INFO ] clustering: 4.974800s
[INFO ] KMeans::Cluster(): converged after 215 iterations.
[INFO ] clustering: 4.041805s
[INFO ] KMeans::Cluster(): converged after 181 iterations.
[INFO ] clustering: 3.637091s
[INFO ] KMeans::Cluster(): converged after 106 iterations.
[INFO ] clustering: 2.074183s
[INFO ] KMeans::Cluster(): converged after 185 iterations.
[INFO ] clustering: 3.633194s
[INFO ] KMeans::Cluster(): converged after 103 iterations.
[INFO ] clustering: 2.046965s
[INFO ] KMeans::Cluster(): converged after 106 iterations.
[INFO ] clustering: 2.162690s
[INFO ] KMeans::Cluster(): converged after 160 iterations.
[INFO ] clustering: 3.035366s

Not only are the number of iterations and the runtime seeing a lot of variance, but as you increase the dataset size and the number of centroids this effect is likely to increase. And on top of that, the initialization strategy that is used for k-means can significantly affect how many iterations are taken until convergence. You can use the Bradley-Fayyad refined start approach with mlpack (-r) and that may improve things (also, it may not).

Because not all implementations allow easy choosing of initial centers, it is hard to configure them all with the exact same starting conditions.

This may be true, but in those cases I simply have ignored libraries where you can't specify the starting points, since I believe that these benchmarks are somewhat without value due to the high amount of variability. I think for Shogun I actually had to submit a patch upstream to get the support, if I remember right.

I will let you know when I have worked out the issue with the Elkan and dual-tree implementations, but I strongly urge you to reconsider your approach here, because with all due respect I think this benchmarking strategy is flawed.

Ryan Curtin | "Great party, isn't it?" ryan@ratml.org | - Delbert Grady

kno10 commented 8 years ago

"Fairly consistent" as e.g. in Hamerly consistently outperforming Lloyd, and also tool A consistently outperforming tool B in many cases. Of course there is quite some variance due to random initialization, but it's on a much smaller magnitude than the other factors observed. Shogun for example very clearly had a major problem in this benchmark (reported and resolved by now). Maybe you copied the bad initalization scheme from them? From a quick look, it appears as if you are doing the same in MLlib. If you assign every point to a cluster at random, then all means will initially be close to the global mean, very close to each other. This is maybe what is causing all the empty clusters you are seeing. A much simpler and much better approach is simply choosing k objects at random (attributed to Forgy). I could not find a source that suggested to assign points randomly to clusters (as discussed above, this is not a very good strategy).

rcurtin commented 8 years ago

On Thu, Mar 24, 2016 at 02:03:51PM -0700, Erich Schubert wrote:

"Fairly consistent" as e.g. in Hamerly consistently outperforming Lloyd, and also tool A consistently outperforming tool B in many cases. Of course there is quite some variance due to random initialization, but it's on a much smaller magnitude than the other factors observed.

Fair enough; it depends on what you plan on using the benchmarks to say. For the benchmarking tasks I am interested in, I need to ensure that the initial centroids are all the same.

It might be worth trying to ensure that at the very least the centroid initialization strategy is the same, but honestly I think that's an even harder task in practice than simply ensuring the initial centroids are the same.

I still think that the variance you'll see with larger datasets will be more and more significant, but I don't have data to back that up. I also don't know how large your benchmarks are going, so I don't know how much that will matter.

Shogun for example very clearly had a major problem in this benchmark (reported and resolved by now). Maybe you copied the bad initalization scheme from them? From a quick look, it appears as if you are doing the same in MLlib. If you assign every point to a cluster at random, then all means will initially be close to the global mean, very close to each other. This is maybe what is causing all the empty clusters you are seeing. A much simpler and much better approach is simply choosing k objects at random (attributed to Forgy). I could not find a source that suggested to assign points randomly to clusters (as discussed above, this is not a very good strategy).

You are correct, this is a good point. I wrote that code many years ago, so we can assume that 2010 Ryan was stupid and hope that 2016 Ryan can do something less stupid. I'll change that to the strategy you suggested and let you know when the change is made.
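To make the point concrete for myself, here is a toy illustration (not any particular library's code) of why random assignment makes a poor start compared to picking k data points:

import numpy as np

rng = np.random.RandomState(1)

def random_assignment_init(X, k):
    # Assign every point to a random cluster, then take the means: with many
    # points per cluster, all k means end up close to the global mean, and
    # therefore very close to each other.
    labels = rng.randint(k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def forgy_init(X, k):
    # Simply pick k distinct data points as the initial centers.
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

X = rng.rand(10000, 8)
print(random_assignment_init(X, 100).std(axis=0).mean())  # tiny spread
print(forgy_init(X, 100).std(axis=0).mean())              # spread like the data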

I haven't looked into the failure issue yet, but it seems like you've uncovered a pretty grievous bug with your PR. I'm still trying to figure out how the algorithm works at all with the code the way it is, resetting all centroids to DBL_MAX. (Also that behavior wouldn't be specific to Elkan, so maybe there are other things going on here.)

Ryan Curtin | "I love it when a plan comes together." ryan@ratml.org | - Hannibal Smith

rcurtin commented 8 years ago

(Also that behavior wouldn't be specific to Elkan, so maybe there are other things going on here.)

My bad, this is incorrect. I misread the PR.

kno10 commented 8 years ago

Let's move the MLpack specific discussions to https://github.com/mlpack/mlpack/pull/592

Since I've been seeing k-means runtime differences of 100x that are attributable to implementation details other than initialization, it is good enough for my purpose to benchmark each method using a random seed for initialization. Different seeds tend to cause only a factor of about 2x in runtime.

For example, these are three implementations of Hamerly, 11 different seeds each:

ELKI-0.7.1   aloi-hsb8-km-k100-hamerly avg:  8.05636 c: 11 min:   6.27 max:  10.27 med:  8.04
JSAT-0.0.3   aloi-hsb8-km-k100-hamerly avg: 10.5891  c: 11 min:   8.95 max:  12.47 med: 10.32
Hamerly      aloi-hsb8-km-k100-hamerly avg: 15.2518  c: 11 min:  10.08 max:  20.78 med: 14.41

You can see that even with random seeds, the differences are fairly consistent (interestingly, Hamerly's own C implementation comes out slower than my Java implementation of his algorithm).

If we go to Forgy/Lloyd's algorithm, runtimes increase much more:

ELKI-0.7.1   aloi-hsb8-km-k100-lloyd avg:  24.5155 c: 11 min:   15.82 max:   35.91 med:  21.59
HUBM-1.2     aloi-hsb8-km-k100-lloyd avg:  26.3745 c: 11 min:   22.64 max:   31.30 med:  23.56
R-3.2.3      aloi-hsb8-km-k100-lloyd avg:  23.8518 c: 11 min:   15.88 max:   31.46 med:  23.78
JSAT-0.0.3   aloi-hsb8-km-k100-lloyd avg:  31.6209 c: 11 min:   21.71 max:   43.20 med:  29.54
Hamerly      aloi-hsb8-km-k100-lloyd avg:  75.7973 c: 11 min:   38.90 max:  134.50 med:  67.04
Spark-1.6.0  aloi-hsb8-km-k100-lloyd avg:  85.6264 c: 11 min:   64.59 max:  112.18 med:  77.74
Shogun-3.2.0 aloi-hsb8-km-k100-lloyd avg: 137.009  c: 11 min:   98.80 max:  192.14 med: 129.26
SPMF-0.98d   aloi-hsb8-km-k100-lloyd avg: 265.49   c: 11 min:  187.16 max:  358.16 med: 253.49

Shogun used a bad initialization, and did twice as many distance computations as necessary (https://github.com/shogun-toolbox/shogun/issues/2987). I have not checked what SPMF does, nor why Hamerly's Lloyd is so much below par. Within each tool, the exact same seeds are used, if possible.

But initialization (as long as it uses the same strategy) seems to be a rather small factor of 2x that is reasonably captured by simply using the average or median of multiple runs. Rather few tools allow using fixed seeds, so I did not bother to completely solve this (I also used k-means++ where possible, and usually it was marginally better; but with some tools, k-means++ was substantially more expensive).

One key observation is that the differences between different implementations are much larger than expected; and it's not just the random generator.

rcurtin commented 8 years ago

Thanks for opening the bugs in mlpack, and sorry for the slow response on this. I went on vacation last week and now that I am back I am trying to catch up...

One very important statistic that I do not see here is the number of iterations taken for convergence; nor do I see the mean squared error or some other statistic denoting how good the final clustering is.

So it's possible that some implementations could simply be converging much much earlier than other implementations and this is what is to blame for the speed differences. You've already mentioned that you are checking for this condition, but it might be useful to print the number of iterations in the results.

Given all the possible variations in k-means, which we've discussed at length here, I personally think the right way to benchmark k-means is to start from the exact same initial centroids and see which implementations are faster. I feel quite strongly about this, so I am really not willing to change the way the benchmarks are currently run in this project. Back to the original purpose of this ticket: is it possible to set the initial centroids in ELKI? If so, then writing a benchmarking script for ELKI should be straightforward using the information you gave way earlier in this ticket.

kno10 commented 8 years ago

I looked at iteration counts, but these do not have that much of an effect. With the advanced methods such as Elkan and Hamerly, the later iterations get really cheap. I tried looking at the variance sum (MSE, SSQ, RMSE, ...), but different tools compute very different things. Some even only report the sum of distances... I wanted to look at this, but I would have had to modify every single tool, even just to get the output in a usable form and compute the SSQ outside.

ELKI does not yet have a fixed initialization. The best you can do to emulate it (unless you want to add another initialization, which is easy) is to always choose k objects from the data set: shuffle the data set, then use the first-k-objects initialization (see the sketch below). As I know the ELKI code well: it iterates until no point is reassigned anymore (and thus no further iterations will change anything). It does not stop early because of some threshold - this is a feature I would like to eventually add.

As mentioned before: I don't think the initial centroids are that important, as long as they are drawn randomly from the data. I could easily increase the sample size beyond 11, but 11 was good enough to get a quite similar ranking for min, mean, max, and median. You could easily go to 100 runs and probably not see anything different.
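The shuffling workaround, as a sketch (untested; the file names are just placeholders):

import numpy as np

# Shuffle the data set once per seed and write it back out; a tool that uses
# "first k objects" as its initialization then gets a reproducible,
# effectively random-points start.
def write_shuffled(infile, outfile, seed):
    data = np.genfromtxt(infile, delimiter=",")
    rng = np.random.RandomState(seed)
    rng.shuffle(data)  # shuffles rows in place
    np.savetxt(outfile, data, delimiter=",", fmt="%.10g")

for seed in range(11):
    write_shuffled("aloi-hsb-2x2x2-nolabels.csv",
                   "aloi-hsb8-shuffled-%d.csv" % seed, seed)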

rcurtin commented 8 years ago

What you are suggesting and what I am suggesting are basically the same thing, except the variance is lower when you set the centroids and you avoid implementation differences for initialization strategies. I am saying, "use fixed centroids as an initialization strategy" and you are saying "sample points randomly as an initialization strategy" but the problem that you have either way is that the software you are benchmarking must support either the correct initialization strategy or setting the initial centroids.

It's the same problem: not all software supports both (or either). So the choices are: benchmark the libraries that support setting initial centroids, or benchmark the libraries that support the correct initialization scheme. But the problem with the latter is that minor implementation details may mean that some libraries have consistently better initializations than others (yes, if it is implemented correctly, that is not an issue, but I would not trust that every library implements the strategy exactly as documented; accidental bugs are common). And then your benchmarks are less meaningful.

My interest in k-means benchmarking is to determine which libraries can solve the exact same problem faster. It's possible to test other measures: one could benchmark k-means on a dataset with the library's default parameters and plot the SSE vs. the time taken. But I think this is a less useful benchmark, and it would probably take a lot of refactoring to the current benchmarking scripts to support some different benchmarking goal, or alternately we would have to add a whole new set of scripts for k-means to test something slightly different. You could do that---but it would be a lot of work.

kno10 commented 8 years ago

Correction: ELKI does have an initialization with predefined means. Apparently I already added this a long time ago.

-kmeans.initialization PredefinedInitialMeans -kmeans.means 1,2,3;4,5,6

Here, a comma separates columns, a semicolon separates vectors, and of course the dimensions and the number of centers must agree with your data and k.
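For example, generating that string from a matrix of initial centers is a one-liner (a sketch, assuming the centers sit in a NumPy array):

import numpy as np

# Turn a (k x d) array of initial centroids into the format -kmeans.means
# expects: commas between coordinates, semicolons between centers.
def elki_means_string(centers):
    return ";".join(",".join("%.17g" % v for v in c) for c in centers)

centers = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
print(elki_means_string(centers))  # 1,2,3;4,5,6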

rcurtin commented 8 years ago

Ah, okay, great, that makes it easy to add a new benchmarking script. I will see if I can eventually find time for this but I'm pretty overloaded for now. If you can supply a script that would be quicker, but I'll try to get to this when I can. :)