mlpack / benchmarks

Machine Learning Benchmark Scripts

Add ELKI to benchmarks #117

Closed by kno10 6 years ago

kno10 commented 6 years ago

Resolves #9

mlpack-jenkins commented 6 years ago

Can one of the admins verify this patch?

rcurtin commented 6 years ago

@mlpack-jenkins test this please

kno10 commented 6 years ago

This uses the predefined centroids now, except for isolet, where they are too large to pass via the command line as far as I can tell.
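
For reference, a minimal sketch of how such a call could be built from Python, in the style of the other benchmark wrappers. The ELKI option names `-kmeans.initialization`, `PredefinedInitialMeans`, and `-kmeans.means`, as well as the centroid string format, are assumptions that should be checked against the ELKI version in use:

```python
# Sketch: shelling out to ELKI's CLI with predefined centroids.
# ASSUMPTIONS: the -kmeans.initialization / -kmeans.means option names and
# the "dims comma-separated, centroids semicolon-separated" string format
# must be verified against the ELKI version actually installed.
import subprocess

def run_elki_kmeans(jar, dataset, k, centroids):
    # Serialize centroids as "c11,c12,...;c21,c22,...".
    means = ";".join(",".join(str(v) for v in c) for c in centroids)
    cmd = ["java", "-jar", jar, "KDDCLIApplication",
           "-dbc.in", dataset,
           "-algorithm", "clustering.kmeans.KMeansLloyd",
           "-kmeans.k", str(k),
           "-kmeans.initialization", "PredefinedInitialMeans",
           "-kmeans.means", means]
    # For high-dimensional data like isolet, `means` can exceed the OS
    # command-line length limit -- the failure mode described above.
    return subprocess.run(cmd, capture_output=True, text=True)
```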

I have added some bug reports for other issues I noticed.

It would be better if we got more detailed error reporting. Sometimes I would just get the error "[FATAL] Can't parse the data: wrong format" with no indication of where it comes from, or "[FATAL] Could not execute command: ..." without any readable error message (the latter on the isolet data with centroids; I could not get that combination to work).

rcurtin commented 6 years ago

@kno10: thanks again. When I have a chance I'll try to set up the ELKI build on our benchmarking infrastructure and get the jobs to run. Because the benchmarks take a long time to run, it could be a while until we have results (and I'm a bit busy with other things, so it may also be a while before I can get everything working right!).

kno10 commented 6 years ago

No need to hurry with rerunning things. There are other TODOs worth including first: getting Weka to use the initial centroids, maybe some version updates, the k-means variants of Greg Hamerly (https://github.com/ghamerly/fast-kmeans), etc. There was also one data set where the initial centroids were too large to pass to ELKI on the command line, so a class would be needed to load them from a file instead.

One thing that must be included before rerunning is R's kmeans, because the kmeans in R is pretty good (Hartigan-Wong) and is likely the most widely used k-means implementation.
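
A minimal sketch of what such a wrapper could look like, shelling out to Rscript as the other benchmark wrappers shell out to their tools. The helper name `run_r_kmeans`, the CSV layout (numeric columns, no header), and Rscript being on the PATH are assumptions; `kmeans(..., algorithm = "Hartigan-Wong")` and `fit$tot.withinss` are standard R:

```python
# Sketch: timing R's kmeans (Hartigan-Wong) via a generated Rscript.
# ASSUMPTIONS: dataset is a headerless numeric CSV; Rscript is installed.
import subprocess
import tempfile
import textwrap

def run_r_kmeans(dataset_csv, k, max_iter=1000):
    script = textwrap.dedent("""
        args <- commandArgs(trailingOnly = TRUE)
        x <- read.csv(args[1], header = FALSE)
        k <- as.integer(args[2])
        t <- system.time(
          fit <- kmeans(x, centers = k, iter.max = %d,
                        algorithm = "Hartigan-Wong"))
        # Print elapsed time and final within-cluster sum of squares.
        cat(t["elapsed"], fit$tot.withinss, "\\n")
    """ % max_iter)
    with tempfile.NamedTemporaryFile("w", suffix=".R", delete=False) as f:
        f.write(script)
        path = f.name
    out = subprocess.run(["Rscript", path, dataset_csv, str(k)],
                         capture_output=True, text=True)
    elapsed, sse = map(float, out.stdout.split())
    return elapsed, sse
```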

For ELKI, it would be fair to add a truncated PCA: currently it always performs the full PCA and then filters (which is interesting if you have different filtering strategies, not just top-k, but does more work than necessary here). One could probably also write a NN-join benchmark in ELKI using different indexing strategies.
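
To illustrate the distinction, a sketch in plain numpy/scipy (nothing ELKI-specific): the first function does the full decomposition and then filters to the top-k components, the second computes only k components to begin with:

```python
# Sketch: full PCA + top-k filtering vs. a truncated decomposition.
import numpy as np
from scipy.sparse.linalg import svds

def pca_full_then_filter(X, k):
    Xc = X - X.mean(axis=0)
    # Full eigendecomposition of the d x d covariance matrix: O(d^3),
    # regardless of how many components we actually keep.
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:k]  # keep only the top-k components
    return Xc @ vecs[:, order]

def pca_truncated(X, k):
    Xc = X - X.mean(axis=0)
    # Truncated SVD computes only k singular triples; much cheaper when
    # k << d, but iterative and therefore mildly approximate.
    U, s, Vt = svds(Xc, k=k)
    return Xc @ Vt.T
```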

In the long run, it would probably also be worth evaluating result quality. E.g., with k-means, what is the final SSE; with PCA, how much variance was retained? In particular, some methods may have numerical issues, may stop too early (which makes them appear fast), may be approximate (some truncated PCA strategies in particular, although there isn't really an "exact" solution there), or may simply have bugs.
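
A sketch of the two quality metrics suggested above, in plain numpy; the function names are hypothetical, and `components` is assumed to be an orthonormal d x k matrix:

```python
# Sketch: quality metrics for benchmark results, not just runtimes.
import numpy as np

def kmeans_sse(X, centroids, labels):
    # Final SSE: sum of squared distances of each point to its centroid.
    diffs = X - centroids[labels]
    return float(np.einsum("ij,ij->", diffs, diffs))

def pca_retained_variance(X, components):
    # Fraction of total variance captured by projecting onto the
    # (assumed orthonormal) d x k component matrix.
    Xc = X - X.mean(axis=0)
    total = np.sum(Xc * Xc)
    proj = Xc @ components
    return float(np.sum(proj * proj) / total)
```

Comparing these numbers across libraries would expose exactly the failure modes listed above: a method that stops too early shows a worse SSE, and an approximate truncated PCA shows slightly less retained variance than the full decomposition.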