Can one of the admins verify this patch?
@mlpack-jenkins test this
Thanks for the contribution. I think there are multiple solutions for the issue, and before we merge this in, I think it would be a good idea to talk about each.
I can run the KMeans benchmark either with a single dataset or with two datasets where the second dataset is the centroids file. Here is an example of the corresponding config:
- files: [ ['datasets/waveform.csv', 'datasets/waveform_centroids.csv'] ]
Running the KMeans benchmark on a single dataset without a centroids file is ambiguous, or let me put it this way: it's not consistent. For the shogun script we do:
- files: [ ['datasets/1000000-10-randu.csv'] ]
for weka we do:
- files: ['datasets/1000000-10-randu.csv']
which fails if the script expects a list as input. I think one solution would be to require a list even if we run the benchmark on a single dataset, or alternatively to handle both cases inside the script, e.g. by checking isinstance(dataset, list). Let me know what you think.
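For illustration, here is a minimal sketch of the second option; the helper name and the call site are made up, not the benchmark scripts' actual API:

```python
def normalize_files_entry(files_entry):
  # Hypothetical helper: accept both a bare path ('datasets/one.csv') and a
  # list of paths (['datasets/one.csv', 'datasets/one_centroids.csv']) so a
  # single-dataset benchmark does not break a script that expects a list.
  if isinstance(files_entry, str):
    return [files_entry]
  return list(files_entry)

# Usage: both forms yield a list of dataset paths.
print(normalize_files_entry('datasets/1000000-10-randu.csv'))
print(normalize_files_entry(['datasets/waveform.csv',
                             'datasets/waveform_centroids.csv']))
```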
On this one I would say that we should definitely be starting the k-means simulations all from the same initial centroids, otherwise we are not testing the implementation. There was another discussion in these issues some time back about this... I guess if you wanted you could read that, but it was a lot of words... In any case, my interest in benchmarking k-means is primarily in benchmarking different implementations of Lloyd iterations, not the algorithm itself, since from the same initial centroids it always converges to the same place. As a result I would say it's important that we start all k-means runs from the same initial centroids, so that the calculation takes the same number of iterations and converges to the same results.
That said, incorporating this fix anyway, for anyone who wanted to do k-means benchmarking differently, would be completely the right thing to do in my opinion.
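As an aside, here is a minimal sketch of what running from fixed initial centroids looks like; scikit-learn and the file names are just for illustration, not necessarily what the benchmark scripts use:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed file names; the centroids file holds one initial centroid per row.
data = np.genfromtxt('datasets/waveform.csv', delimiter=',')
centroids = np.genfromtxt('datasets/waveform_centroids.csv', delimiter=',')

# Passing the shared centroids as 'init' with n_init=1 means every run
# performs the same Lloyd iterations, so timing differences reflect the
# implementation rather than a different convergence path.
model = KMeans(n_clusters=centroids.shape[0], init=centroids, n_init=1)
model.fit(data)
print(model.n_iter_, model.inertia_)
```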
Completely agree with you: for the KMeans benchmark it's essential to use initial centroids. But let me generalize my question: if we benchmark a method that can be run with 1, 2, 3, ... datasets, how do we define those datasets? For >= 2 we use:
- files: [ ['datasets/one.csv', 'datasets/two.csv'] ]
for a single dataset we can use:
- files: [ ['datasets/one.csv'] ]
or
- files: ['datasets/one.csv']
or both? I'm fine with any of these solutions, but I think we should agree on one representation and use it everywhere.
Ah, I see what you mean. I could go either way, but I think the former ([ ['dataset'] ]) is the better way to go, since it then looks the same as when you pass multiple datasets.
I'm with @rcurtin too; the - files: [ ['datasets/one.csv'] ] form matches the data structure of the other cases with centroids. Keeping it simple and consistent across all cases is best for maintainability.
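To make that convention easy to keep, a check along these lines could be run over the parsed config; the function name and the assumption that the 'files' value has already been extracted from config.yaml are mine:

```python
def check_files_entry(files_entry):
  # Enforce the agreed representation: 'files' is a list of dataset groups,
  # and each group is itself a list, even for a single dataset, e.g.
  #   files: [ ['datasets/one.csv'] ]
  if not isinstance(files_entry, list):
    raise ValueError("'files' must be a list of dataset groups")
  for group in files_entry:
    if not isinstance(group, list):
      raise ValueError("dataset group must be a list, got: %r" % (group,))

# These match the convention ...
check_files_entry([['datasets/one.csv']])
check_files_entry([['datasets/one.csv', 'datasets/two.csv']])
# ... and this one would raise, since the group is a bare string:
# check_files_entry(['datasets/one.csv'])
```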
In practice that would mean reverting 81346dfe5f73e7f9a6a57246cf11763d6cae905a and changing config.yaml, right?
I agree, consistency is important, and you are right: we should revert 81346df and adjust the config file instead.
Thanks again for the great contribution.
Stack trace that occurs without this patch:
benchmark affected: 1000000-10-randu