Can one of the admins verify this patch?
@mlpack-jenkins test this
Thanks for the contribution. I think there are multiple solutions for the issue, and before we merge this in, I think it would be a good idea to talk about each.
I can run the KMeans benchmark either with a single dataset or with two datasets where the second dataset is the centroids file. Here is an example of the corresponding config:
- files: [ ['datasets/waveform.csv', 'datasets/waveform_centroids.csv'] ]
Running the KMeans benchmark on a single dataset without a centroids file is ambiguous, or let me put it this way: it's not consistent. For the shogun script we do:
- files: [ ['datasets/1000000-10-randu.csv'] ]
for weka we do:
- files: ['datasets/1000000-10-randu.csv']
which fails if the script expects a list as input. I think one solution would be to require a list even if we run the benchmark on a single dataset, or alternatively to handle both cases inside the script, e.g. by checking isinstance(dataset, list). Let me know what you think.
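For illustration, here is a minimal sketch of the second option; the helper name and the call site are made up, not the benchmark scripts' actual API:

```python
def normalize_files_entry(files_entry):
  # Hypothetical helper: accept both a bare path ('datasets/one.csv') and a
  # list of paths (['datasets/one.csv', 'datasets/one_centroids.csv']) so a
  # single-dataset benchmark does not break a script that expects a list.
  if isinstance(files_entry, str):
    return [files_entry]
  return list(files_entry)

# Usage: both forms yield a list of dataset paths.
print(normalize_files_entry('datasets/1000000-10-randu.csv'))
print(normalize_files_entry(['datasets/waveform.csv',
                             'datasets/waveform_centroids.csv']))
```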
On this one I would say that we should definitely be starting the k-means simulations all from the same initial centroids, otherwise we are not testing the implementation. There was another discussion in these issues some time back about this... I guess if you wanted you could read that, but it was a lot of words... In any case, my interest in benchmarking k-means is primarily in benchmarking different implementations of Lloyd iterations, not the algorithm itself, since from the same initial centroids it always converges to the same place. As a result I would say it's important that we start all k-means runs from the same initial centroids, so that the calculation takes the same number of iterations and converges to the same results.
That said, incorporating this fix anyway, for anyone who wanted to do k-means benchmarking differently, would be completely the right thing to do in my opinion.
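As an aside, here is a minimal sketch of what running from fixed initial centroids looks like; scikit-learn and the file names are just for illustration, not necessarily what the benchmark scripts use:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed file names; the centroids file holds one initial centroid per row.
data = np.genfromtxt('datasets/waveform.csv', delimiter=',')
centroids = np.genfromtxt('datasets/waveform_centroids.csv', delimiter=',')

# Passing the shared centroids as 'init' with n_init=1 means every run
# performs the same Lloyd iterations, so timing differences reflect the
# implementation rather than a different convergence path.
model = KMeans(n_clusters=centroids.shape[0], init=centroids, n_init=1)
model.fit(data)
print(model.n_iter_, model.inertia_)
```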
Completely agree with you: for the KMeans benchmark it's essential to use initial centroids. But let me generalize my question: if we benchmark a method that can be run with 1, 2, 3, ... datasets, how do we define those datasets? For >= 2 we use:
- files: [ ['datasets/one.csv', 'datasets/two.csv'] ]
for a single dataset we can use:
- files: [ ['datasets/one.csv'] ]
or
- files: ['datasets/one.csv']
or both? I'm fine with any of these solutions, but I think we should agree on one representation and use it everywhere.
Ah, I see what you mean. I could go either way, but I think the former ([ ['dataset'] ]) is the better way to go, since it then looks the same as when you pass multiple datasets.
I'm with @rcurtin too; the - files: [ ['datasets/one.csv'] ] form matches the data structure of the other cases with centroids. Keeping it simple and consistent across all cases is best for maintainability.
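To make that convention easy to keep, a check along these lines could be run over the parsed config; the function name and the assumption that the 'files' value has already been extracted from config.yaml are mine:

```python
def check_files_entry(files_entry):
  # Enforce the agreed representation: 'files' is a list of dataset groups,
  # and each group is itself a list, even for a single dataset, e.g.
  #   files: [ ['datasets/one.csv'] ]
  if not isinstance(files_entry, list):
    raise ValueError("'files' must be a list of dataset groups")
  for group in files_entry:
    if not isinstance(group, list):
      raise ValueError("dataset group must be a list, got: %r" % (group,))

# These match the convention ...
check_files_entry([['datasets/one.csv']])
check_files_entry([['datasets/one.csv', 'datasets/two.csv']])
# ... and this one would raise, since the group is a bare string:
# check_files_entry(['datasets/one.csv'])
```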
In practice that would mean reverting 81346dfe5f73e7f9a6a57246cf11763d6cae905a and changing config.yaml, right?
I agree, consistency is important, and you are right: we should revert 81346df and adjust the config file instead.
Thanks again for the great contribution.
Stack trace that occurs without this patch:
benchmark affected: 1000000-10-randu