mlpack / benchmarks

Machine Learning Benchmark Scripts

[WIP] Introduce Dockerfile #135

p16i opened 5 years ago

p16i commented 5 years ago

This is a first Dockerfile that aims to make the system more portable and easier to run, addressing #133.

The Dockerfile is structured so that the benchmark environment is built once, while the library and method to benchmark are selected at run time (see the run commands below).
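For orientation, here is a minimal hypothetical sketch of what such a Dockerfile could look like. It is not the file from this PR: the base image, the installed packages, and the make run entrypoint are assumptions, and only the /usr/src/benchmarks path is taken from the docker run commands below.

# Hypothetical sketch only; see the diff for the actual Dockerfile.
FROM ubuntu:18.04

# Assumed dependencies; the real image installs each benchmarked library.
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/benchmarks
COPY . .

# Arguments passed to `docker run` (e.g. BLOCK=shogun METHODBLOCK=KMEANS)
# are appended to the entrypoint, yielding `make run BLOCK=... METHODBLOCK=...`.
ENTRYPOINT ["make", "run"]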

This image is built using a modified config.yaml. In particular, Shogun's KMEANS and DTC sections are:

library: shogun
methods:
    KMEANS:
        run: ['metric']
        iteration: 3
        script: methods/shogun/kmeans.py
        format: [arff, csv, txt]
        datasets:
            - files: [ ['datasets/waveform.csv', 'datasets/waveform_centroids.csv'] ]
              options:
                clusters: 2

            - files: [ ['datasets/wine.csv', 'datasets/wine_centroids.csv'],
                       ['datasets/iris.csv', 'datasets/iris_centroids.csv'] ]
              options:
                clusters: 3

    DTC:
        run: ['timing', 'metric']
        script: methods/shogun/decision_tree.py
        format: [csv, txt, arff]
        datasets:
            - files: [ ['datasets/iris_train.csv', 'datasets/iris_test.csv', 'datasets/iris_labels.csv'] ]

Results from example executions:

Assume the relevant datasets have already been downloaded via make datasets. The image is then built with docker build -t benchmark . (note the trailing dot for the build context).

> docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark BLOCK=shogun METHODBLOCK=KMEANS
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: KMEANS
[INFO ] Options: {'clusters': 2}
[INFO ] Library: shogun
[INFO ] Dataset: waveform

           mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
waveform        -       -       -     -  0.022962     -     -     -       -

[INFO ] Options: {'clusters': 3}
[INFO ] Library: shogun
[INFO ] Dataset: wine
[INFO ] Dataset: iris

       mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
wine        -       -       -     -  0.000771     -     -     -       -
iris        -       -       -     -  0.000620     -     -     -       -

[INFO ] Options: {'clusters': 5}
[INFO ] Options: {'clusters': 6}
[INFO ] Options: {'clusters': 7}
[INFO ] Options: {'clusters': 26}
[INFO ] Options: {'clusters': 10}
[INFO ] Options: {'clusters': 75}
[INFO ] Options: {'centroids': 75}
> docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark BLOCK=shogun METHODBLOCK=DTC
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: DTC
[INFO ] Options: None
[INFO ] Library: shogun
[INFO ] Dataset: iris

       mlpack  matlab  scikit    shogun  weka  milk  R
iris        -       -       -  0.000817     -     -  -

Could you please give me feedback or comments? Meanwhile, I will add more libraries to the image.


zoq commented 5 years ago

This looks good to me. I'm wondering if it's possible to somehow generate a list of each package/version. Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?
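For instance (hypothetical, assuming the image is Debian-based and ships pip), the list could be dumped directly from the built image:

# Assumption: Debian-based image with pip3 available.
docker run --rm --entrypoint /bin/sh benchmark -c 'dpkg -l'       # system packages
docker run --rm --entrypoint /bin/sh benchmark -c 'pip3 freeze'   # Python packages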

p16i commented 5 years ago

hi, thanks for the comment.

I'm wondering if it's possible to somehow generate a list of each package/version.

Do you mean having a Docker image for each library?

Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?

Yes, the dataset directory is 2GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700MB. I'm not sure whether it's worth it.
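For reference, the archive can be created and measured with:

tar -czf datasets.tar.gz datasets/   # compress the dataset directory
du -h datasets.tar.gz                # check the resulting size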

What do you think?

zoq commented 5 years ago

Do you mean having a Docker image for each library?

Yeah, I'm not sure that's something we should do, since in that case we would have to update not only the Dockerfile that contains all the libraries but each single-library one as well.

Yes, the dataset directory is 2GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700MB. I'm not sure whether it's worth it.

What do you think?

I see. I'd like to keep it as simple as possible, and sharing the dataset folder might not be the easiest solution. Can you think of anything else we could do? Perhaps 700MB isn't that bad?

p16i commented 5 years ago

Before investigating further, may I ask how you plan to run this container? And why do you think sharing the dataset directory isn't the easiest approach?

zoq commented 5 years ago

The easiest for me would be to have something that runs out of the box: docker run is all I need. What if we provide one Docker image that includes the datasets and another one without? What do you think?
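Hypothetically, the dataset variant could simply layer the data on top of the base image, e.g.:

# Sketch: bake the datasets into a second image on top of 'benchmark'.
cat > Dockerfile.datasets <<'EOF'
FROM benchmark
COPY datasets/ /usr/src/benchmarks/datasets/
EOF
docker build -t benchmark-datasets -f Dockerfile.datasets .

# No volume mount needed anymore:
docker run benchmark-datasets BLOCK=shogun METHODBLOCK=KMEANS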