openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Problems caused by different sklearn versions #49

Closed jernsting closed 4 years ago

jernsting commented 5 years ago

I am currently trying to implement a new AutoML approach, but it uses scikit-learn v0.21.3. If I try to run the benchmark, I run into errors because automlbenchmark requires scikit-learn v0.19.2.

Is there anything I can do to work around this?

PGijsbers commented 5 years ago

AutoML framework dependencies should be settable independently of the benchmark dependencies, as each is set up in its own environment (e.g. TPOT uses scikit-learn==0.18.1). See also the shared setup script.

If you could share the code you are trying out, it would be easier to figure out what is going on. Make sure the dependencies are specified in a requirements.txt in the framework folder.
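
For illustration (a sketch based on the setup described here, not copied from the repository), a framework folder such as frameworks/photon/ would carry its own requirements.txt pinning, say, scikit-learn==0.21.3, plus a setup.sh along these lines; $PIP stands for the pip of the environment the benchmark prepares, as in the generated Dockerfile quoted further down in this thread:

#!/usr/bin/env bash
# Sketch of a hypothetical frameworks/photon/setup.sh: install the framework's
# own pinned dependencies via the pip the benchmark environment provides.
HERE=$(dirname "$0")
$PIP install --no-cache-dir -r "$HERE/requirements.txt"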

@seb-h2o please correct me if my understanding of the setup/installation is outdated.

jernsting commented 5 years ago

I chose to use Docker (which itself leads to an error if the image is not already available on Docker Hub) and tried to use the generated Dockerfile:


# Assumed base image (the paste omits the FROM line); ubuntu:18.04 matches the python3.6 paths in the error below.
FROM ubuntu:18.04

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update
RUN apt-get -y install apt-utils dialog locales
RUN apt-get -y install curl wget unzip git
RUN apt-get -y install python3 python3-pip python3-venv
RUN pip3 install -U pip

# We create a virtual environment so that AutoML systems may use their preferred versions of 
# packages that we need for data pre- and post-processing without breaking it.
ENV PIP /venvs/bench/bin/pip3
ENV PY /venvs/bench/bin/python3 -W ignore
ENV SPIP pip3
ENV SPY python3

# Enforce UTF-8 encoding
ENV PYTHONUTF8 1
ENV PYTHONIOENCODING utf-8
# RUN locale-gen en-US.UTF-8
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

RUN $SPY -m venv /venvs/bench
RUN $PIP install -U pip==19.0.3

WORKDIR /bench
VOLUME /input
VOLUME /output
VOLUME /custom

# Add the AutoML system except files listed in .dockerignore (could also use git clone directly?)
ADD . /bench/

RUN xargs -L 1 $PIP install --no-cache-dir < requirements.txt

RUN frameworks/photon/setup.sh

# https://docs.docker.com/engine/reference/builder/#entrypoint
ENTRYPOINT ["/bin/bash", "-c", "$PY runbenchmark.py $0 $*"]
CMD ["photon", "test"]

I copied the setup.sh from the autosklearn directory and updated the requirements.txt:

scikit-learn==0.21.3

When I use the Dockerfile to build the container, the build succeeds, but when I try to run it I get this error:

  File "runbenchmark.py", line 7, in <module>
    import automl.logger
  File "/bench/automl/__init__.py", line 8, in <module>
    from .benchmark import Benchmark
  File "/bench/automl/benchmark.py", line 19, in <module>
    from .openml import Openml
  File "/bench/automl/openml.py", line 9, in <module>
    import openml as oml
  File "/venvs/bench/lib/python3.6/site-packages/openml/__init__.py", line 22, in <module>
    from . import runs
  File "/venvs/bench/lib/python3.6/site-packages/openml/runs/__init__.py", line 3, in <module>
    from .functions import (run_model_on_task, run_flow_on_task, get_run, list_runs,
  File "/venvs/bench/lib/python3.6/site-packages/openml/runs/functions.py", line 21, in <module>
    from ..flows import sklearn_to_flow, get_flow, flow_exists, _check_n_jobs, \
  File "/venvs/bench/lib/python3.6/site-packages/openml/flows/__init__.py", line 3, in <module>
    from .sklearn_converter import sklearn_to_flow, flow_to_sklearn, _check_n_jobs
  File "/venvs/bench/lib/python3.6/site-packages/openml/flows/sklearn_converter.py", line 20, in <module>
    from sklearn.utils.fixes import signature
ImportError: cannot import name 'signature' 

I began digging deeper and ran the following:

>>> sklearn.__version__
'0.19.2'
>>> from sklearn.utils.fixes import signature
>>> 

And with sklearn 0.21.3:

>>> sklearn.__version__
'0.21.3'
>>> from sklearn.utils.fixes import signature
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'signature'
>>> 

So it seems that the benchmark is using the same version of sklearn as my framework.
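
(For reference, the shim that fails here was, on Python 3, just a re-export of the standard library's inspect.signature, kept around for Python 2 support; scikit-learn 0.21 removed it, so code that needs it has to fall back to the standard library. A minimal compatibility sketch, not code from the benchmark:)

try:
    # Works on older scikit-learn (e.g. the 0.19.2 shown above).
    from sklearn.utils.fixes import signature
except ImportError:
    # scikit-learn 0.21+ dropped the shim; the standard library provides it.
    from inspect import signature

def parameter_names(func):
    # Same behaviour with either import: list a callable's parameter names.
    return list(signature(func).parameters)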

sebhrusen commented 5 years ago

@PGijsbers yes, it's a bit outdated :) The virtual env is currently used by both the benchmark app and the framework itself (the app just avoids changing the system Python env). That was good enough for the paper, but it creates strong limitations for new frameworks, as @jernsting noticed. Today, even when frameworks like TPOT run in a subprocess, that subprocess is just a fork: it uses the same Python interpreter, which allows modifying OS environment variables, but it still shares resources and makes it easy to share objects in memory (data, config).

For complete isolation we need the framework to run in its own virtual env, but that requires spawning a completely fresh process, and sharing objects becomes more complicated. I did an experiment with https://github.com/openml/automlbenchmark/tree/master/examples/custom/RandomForest_standalone: it works, but I was not fully convinced to integrate it. A different approach relying on multiprocessing.Manager would probably be easier to use.
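
A rough sketch of what that complete isolation looks like (illustrative only; the file names, paths and the exec.py entry point are assumptions, not the code from the example linked below):

import json
import subprocess

def run_in_framework_venv(task, framework_dir):
    # Serialize whatever the framework needs (dataset paths, config, time budget);
    # nothing can be shared in memory with a freshly spawned interpreter.
    with open("/tmp/task.json", "w") as fh:
        json.dump(task, fh)
    # Spawn the framework's own virtual-env interpreter as a brand new process.
    subprocess.run(
        [f"{framework_dir}/venv/bin/python", f"{framework_dir}/exec.py",
         "/tmp/task.json", "/tmp/result.json"],
        check=True,
    )
    # Results come back through a file as well.
    with open("/tmp/result.json") as fh:
        return json.load(fh)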

@jernsting until this is fully integrated into the app, you can still have a look at https://github.com/openml/automlbenchmark/tree/master/examples/custom/RandomForest_standalone for an example where the app and the framework run in completely independent virtual envs. There's much more code involved, though: I want to abstract this to make it much easier to use, but it's still on the waiting list.

jernsting commented 5 years ago

As far as I know, it is possible to run scripts inside a Python Docker container, as mentioned on Docker Hub (https://hub.docker.com/_/python). Wouldn't it be a better approach to generate a Python file that runs the benchmark and writes the results to a directory mounted into the container? Then there would be complete isolation between the benchmark script itself and the frameworks.

@seb-h2o you are right, this implies that the objects (the folds, for example) could not be accessed directly through Python. But serializing the folds and generating a script for each classifier based on them would be better for isolation.
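
A rough sketch of that idea (the paths, file layout and the fit_predict entry point are hypothetical, just to make the shape concrete): the benchmark serializes the folds once, generates a small script per framework, and each container only reads /input and writes /output.

import pickle

def export_folds(folds, path="/input/folds.pkl"):
    # Dump the folds once so every framework container can read them
    # without importing anything from the benchmark's environment.
    with open(path, "wb") as fh:
        pickle.dump(folds, fh)

def generate_runner_script(framework_module, path="/input/run_framework.py"):
    # The generated script runs inside the framework's own container/venv,
    # so it is free to use whatever scikit-learn version it ships with.
    with open(path, "w") as fh:
        fh.write(
            "import pickle\n"
            f"from {framework_module} import fit_predict  # hypothetical entry point\n"
            "folds = pickle.load(open('/input/folds.pkl', 'rb'))\n"
            "results = [fit_predict(train, test) for train, test in folds]\n"
            "pickle.dump(results, open('/output/results.pkl', 'wb'))\n"
        )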

sebhrusen commented 4 years ago

Closing this ticket: this is addressed by PR https://github.com/openml/automlbenchmark/pull/60.