openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Allow duplicate objects in Pipeline and ColumnTransformer #638

Open PGijsbers opened 5 years ago

PGijsbers commented 5 years ago

Currently, neither a Pipeline nor a ColumnTransformer can be converted to a flow if it contains two different steps with the same type of transformer. I think this should be allowed.

Consider a scenario where I have a dataset with numeric and categorical values (e.g. features 0 and 1, respectively) and wish to impute them with different imputation strategies. I would use the following code (with openml at the head of the develop branch):

import openml
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
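# Assume a dataset with feature 0 being numeric, and feature 1 being nominal
# Both ColumnTransformer steps use the same transformer class (SimpleImputer),
# each with its own strategy and column selection.
# Note: 'most_frequent' is the usual strategy for nominal data; 'median' is
# kept here so the code matches the traceback below.
pipeline = Pipeline(
    [('preprocessing', ColumnTransformer(
        [('impute_numeric', SimpleImputer(strategy='mean'), [0]),
         ('impute_categorical', SimpleImputer(strategy='median'), [1])])),
     ('classifier', DecisionTreeClassifier())])
# Converting the pipeline to an OpenML flow is the call that fails
openml.flows.sklearn_to_flow(pipeline)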

I would assume this should work, but it raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 404, in _serialize_model
    _extract_information_from_model(model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 512, in _extract_information_from_model
    rval = sklearn_to_flow(v, model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 408, in _serialize_model
    _check_multiple_occurence_of_component_in_flow(model, subcomponents)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 490, in _check_multiple_occurence_of_component_in_flow
    'trying to serialize %s.' % (visitee.name, model))
ValueError: Found a second occurence of component sklearn.impute.SimpleImputer when trying to serialize ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('impute_numeric', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0), [0]), ('impute_categorical', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0), [1])]).

Similarly, an error is raised if a pipeline contains two steps of the same type.
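
Judging from the traceback, the converter seems to reject any component class that occurs more than once anywhere in the nested model. The following is only a rough pure-Python sketch of that kind of check, not the actual openml-python implementation: the function `find_duplicate_components` and the `(type_name, children)` tuple layout are hypothetical stand-ins for a real pipeline.

```python
from collections import Counter


def find_duplicate_components(spec):
    """Return the component type names that occur more than once.

    `spec` is a hypothetical stand-in for a nested model: a tuple of
    (type_name, list_of_child_specs). It is not an OpenML or sklearn type.
    """
    counts = Counter()

    def visit(node):
        type_name, children = node
        counts[type_name] += 1
        for child in children:
            visit(child)

    visit(spec)
    return sorted(name for name, n in counts.items() if n > 1)


# Mirrors the shape of the pipeline above: two imputers inside one
# ColumnTransformer, followed by a classifier.
pipeline_spec = (
    "Pipeline",
    [
        ("ColumnTransformer", [("SimpleImputer", []), ("SimpleImputer", [])]),
        ("DecisionTreeClassifier", []),
    ],
)

duplicates = find_duplicate_components(pipeline_spec)
print(duplicates)
```

Because the sketch counts by type name only, the step names ('impute_numeric', 'impute_categorical') play no role, which matches the behaviour seen in the error above.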

What is the reason this error is raised? Is it simply not yet supported? Or should I be ordering my workflow differently, and if so, how?

janvanrijn commented 5 years ago

This is a limitation of the OpenML flow definition, which dates from the early days of OpenML (2012). There is currently no uniform way to specify which instance of a subflow a hyperparameter setting in a run belongs to, so having multiple instances of the same subflow in a complex flow does not allow for reproducible research.
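
To make the ambiguity concrete, here is a minimal sketch that assumes hyperparameter settings are stored under a key built from the component type and the parameter name. That storage layout and the function `flatten_settings` are assumptions made for illustration; they are not the actual OpenML schema.

```python
def flatten_settings(components):
    """Collect hyperparameters under '<component type>.<parameter>' keys.

    `components` is a hypothetical list of (type_name, params) pairs.
    With keys based only on the type name, a second instance of the same
    type silently overwrites the settings of the first.
    """
    settings = {}
    for type_name, params in components:
        for key, value in params.items():
            settings[f"{type_name}.{key}"] = value
    return settings


components = [
    ("SimpleImputer", {"strategy": "mean"}),
    ("SimpleImputer", {"strategy": "median"}),
    ("DecisionTreeClassifier", {"max_depth": None}),
]

settings = flatten_settings(components)
print(settings)
```

Under this assumption, only one `SimpleImputer.strategy` entry survives, so the original pipeline cannot be rebuilt from the stored settings. Any fix would need an identifier that distinguishes the two instances.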

It has been on the agenda to improve this on the server side; however, no one has started programming or testing alternatives.

PGijsbers commented 5 years ago

Thanks, that clarifies a lot. Does it make sense to leave this issue open, given that it will go unresolved? Or should I close it, as 'we' on the package side cannot fix this until the definitions are updated?

mfeurer commented 5 years ago

I think closing and referencing the corresponding issue on the OpenML issue tracker is the way to go here: https://github.com/openml/OpenML/issues/340

mfeurer commented 5 years ago

Reopening to show that this is a known issue.

PGijsbers commented 5 years ago

Marked it as wontfix because we won't (can't) fix this until we rework the flow definition.