Some of the scikit learn pipeline flows have wrong component order

openml / openml-python

OpenML's Python API for a World of Data and More 💫

http://openml.github.io/openml-python/

Other

280 stars 144 forks source link

Some of the scikit learn pipeline flows have wrong component order #476

Open dbehrouz opened 6 years ago

dbehrouz commented 6 years ago

Description

Some of the sklearn pipeline flows have the wrong component order. For example flow ids: {6969, 8330, 7694} where, in the 'component OrderedDict` the mode is the first component.

Steps/Code to Reproduce

Example:

import openml
flow = openml.flows.get_flow(6969)
' ===> '.join([c for c in flow.components.keys()])

Expected Results

u'imputation ===> hotencoding ===> variencethreshold ===> classifier '

Actual Results

u'classifier ===> imputation ===> hotencoding ===> variencethreshold'

Versions

Darwin-17.5.0-x86_64-i386-64bit ('Python', '2.7.15 (default, May 28 2018, 14:20:45) \n[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)]') ('NumPy', '1.13.3') ('SciPy', '1.0.0') ('Scikit-Learn', '0.18.1') ('OpenML', '0.7.0')

janvanrijn commented 6 years ago

Thanks for reporting this.

The goal of the flow / setup objects is to give a serialization of a Machine Learning object (from weka, scikit-learn, mlr) on OpenML.

Using the following code (not checked, might contain small typo's) you can deserialize it:

import openml
flow = openml.flows.get_flow(6969)
pipeline = openml.flows.flow_to_sklearn(flow)
print(pipeline.get_params()['steps'])

If the order of components is also wrong there we have a serious problem.

@mfeurer what do you think about this?

dbehrouz commented 6 years ago

@janvanrijn I just checked and the order after transforming the flow to sklearn is fine. It seems only the flow object may contain the wrong order. This is not a critical issue, but I just thought I should mention it since I was using the 'flow.components' to print the components of the flow.

janvanrijn commented 6 years ago

Thanks for checking this.

@mfeurer what do you think, should we do something about this?

mfeurer commented 6 years ago

Thanks for reporting this issue. I won't have time this week, but I will work on the OpenML connector next week, so I will also look into the issues you raised @d-behi.

janvanrijn commented 6 years ago

Actually I think this one is due to the way the flow is stored on the server

mfeurer commented 6 years ago

Checking the following piece of code, the connector indeed seems to send data correctly to the server:

import openml

import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

flow = openml.flows.sklearn_to_flow(sklearn.pipeline.Pipeline(
    [
        ('imputation', sklearn.preprocessing.Imputer()),
        ('onehot', sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore',
                                                       categorical_features=None)),
        ('variancethreshold', sklearn.feature_selection.VarianceThreshold()),
        ('classifier', sklearn.svm.SVC())
    ]
))
print(flow._to_xml())
print(' ===> '.join([c for c in flow.components.keys()]))

and the de-serialization despite the component order being wrong. Is it possible to change the way the flow is stored on the server?

janvanrijn commented 6 years ago

Yes and no. Yes, we could potentially change this.

No, I don't think that the order in which the flows are represented in the XML bears meaning, and that we should add that.