Problem with datasets with categorical attributes

janvanrijn commented 1 year ago

The following code crashes when applied to datasets with categorical attributes (it comes from the examples).

@mfeurer @prabhant @PGijsbers

import openml
from sklearn import impute, tree, pipeline

# Define a scikit-learn classifier or pipeline
clf = pipeline.Pipeline(
    steps=[
        ('imputer', impute.SimpleImputer(strategy='constant', fill_value=-1)),
        ('estimator', tree.DecisionTreeClassifier())
    ]
)
openml.config.server = 'https://test.openml.org/api/v1/'
openml.config.apikey = 'removed'

# Download the OpenML task for the german credit card dataset with 10-fold
# cross-validation.
task = openml.tasks.get_task(1) # anneal dataset has categorical atts
# Run the scikit-learn model on the task.
run = openml.runs.run_model_on_task(clf, task)

mfeurer commented 1 year ago

Could you please provide us with the output of

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import openml; print("OpenML", openml.__version__)

so we know the versions of scikit-learn and OpenML-Python?

PGijsbers commented 1 year ago

~+1 for info, cannot reproduce this locally on a fresh install.~ Wait. Are you talking about the scikit-learn error raised by the line run = openml.runs.run_model_on_task(clf, task), i.e. ValueError: could not convert string to float: 'ZS'? That happens because you changed the dataset from the example. The provided scikit-learn pipeline cannot handle string data; it would need an encoder for that.
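A minimal sketch of that failure and of the encoder fix, using only scikit-learn so it runs without the OpenML server. The toy frame, its column names (steel, thick) and its values are illustrative assumptions, not rows from the anneal dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one categorical column and one numeric column with a gap.
X = pd.DataFrame(
    {
        "steel": pd.Series(["ZS", "A", "ZS", "A"], dtype="category"),
        "thick": [0.7, np.nan, 1.2, 0.9],
    }
)
y = np.array([0, 1, 0, 1])

# Same shape as the pipeline from the report: no encoder anywhere.
no_encoder = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value=-1)),
        ("estimator", DecisionTreeClassifier()),
    ]
)

try:
    no_encoder.fit(X, y)
    error = None
except ValueError as exc:  # the category labels cannot be used as floats
    error = exc

# Encode the categorical column, impute only the numeric one.
with_encoder = Pipeline(
    steps=[
        (
            "preprocessing",
            ColumnTransformer(
                [
                    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["steel"]),
                    ("continuous", SimpleImputer(strategy="constant", fill_value=-1), ["thick"]),
                ]
            ),
        ),
        ("estimator", DecisionTreeClassifier()),
    ]
)
predictions = with_encoder.fit(X, y).predict(X)

print(type(error).__name__, "without encoder;", len(predictions), "predictions with encoder")
```

The exact step that raises may differ between scikit-learn versions, which is why the sketch only checks for a ValueError rather than a specific message.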

PGijsbers commented 1 year ago

Sidenote: I noticed that task 32 is not actually credit-g (opened as separate issue #1229).

janvanrijn commented 1 year ago

Here is the version info:

Linux-5.19.0-35-generic-x86_64-with-glibc2.35
Python 3.10.9 (main, Mar  8 2023, 10:47:38) [GCC 11.2.0]
NumPy 1.23.5
SciPy 1.10.0
Scikit-Learn 1.2.2
OpenML 0.13.0

This is indeed the error ValueError: could not convert string to float: 'ZS'. Note that this is not a string value, but a categorical value. AFAIK this is not dataset specific. I had similar issues on the live server with the OpenML-CC18. When I use task 7 on the test server (kr-vs-kp) I get a similar error: ValueError: could not convert string to float: 'f'.
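For context, a small sketch of why the message mentions a string even when the column is categorical. It assumes the data reaches scikit-learn as a pandas column with category dtype whose labels are strings; the values below are made up:

```python
import numpy as np
import pandas as pd

# Toy column; "ZS" and "A" are illustrative labels, not real dataset rows.
steel = pd.Series(["ZS", "A", "ZS"], dtype="category")


def float_cast_error(series):
    """Return the message raised when casting the values to float, or None."""
    try:
        series.to_numpy().astype(float)
    except ValueError as exc:
        return str(exc)
    return None


message = float_cast_error(steel)
codes = steel.cat.codes.tolist()  # integer codes that back the categories

print(steel.dtype)  # the pandas dtype is categorical
print(message)      # yet the conversion error refers to a string label
print(codes)
```

Converting the column to a NumPy array yields the labels themselves, so an estimator that expects floats sees strings unless an encoder runs first.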

I know that one-hot encoding is preferred, but in the past it also worked without it. The error also occurs, for example, when imputing first and then one-hot encoding.
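One arrangement of imputation followed by one-hot encoding that scikit-learn accepts is to impute each column type separately, so that no numeric-only step ever sees the category labels. This is a sketch under assumptions: the frame and its column names are made up, and it is not necessarily the pipeline that produced the error above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with a missing value in both the categorical and numeric column.
X = pd.DataFrame(
    {
        "steel": pd.Series(["ZS", np.nan, "A", "ZS"], dtype="category"),
        "thick": [0.7, np.nan, 1.2, 0.9],
    }
)

# Categorical columns: fill gaps with the most frequent label, then encode.
categorical_steps = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    [
        ("categorical", categorical_steps, ["steel"]),
        ("continuous", SimpleImputer(strategy="median"), ["thick"]),
    ]
)

Xt = preprocess.fit_transform(X)
print(Xt.shape)  # two one-hot columns plus one numeric column
```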

PGijsbers commented 1 year ago

There are also examples that work with categorical data, e.g., this pipeline from the docs. Is it possible you mixed them up? As far as I am aware, openml-python never did any imputation or encoding itself, so the only explanation would be that scikit-learn changed (though I'm not aware of any change in scikit-learn that would explain it).

Example for running a pipeline on kr-vs-kp:

import openml
from sklearn import pipeline, compose, preprocessing, impute, ensemble, tree

# OpenML helper functions for sklearn can be plugged in directly for complicated pipelines
from openml.extensions.sklearn import cat, cont

openml.config.start_using_configuration_for_example()

task = openml.tasks.get_task(7)

pipe = pipeline.Pipeline(
    steps=[
        (
            "Preprocessing",
            compose.ColumnTransformer(
                [
                    (
                        "categorical",
                        # sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
                        preprocessing.OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
                        cat,  # returns the categorical feature indices
                    ),
                    (
                        "continuous",
                        impute.SimpleImputer(strategy="median"),
                        cont,  # returns the numeric feature indices
                    ),
                ]
            ),
        ),
        ("Classifier", tree.DecisionTreeClassifier()),
    ]
)

run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)
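The cat and cont helpers are passed to ColumnTransformer as callables that pick columns at fit time. A rough sketch of that mechanism follows; categorical_columns and continuous_columns are hypothetical stand-ins written for illustration (an assumption about the behaviour, not openml-python's implementation), and the frame is made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def categorical_columns(X):
    """Hypothetical selector: names of the columns with pandas category dtype."""
    return X.select_dtypes(include="category").columns.tolist()


def continuous_columns(X):
    """Hypothetical selector: names of all remaining columns."""
    return X.select_dtypes(exclude="category").columns.tolist()


# Made-up frame: one categorical flag and one numeric column with a gap.
X = pd.DataFrame(
    {
        "flag": pd.Series(["f", "t", "f"], dtype="category"),
        "score": [1.0, np.nan, 3.0],
    }
)

# ColumnTransformer calls each selector on X during fit to resolve the columns.
preprocessing_step = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        ("continuous", SimpleImputer(strategy="median"), continuous_columns),
    ]
)

Xt = preprocessing_step.fit_transform(X)
print(categorical_columns(X), continuous_columns(X), Xt.shape)
```

Selecting by dtype rather than by hard-coded names keeps the same pipeline usable across tasks whose datasets have different columns.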

LennartPurucker commented 1 year ago

I think @PGijsbers's statement and code are a potential solution to this issue. Do you know if this resolved your problem, @janvanrijn? Or is this still a problem with openml-python that I could look into?

openml / openml-python

Problem with datasets with categorical attributes #1228