Open janvanrijn opened 1 year ago
Could you please provide us with the output of
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import openml; print("OpenML", openml.__version__)
so we know the versions of scikit-learn and OpenML-Python?
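For reference, the same information can be gathered in one go even when one of the packages is missing. This is only a minimal sketch using the standard library: the helper name `collect_versions` is made up for illustration, and it assumes the packages are installed under the distribution names `numpy`, `scipy`, `scikit-learn` and `openml`.

```python
import platform
import sys
from importlib import metadata


def collect_versions(distributions=("numpy", "scipy", "scikit-learn", "openml")):
    """Return a dict of version strings, tolerating packages that are not installed."""
    info = {
        "platform": platform.platform(),
        "python": sys.version.replace("\n", " "),
    }
    for name in distributions:
        try:
            # Looks up the installed distribution's metadata without importing it.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Reading the version from the package metadata avoids import-time failures, which can help when the bug being reported is itself an import problem.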
~+1 for info, cannot reproduce this locally on a fresh install.~ Wait. Are you talking about the scikit-learn error from the line run = openml.runs.run_model_on_task(clf, task), i.e. ValueError: could not convert string to float: 'ZS'? That is because you changed the dataset from the example. The provided scikit-learn pipeline cannot handle string data; it would need an encoder for that.
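To illustrate where a message like this comes from, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: `to_float_row` and `one_hot` are hypothetical helpers, and the toy rows are invented. The assumption is that a numeric-only estimator effectively tries to convert every cell to a float, which fails on a category label unless that column has been encoded first.

```python
def to_float_row(row):
    """Mimic a numeric-only input check: every cell must parse as a float."""
    return [float(cell) for cell in row]


def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns (simplified sketch)."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        indicators = [1.0 if row[column] == cat else 0.0 for cat in categories]
        encoded.append(row[:column] + indicators + row[column + 1:])
    return encoded, categories


# Toy data: one numeric feature and one categorical feature.
rows = [[1.5, "ZS"], [2.0, "AB"], [0.5, "ZS"]]

try:
    to_float_row(rows[0])
except ValueError as err:
    message = str(err)  # float("ZS") fails, just like the reported error

# After encoding the categorical column, every cell converts cleanly.
encoded, categories = one_hot(rows, column=1)
numeric = [to_float_row(row) for row in encoded]
```

In a real pipeline the encoding step would be handled by a scikit-learn transformer rather than by hand-written code like this.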
Sidenote: I noticed that task 32 is not actually credit-g (opened as separate issue #1229).
Here is the version info:
Linux-5.19.0-35-generic-x86_64-with-glibc2.35
Python 3.10.9 (main, Mar 8 2023, 10:47:38) [GCC 11.2.0]
NumPy 1.23.5
SciPy 1.10.0
Scikit-Learn 1.2.2
OpenML 0.13.0
This is indeed the error ValueError: could not convert string to float: 'ZS'. Note that this is not a string value, but a categorical value. AFAIK this is not dataset specific: I had similar issues on the live server in the OpenML-CC18. When I use task 7 on the test server (kr-vs-kp) I get a similar error: ValueError: could not convert string to float: 'f'.
I know that one-hot encoding is preferred, but in the past it also worked without it (and, for example, this error also occurs when applying imputation first and one-hot encoding afterwards).
There are also examples which work with categorical data, e.g., this pipeline from the docs. Is it possible you mixed them up? As far as I am aware, openml-python never did any imputation or encoding itself, so the only explanation would be that scikit-learn changed (though I'm not aware of any changes in scikit-learn that would explain this).
Example for running a pipeline on kr-vs-kp:
import openml
from sklearn import pipeline, compose, preprocessing, impute, ensemble, tree

# OpenML helper functions for sklearn can be plugged in directly for complicated pipelines
from openml.extensions.sklearn import cat, cont

openml.config.start_using_configuration_for_example()
task = openml.tasks.get_task(7)

pipe = pipeline.Pipeline(
    steps=[
        (
            "Preprocessing",
            compose.ColumnTransformer(
                [
                    (
                        "categorical",
                        preprocessing.OneHotEncoder(sparse=False, handle_unknown="ignore"),
                        cat,  # returns the categorical feature indices
                    ),
                    (
                        "continuous",
                        impute.SimpleImputer(strategy="median"),
                        cont,  # returns the numeric feature indices
                    ),
                ]
            ),
        ),
        ("Classifier", tree.DecisionTreeClassifier()),
    ]
)

run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)
I think @PGijsbers's statement and code are a potential solution to this issue.
Do you know if this resolved your problem, @janvanrijn? Or is this still a problem with openml-python that I could look into?
The following code crashes when applied to datasets with categorical attributes (it comes from the examples).
@mfeurer @prabhant @PGijsbers