Error with string features (pandas)

marcoslbueno commented 2 years ago

I am using a classification dataset with a mixture of string and categorical features in a pandas dataframe, and this causes GAMA to fail (see the MRE below).

import openml 
from sklearn.model_selection import train_test_split
import gama

if __name__ == '__main__':
    did = 42530
    data = openml.datasets.get_dataset(did)
    X, y, _, _ = data.get_data(dataset_format='dataframe', target=data.default_target_attribute)

    X = X[y.notnull()]
    y = y[y.notnull()]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print("loaded data")

    time_fold = 5*60
    metric = 'accuracy'

    clf = gama.GamaClassifier(max_total_time=time_fold,
                              random_state=1,
                              scoring=metric,
                              n_jobs=1,
                              store='nothing')

    clf.fit(X_train, y_train)
    print("finished fit.")

    proba_predictions = clf.predict_proba(X_test)
    print("finished predictions test data.")

The error I get is

loaded data
Traceback (most recent call last):
  File "mre_gama.py", line 39, in <module>
    clf.fit(X_train, y_train)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
    self.model = self._post_processing.post_process(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 27, in post_process
    return self._selected_individual.pipeline.fit(x, y)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 288, in fit
    X = self._validate_input(X, in_fit=True)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 260, in _validate_input
    raise new_ve from None
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'Midwest'

The problem is solved when I convert the string features (in this case, columns 0 and 22) to the category dtype. I think it would be best if GAMA did this automatically, since it appears to be a simple conversion.
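A minimal sketch of that manual conversion, assuming the string columns are known by position. The toy dataframe, its column names and its values are made up for illustration; only the idea of casting selected columns to category comes from the report above.

```python
import pandas as pd

# Toy stand-in for the OpenML data; column names and values are invented.
X = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South", "Midwest"], dtype=object),
        "age": [34, 51, 27],
        "segment": pd.Series(["a", "b", "a"], dtype=object),
    }
)

# Positions of the string columns (0 and 22 in the reported dataset; 0 and 2 here).
string_positions = [0, 2]
string_columns = X.columns[string_positions]

# Cast only those columns to the pandas category dtype; other columns are untouched.
X = X.astype({name: "category" for name in string_columns})

print(X.dtypes)
```

The conversion has to happen before the train/test split (or be applied to both parts), so that the dataframe passed to predict_proba carries the same dtypes as the one passed to fit.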

PGijsbers commented 2 years ago

Thanks for raising the issue! The error stems from a design assumption: because DataFrames carry type annotations (their dtypes), GAMA expects those annotations to be correct (pass unannotated numpy arrays otherwise). By providing a feature that is explicitly non-categorical (technically of dtype object), you go against this assumption. GAMA can't work with an object-dtype series, so it raises an error, although a poor and late one (#132).
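To see in advance which columns would violate that assumption, one could check the dtypes before calling fit. The following is only a sketch: find_object_columns and check_no_object_columns are hypothetical helper names, not part of GAMA, and the example dataframe is invented.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def find_object_columns(df: pd.DataFrame) -> list:
    """Return the names of all columns whose dtype is object."""
    return [name for name, dtype in df.dtypes.items() if is_object_dtype(dtype)]


def check_no_object_columns(df: pd.DataFrame) -> None:
    """Raise early, with the offending column names, if any column is object-typed."""
    offending = find_object_columns(df)
    if offending:
        raise TypeError(
            f"Columns {offending} have dtype object; "
            "convert them to 'category' or a numeric dtype first."
        )


# Invented example data: one object column, one integer column.
example = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South"], dtype=object),
        "age": [34, 51],
    }
)
print(find_object_columns(example))
```

Calling check_no_object_columns(X_train) just before clf.fit would surface the problem immediately instead of during post-processing.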

If you want feature type inference, consider passing the data in numpy format:

- clf.fit(X_train, y_train)
+ clf.fit(X_train.values, y_train.values)

- proba_predictions = clf.predict_proba(X_test)
+ proba_predictions = clf.predict_proba(X_test.values)

By design, I think it is good to assume that the user is an expert on the data: they can help the AutoML system by annotating data types. However, expanding the interface to infer types for pandas object series when explicitly requested (e.g. infer_objects=True) sounds reasonable to me. What do you think?

marcoslbueno commented 2 years ago

Thanks for replying! Indeed, with your suggestion GAMA finished without errors.

I think adding a parameter like infer_objects=True makes a lot of sense, since users might be unsure about the column types of their dataset (even when using dataframes) and/or might not want to check them.
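Until such an option exists, the opt-in behaviour discussed here can be approximated on the user side. The sketch below is an assumption about what infer_objects=True might do (treat every object column as categorical); prepare_features is a hypothetical helper, not a GAMA function, and the data is invented.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def prepare_features(df: pd.DataFrame, infer_objects: bool = False) -> pd.DataFrame:
    """Hypothetical pre-fit step, not part of GAMA.

    With infer_objects=False the dtypes are trusted and df is returned as is.
    With infer_objects=True every object column is cast to category.
    """
    if not infer_objects:
        return df
    object_columns = [
        name for name, dtype in df.dtypes.items() if is_object_dtype(dtype)
    ]
    return df.astype({name: "category" for name in object_columns})


# Invented example data: one object column, one float column.
features = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South", "West"], dtype=object),
        "income": [41.5, 38.0, 52.3],
    }
)
converted = prepare_features(features, infer_objects=True)
print(converted.dtypes)
```

Applying the helper to the full dataframe before train_test_split keeps the category levels identical in the training and test parts.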

openml-labs / gama

Error with string features (pandas) #131