microsoft / FLAML

AutoML compatibility w/ sklearn cross-validation & roc_auc #466

Closed: username725 closed this issue 2 years ago

username725 commented 2 years ago

To perform nested cross-validation, I would like to pass an AutoML instance straight to sklearn:

sklearn.model_selection.cross_val_score(automl, X, y, cv=2)

However, with no scoring argument sklearn falls back to the estimator's score() method, which AutoML does not have. Okay, let's give sklearn a scoring method explicitly:

sklearn.model_selection.cross_val_score(automl, X, y, scoring='roc_auc', cv=2)

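This ends up in sklearn's _BaseScorer._select_proba_binary(), which compares classes_ against pos_label element-wise and therefore needs classes_ to be a NumPy ndarray. AutoML explicitly converts classes_ to a list, so the comparison yields a single False and the lookup of the positive-class column fails.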

Full example:

import numpy as np
import sklearn.model_selection
from flaml import AutoML

X = np.random.random(size=(10, 1))
y = np.random.choice([False, True], size=10)
automl = AutoML(time_budget=5)
sklearn.model_selection.cross_val_score(automl, X, y, scoring='roc_auc', cv=2)

This leads to the error:

    col_idx = np.flatnonzero(classes == pos_label)[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
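
The failure can be reproduced with NumPy alone. This is a minimal sketch, assuming the scorer picks the positive-class column with the expression shown in the traceback; the variable names are illustrative, and pos_label is assumed to be the second class.

```python
import numpy as np

pos_label = True  # assumption: the positive label is the second class

# A plain list compared with a scalar gives one Python bool,
# not an element-wise mask.
classes_list = [False, True]
mask_from_list = classes_list == pos_label       # a single False
idx_from_list = np.flatnonzero(mask_from_list)   # empty, so [0] would raise IndexError

# An ndarray compares element-wise, so the positive class is found.
classes_array = np.array(classes_list)
idx_from_array = np.flatnonzero(classes_array == pos_label)

print(idx_from_list.size)
print(idx_from_array[0])
```

With the list, indexing the empty result with [0] raises the IndexError quoted above; with the array, the index of the positive class comes back as expected.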

A workaround is to override classes_ so that it returns an array:

class MyAutoML(AutoML):
    @property
    def classes_(self):
        return np.array(super().classes_)

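Since a workaround was found, this isn't high priority, but I wonder:

  • Does a decision_function() make sense for AutoML?
  • Does a score() function make sense?
  • Is there a compatibility reason for calling .tolist() on classes_?

Versions: FLAML 0.9.6, scikit-learn 1.0.2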

sonichi commented 2 years ago

  • Does a decision_function() make sense for AutoML?

Not sure, because it is not applicable to all learners and tasks.

  • Does a score() function make sense?

Yes, it makes sense. Would you like to add it?

  • Is there a compatibility reason for calling .tolist() on classes_?

We used .tolist() to make it work with automlbenchmark. Let me try converting it to an np.array. If that still works, we should make it compatible.
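
On the score() question, one possible shape is sketched below. This is not FLAML's implementation: AccuracyScoreMixin and MajorityClassifier are hypothetical names, the toy classifier only stands in for AutoML so the sketch runs without FLAML installed, and mean accuracy is assumed as the metric because that is what sklearn's ClassifierMixin.score reports.

```python
import numpy as np


class AccuracyScoreMixin:
    """Hypothetical mixin: adds score() to any estimator that has predict()."""

    def score(self, X, y):
        # Mean accuracy over the given samples.
        y_pred = np.asarray(self.predict(X))
        return float(np.mean(y_pred == np.asarray(y)))


class MajorityClassifier(AccuracyScoreMixin):
    """Toy stand-in for AutoML: always predicts the most frequent class."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values  # kept as an ndarray, not a list
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.zeros((4, 1))
y = np.array([True, True, True, False])
model = MajorityClassifier().fit(X, y)
accuracy = model.score(X, y)
print(accuracy)
```

Since cross_val_score falls back to estimator.score(X, y) when no scoring argument is given, a method along these lines would cover the first call in the report. Accuracy only fits classification tasks; a regression task would need a different default, such as R².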

username725 commented 2 years ago

Thanks for the quick turnaround. We can consider this issue closed, and I can open a PR for score() if I get a chance.