stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Bug while fitting NGBClassifier + GridSearchCV #242

Closed AlejandroBaron closed 3 years ago

AlejandroBaron commented 3 years ago

Hello! I'm trying to fit an NGBClassifier with hyperparameter tuning and I'm running into the following issue:

from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True) # return_X_y must be passed by keyword in recent scikit-learn
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes

b1 = DecisionTreeClassifier()
b2 = DecisionTreeClassifier()

param_grid = {
    'minibatch_frac': [1.0, 0.5],
    'Base': [b1, b2]
}

grid_search = GridSearchCV(ngb, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

The following error is thrown:

ValueError: Unknown label type: 'continuous'

It happens with other datasets as well; I used this one so you have a reproducible example. Using either NGBClassifier or DecisionTreeClassifier on its own works like a charm (so it doesn't seem to be a type error), but when they are used through GridSearchCV, this error arises.

alejandroschuler commented 3 years ago

Interesting... I'm not sure if it's an issue with ngboost, but maybe try setting y to be an integer array explicitly? Or maybe GridSearchCV doesn't work the same way with multi-category (>2) classifiers?

AlejandroBaron commented 3 years ago

Tried both things and it still doesn't work. It's an NGBoost interaction problem. If I use just a decision tree classifier, for instance, it works like a charm:

from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True) # return_X_y must be passed by keyword in recent scikit-learn
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree = DecisionTreeClassifier()

param_grid={
    "criterion":["gini","entropy"]
}

grid_search = GridSearchCV(tree, param_grid=param_grid, cv=5, verbose=10) # search over the tree defined above
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

So I'd say NGBoost transforms the answer somewhere in the process; I don't know how or why, though.

MikeOMa commented 3 years ago

https://github.com/stanfordmlgroup/ngboost/issues/225

I think it might be the same problem as here.

You need to use a base learner that produces predictions on the real line, i.e. a regressor rather than a classifier (b1=DecisionTreeRegressor).

alejandroschuler commented 3 years ago

Good eye @MikeOMa, I didn't notice that had been changed from the example! @AlejandroBaron, that is definitely the problem. I'll close the issue now, but lmk if anything else comes up.
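
For anyone landing here later, a minimal sketch of why a classifier base learner fails. It uses only scikit-learn and NumPy, not ngboost itself. The working assumption is that each boosting stage fits its base learner to real-valued gradient targets rather than to the class labels; the `pseudo_gradients` array and the `try_fit` helper below are hypothetical stand-ins for illustration, not NGBoost internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Hypothetical stand-in for the real-valued targets a boosting stage
# hands to its base learner (not NGBoost's actual gradients).
pseudo_gradients = rng.normal(size=100)


def try_fit(base_learner):
    """Fit the learner on continuous targets; return None on success,
    or the error message if scikit-learn rejects the targets."""
    try:
        base_learner.fit(X, pseudo_gradients)
        return None
    except ValueError as err:
        return str(err)


# A classifier refuses continuous targets; a regressor accepts them.
classifier_error = try_fit(DecisionTreeClassifier())
regressor_error = try_fit(DecisionTreeRegressor())

print("DecisionTreeClassifier:", classifier_error)
print("DecisionTreeRegressor:", regressor_error)
```

Under that assumption, the fix for the snippet in the first comment is to put DecisionTreeRegressor instances in the 'Base' entry of param_grid and leave the rest of the grid search unchanged.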