stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Base learner for NGBClassifier #225

Closed · imitusov closed 3 years ago

imitusov commented 3 years ago

Am I right that there is no option to use a classifier as the Base learner for NGBClassifier?

As soon as I pass a classifier as the Base learner, I get the following error. At the same time, if I pass learner = RandomForestRegressor(n_estimators=500, max_depth=7, min_samples_leaf=50, n_jobs=-1, max_features="sqrt", max_samples=.66, random_state=42) as the learner, everything works just fine (see the working snippet after the traceback below). The example was taken from

from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

_X, _y = load_breast_cancer(return_X_y=True)
_y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(_X, _y, test_size=0.2)
learner = RandomForestClassifier(n_estimators=500, max_depth=7, min_samples_leaf=50, n_jobs=-1,
    max_features="sqrt", max_samples=.66, random_state=42, class_weight='balanced')

ngb_cat = NGBClassifier(Base=learner, Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}
ValueError                                Traceback (most recent call last)
<ipython-input-21-8eb58a515847> in <module>
     10 
     11 ngb_cat = NGBClassifier(Base=learner, Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
---> 12 _ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

~/.local/lib/python3.7/site-packages/ngboost/ngboost.py in fit(self, X, Y, X_val, Y_val, sample_weight, val_sample_weight, train_loss_monitor, val_loss_monitor, early_stopping_rounds)
    255             grads = D.grad(Y_batch, natural=self.natural_gradient)
    256 
--> 257             proj_grad = self.fit_base(X_batch, grads, weight_batch)
    258             scale = self.line_search(proj_grad, P_batch, Y_batch, weight_batch)
    259 

~/.local/lib/python3.7/site-packages/ngboost/ngboost.py in fit_base(self, X, grads, sample_weight)
    139     def fit_base(self, X, grads, sample_weight=None):
    140         models = [
--> 141             clone(self.Base).fit(X, g, sample_weight=sample_weight) for g in grads.T
    142         ]
    143         fitted = np.array([m.predict(X) for m in models]).T

~/.local/lib/python3.7/site-packages/ngboost/ngboost.py in <listcomp>(.0)
    139     def fit_base(self, X, grads, sample_weight=None):
    140         models = [
--> 141             clone(self.Base).fit(X, g, sample_weight=sample_weight) for g in grads.T
    142         ]
    143         fitted = np.array([m.predict(X) for m in models]).T

~/.local/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    328         self.n_outputs_ = y.shape[1]
    329 
--> 330         y, expanded_class_weight = self._validate_y_class_weight(y)
    331 
    332         if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:

~/.local/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_y_class_weight(self, y)
    556 
    557     def _validate_y_class_weight(self, y):
--> 558         check_classification_targets(y)
    559 
    560         y = np.copy(y)

~/.local/lib/python3.7/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'
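
For comparison, a minimal sketch of the variant that works for me (same data and split as above, with the RandomForestRegressor from the report passed as the learner):

from sklearn.ensemble import RandomForestRegressor

learner = RandomForestRegressor(n_estimators=500, max_depth=7, min_samples_leaf=50, n_jobs=-1,
    max_features="sqrt", max_samples=.66, random_state=42)

# same NGBClassifier call as before, but with a regressor as the Base
ngb_cat = NGBClassifier(Base=learner, Dist=k_categorical(3), verbose=False)
_ = ngb_cat.fit(X_cls_train, Y_cls_train)  # fits without raising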
alejandroschuler commented 3 years ago

I think your issue is that you're trying to use a classifier as the base learner. Even though the NGBoost model as a whole is doing (probabilistic) classification, the base learner must always be a regression model. That's because the job of the base learners is to help estimate a continuous value, in this case the logits of the class probabilities. This is also the case in other boosting implementations, e.g. an xgboost classifier still uses regression trees as its base learners.
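
Concretely, each boosting iteration does roughly the following (a simplified sketch of the fit_base step shown in the traceback): it clones the Base once per distribution parameter and fits each clone to a column of continuous natural gradients, which is exactly the step where a classifier's fit() rejects the targets.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
grads = rng.normal(size=(100, 3))  # natural gradients: one continuous column per distribution parameter

Base = DecisionTreeRegressor(max_depth=3)
# one cloned base model per gradient column; the fit targets are floats,
# so each clone must be a regression model
models = [clone(Base).fit(X, g) for g in grads.T]
fitted = np.array([m.predict(X) for m in models]).T  # the projected gradients

Swapping a classifier in for Base here reproduces the "Unknown label type: 'continuous'" error above. If you don't pass Base at all, NGBoost defaults to a shallow DecisionTreeRegressor.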