microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Predicted probabilities are different in `lightgbm 2.3.1` and `lightgbm 3.0.0` #4583

Closed: alfaro96 closed this issue 3 years ago

alfaro96 commented 3 years ago

Description

We are seeing test failures in scikit-learn because LGBMClassifier predicts different probabilities under lightgbm 2.3.1 and lightgbm 3.0.0, and I have not found a breaking change in the release notes. Our equivalence tests between HistGradientBoostingClassifier and LGBMClassifier passed with lightgbm 2.3.1 but fail with lightgbm 3.0.0.

Has there been a breaking change that we have not noticed?

Reproducible example

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
import numpy as np

rng = np.random.RandomState(seed=0)

X, y = make_classification(
    n_samples=255,
    n_classes=3,
    n_features=5,
    n_informative=5,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

est_lightgbm = LGBMClassifier(
    boost_from_average=True,
    enable_bundle=False,
    learning_rate=2,
    max_bin=255,
    max_depth=None,
    min_child_samples=1,
    min_data_in_bin=1,
    min_split_gain=0,
    min_sum_hessian_in_leaf=0.002,
    n_estimators=1,
    num_leaves=4096,
    objective='multiclass',
    verbosity=-10)

est_lightgbm.fit(X_train, y_train)
proba_lightgbm = est_lightgbm.predict_proba(X_train)

With lightgbm 2.3.1, proba_lightgbm is:

array([[0.01069576, 0.01061513, 0.97868911],
       [0.01069576, 0.97868911, 0.01061513],
       ...,
       [0.97743991, 0.01128004, 0.01128004],
       [0.01069576, 0.97868911, 0.01061513]])

And with lightgbm 3.0.0:

array([[0.00238864, 0.00238943, 0.99522193],
       [0.00238864, 0.99522193, 0.00238943],
       ...,
       [0.99475717, 0.00262141, 0.00262141],
       [0.00238864, 0.99522193, 0.00238943]])
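One way to compare the two outputs is to convert the probabilities back to log-odds gaps, since a change in the size of the boosting step should show up as a roughly constant ratio between the gaps. The sketch below uses only the rows printed above; the helper name `log_odds_gap` is made up for illustration, and reading the ratio as a step-size change is an assumption, not a confirmed diagnosis.

```python
import numpy as np

# Rows copied from the two arrays printed above (first and third shown rows).
proba_231 = np.array([[0.01069576, 0.01061513, 0.97868911],
                      [0.97743991, 0.01128004, 0.01128004]])
proba_300 = np.array([[0.00238864, 0.00238943, 0.99522193],
                      [0.99475717, 0.00262141, 0.00262141]])


def log_odds_gap(proba):
    """Hypothetical helper: log(largest / smallest probability) per row."""
    return np.log(proba.max(axis=1) / proba.min(axis=1))


gap_231 = log_odds_gap(proba_231)
gap_300 = log_odds_gap(proba_300)

# A ratio that is nearly the same for every row hints at a rescaled step.
ratio = gap_300 / gap_231
print(gap_231, gap_300, ratio)
```

For these rows the gaps come out near 4.5 and 6.0, so the ratio is close to 4/3 in both cases.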

Environment info

lightgbm version: 2.3.1 and 3.0.0.

Commands you used to install lightgbm

pip install lightgbm==2.3.1

and

pip install lightgbm==3.0.0

jameslamb commented 3 years ago

Thanks very much for your question. Are you able to create a reproducible example that imports LGBMClassifier directly from lightgbm, instead of using sklearn.ensemble._hist_gradient_boosting.utils.get_equivalent_estimator? That would eliminate a layer of indirection and make this discussion a bit more focused.

jmoralez commented 3 years ago

I believe it could be a conflict between this https://github.com/microsoft/LightGBM/blob/4ee6399db460644c14847662e1ccd55ebb026c17/src/objective/multiclass_objective.hpp#L31 and this:

if sklearn_params['loss'] == 'categorical_crossentropy':
    # LightGBM multiplies hessians by 2 in multiclass loss.
    lightgbm_params['min_sum_hessian_in_leaf'] *= 2
    lightgbm_params['learning_rate'] *= 2

alfaro96 commented 3 years ago

> Thanks very much for your question. Are you able to create a reproducible example that imports LGBMClassifier directly from lightgbm, instead of using sklearn.ensemble._hist_gradient_boosting.utils.get_equivalent_estimator? That would eliminate a layer of indirection and make this discussion a bit more focused.

Thank you @jameslamb for the quick reply. I have modified the original reproducible example to eliminate the indirection.

jmoralez commented 3 years ago

I'm able to reproduce the values from 2.3.1 on 3.0.0 by setting learning_rate=3/2 instead of 2. 3/2 is the factor that was introduced before the 3.0.0 release. Hope this helps.
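The arithmetic behind this can be sketched for a single boosting round. The code below is a simplified model, not LightGBM's implementation: it assumes balanced class priors, pure leaves, no regularization, a hessian of `factor * p * (1 - p)`, and a factor of 2 in 2.3.1 versus `n_classes / (n_classes - 1)` in 3.0.0 (the latter is a reading of the linked source line). The function `first_round_proba` is a made-up name for illustration.

```python
import numpy as np


def first_round_proba(learning_rate, hessian_factor, n_classes=3):
    """Toy model of one multiclass boosting round for a sample of class 0.

    Assumes balanced priors, a pure leaf and no regularization; this is
    an illustration, not LightGBM's actual code path.
    """
    prior = np.full(n_classes, 1.0 / n_classes)
    y = np.zeros(n_classes)
    y[0] = 1.0

    grad = prior - y                               # softmax cross-entropy gradient
    hess = hessian_factor * prior * (1.0 - prior)  # scaled hessian (assumed form)
    leaf = -learning_rate * grad / hess            # Newton step per class tree

    raw = np.log(prior) + leaf                     # start from the class priors
    exp = np.exp(raw - raw.max())                  # numerically stable softmax
    return exp / exp.sum()


n_classes = 3
old = first_round_proba(2.0, 2.0)                              # 2.3.1-style factor
new = first_round_proba(2.0, n_classes / (n_classes - 1))      # assumed 3.0.0 factor
fixed = first_round_proba(1.5, n_classes / (n_classes - 1))    # learning_rate=3/2

print(old[0], new[0], fixed[0])
```

Under these assumptions the top-class probability is about 0.978 with the old factor and about 0.995 with the new one, close to the values in the report. Only the quotient `learning_rate / hessian_factor` matters in this model, which is why lowering the learning rate to 3/2 recovers the old numbers.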

alfaro96 commented 3 years ago

I will close the issue since it seems to be solved.

Thank you @jmoralez and @jameslamb for the quick replies!
