microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

LightGBM incorrectly reports best score/iteration #4842

Open eli-osherovich opened 2 years ago

eli-osherovich commented 2 years ago

Description

When fitting is terminated by reaching the maximal allowed number of trees, the best score and best iteration are not set correctly. In the example below I use the early_stopping callback, which does not stop training because the maximal number of trees is reached before the improvement stops.

Reproducible example

Example output. Note that both the best iteration (model.best_iteration) and best score (model.best_score['valid']) are incorrect.

[17:34:04] [2700]   valid's auc: 0.757849
[17:34:28] [2800]   valid's auc: 0.75756
[17:34:58] [2900]   valid's auc: 0.757639
[17:35:36] [3000]   valid's auc: 0.757364
[17:35:37] Best validation score: 
[17:35:37]     auc : 0.7573635682370727
[17:35:37] Best iteration: 0
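
Roughly what such a setup looks like (a hypothetical reconstruction, since the original snippet was not included in the report; the dataset, parameters, and the `valid` name are placeholders, and the dart setting is taken from the discussion further down):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# placeholder data standing in for the reporter's dataset
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = lgb.Dataset(X_train, label=y_train)
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

model = lgb.train(
    params={
        "objective": "binary",
        "metric": "auc",
        "boosting": "dart",      # the discussion below narrows the problem down to dart
        "num_iterations": 3000,
    },
    train_set=dtrain,
    valid_sets=[dvalid],
    valid_names=["valid"],
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(100)],
)

# Training stops because num_iterations is exhausted before the callback fires, yet:
print(model.best_iteration)        # reported as 0
print(model.best_score["valid"])   # reported as the last iteration's score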

Environment info

LightGBM version or commit hash: 3.3.1
Command(s) you used to install LightGBM:

conda install lightgbm

Additional Comments

jameslamb commented 2 years ago

Thanks very much for using LightGBM and for reporting this issue!

Can you please try to provide a minimal, reproducible example that maintainers can run to reproduce the behavior you're seeing? Ideally using a dataset from sklearn.datasets or one created with pure numpy or pandas code.

The example I've created below shows lightgbm handling the "early stopping was not triggered" case correctly. The relevant lines from its output:

Did not meet early stopping. Best iteration is:
[10]    sparkly-unicorn's increasing_metric: 0.2

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)

metric_value = 0.1

def _increasing_metric(preds, labeled_data):
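    # custom eval metric whose value grows by 0.01 every call, so the metric
    # keeps "improving" and the last iteration should be reported as the best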
    global metric_value
    metric_value += 0.01
    name = "increasing_mmetric"
    higher_better = True
    return name, 0.0 + metric_value, higher_better

evals_result = {}

bst = lgb.train(
    train_set=dtrain,
    params={
        "early_stopping_rounds": 2,
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_increasing_metric
)

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 10
[LightGBM] [Info] Start training from score -2.465802
[1] sparkly-unicorn's increasing_metric: 0.11
Training until validation scores don't improve for 2 rounds
[2] sparkly-unicorn's increasing_metric: 0.12
[3] sparkly-unicorn's increasing_metric: 0.13
[4] sparkly-unicorn's increasing_metric: 0.14
[5] sparkly-unicorn's increasing_metric: 0.15
[6] sparkly-unicorn's increasing_metric: 0.16
[7] sparkly-unicorn's increasing_metric: 0.17
[8] sparkly-unicorn's increasing_metric: 0.18
[9] sparkly-unicorn's increasing_metric: 0.19
[10]    sparkly-unicorn's increasing_metric: 0.2
Did not meet early stopping. Best iteration is:
[10]    sparkly-unicorn's increasing_metric: 0.2

I observed the same behavior with lightgbm 3.3.1 and on the current state of master (https://github.com/microsoft/LightGBM/commit/aa413f7ed93a2d29f5d9ffad21c6ee8de9874228).

eli-osherovich commented 2 years ago

You are absolutely right.

I run a lot of different models. After checking where exactly this behavior happens, I found out that it is due to dart boosting.

eli-osherovich commented 2 years ago

@jameslamb Slightly changing your example will demonstrate the issue:

best iteration = 0 (which is wrong)
best score = 0.2 (which is wrong)

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)

metric_value = 0.1

def _decreasing_metric(preds, labeled_data):
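    # same helper, but with higher_better = False: the value still grows by 0.01
    # every call, so the first iteration should be reported as the best one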
    global metric_value
    metric_value += 0.01
    name = "decreasing_metric"
    higher_better = False
    return name, 0.0 + metric_value, higher_better

evals_result = {}

bst = lgb.train(
    train_set=dtrain,
    params={
        "boosting": "dart",
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_decreasing_metric
)
print(f"Best iteration: {bst.best_iteration}")
print(f"Best score: {bst.best_score}")

The code will print:

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000537 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 10
[LightGBM] [Info] Start training from score -0.009100
[1] sparkly-unicorn's decreasing_metric: 0.11
[2] sparkly-unicorn's decreasing_metric: 0.12
[3] sparkly-unicorn's decreasing_metric: 0.13
[4] sparkly-unicorn's decreasing_metric: 0.14
[5] sparkly-unicorn's decreasing_metric: 0.15
[6] sparkly-unicorn's decreasing_metric: 0.16
[7] sparkly-unicorn's decreasing_metric: 0.17
[8] sparkly-unicorn's decreasing_metric: 0.18
[9] sparkly-unicorn's decreasing_metric: 0.19
[10]    sparkly-unicorn's decreasing_metric: 0.2
Best iteration: 0
Best score: defaultdict(<class 'collections.OrderedDict'>, {'sparkly-unicorn': OrderedDict([('decreasing_metric', 0.20000000000000007)])})

StrikerRUS commented 2 years ago

@eli-osherovich You set higher_better = False, so

best iteration = 0
best score = 0.2

are correct results.

eli-osherovich commented 2 years ago

@StrikerRUS Good catch, but that does not change the fact that DART boosting returns wrong numbers. It seems to always return best iteration = 0 and best score = the last iteration's score.

P. S. Check yourself -- are the results really correct? When did the system get to the 0.2 score? At iteration 0?

eli-osherovich commented 2 years ago

@StrikerRUS , @jameslamb

Here is an updated example:

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import random

X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)

metric_value = 0.1

def _decreasing_metric(preds, labeled_data):
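    # the returned value is now random (metric_value is incremented but unused),
    # so the reported best should correspond to the smallest value seen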
    global metric_value
    metric_value += 0.01
    name = "decreasing_metric"
    higher_better = False
    return name, 0.0 + random.random(), higher_better

evals_result = {}

bst = lgb.train(
    train_set=dtrain,
    params={
        "boosting": "dart",
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_decreasing_metric
)
print(f"Best iteration: {bst.best_iteration}")
print(f"Best score: {bst.best_score}")

Which produces

[1] sparkly-unicorn's decreasing_metric: 0.196159
[2] sparkly-unicorn's decreasing_metric: 0.896386
[3] sparkly-unicorn's decreasing_metric: 0.51095
[4] sparkly-unicorn's decreasing_metric: 0.162775
[5] sparkly-unicorn's decreasing_metric: 0.936386
[6] sparkly-unicorn's decreasing_metric: 0.356681
[7] sparkly-unicorn's decreasing_metric: 0.139057
[8] sparkly-unicorn's decreasing_metric: 0.360889
[9] sparkly-unicorn's decreasing_metric: 0.723269
[10]    sparkly-unicorn's decreasing_metric: 0.435345
Best iteration: 0
Best score: defaultdict(<class 'dict'>, {'sparkly-unicorn': {'decreasing_metric': 0.4353452856052513}})

P. S. I wonder if you really do not see a problem here, @StrikerRUS ?

StrikerRUS commented 2 years ago

@eli-osherovich Thanks for your clarification!

Good catch, but does not change the fact that DART boosting returns wrong numbers

Early stopping doesn't work in DART mode. You should get a warning about it (probably you turned warnings off globally). Please refer to #1893 and https://github.com/microsoft/LightGBM/blob/90a71b9403e7facf52e2973ccdd6403a8071c898/python-package/lightgbm/callback.py#L238
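
For what it's worth, a minimal sketch of how to surface that warning (the dataset and parameters here are placeholders, and the exact warning text may vary by version):

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1_000, n_features=10)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=42)

dtrain = lgb.Dataset(X_train, label=y_train)
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

bst = lgb.train(
    params={"boosting": "dart", "objective": "regression_l2",
            "metric": "l2", "num_iterations": 10, "verbose": 1},
    train_set=dtrain,
    valid_sets=[dvalid],
    callbacks=[lgb.early_stopping(stopping_rounds=2)],
)

# Expected: LightGBM warns that early stopping is not available in dart mode,
# the callback disables itself, and best_iteration keeps its default of 0.
print(bst.best_iteration)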

eli-osherovich commented 2 years ago

@StrikerRUS Honestly, I do not see the reason why. But even without early stopping those numbers are wrong. Both best iteration and best score. Best score can be understood -- this is due to the fact that DART always returns the last version (does it?). But best iteration is completely off.

P. S. Despite the title of the issue, my example above does not use early stopping.

StrikerRUS commented 2 years ago

Honestly, I do not see the reason why.

The reason is when using dart, the previous trees will be updated. For example, in your case, although iteration 34 is best, these trees are changed in the later iterations, as dart will update the previous trees. https://github.com/microsoft/LightGBM/issues/1893#issuecomment-444803315

But even without early stopping those numbers are wrong. Both best iteration and best score.

Best iteration and best score are set only when early stopping is enabled.

this is due to the fact that DART always returns the last version

You shouldn't use DART with early stopping.

eli-osherovich commented 2 years ago

@StrikerRUS I understand that DART changes trees. But I do not understand how this prevents it from keeping the best tree. As long as the system can save models, it should be able to keep the best one.
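
For example, roughly something like this (a sketch of a "keep the best snapshot" callback; the keep_best_model helper and its dict bookkeeping are illustrative, not part of LightGBM's API):

import lightgbm as lgb
from sklearn.datasets import make_regression

def keep_best_model(best):
    # Snapshot the model whenever the first validation metric improves.
    def _callback(env):
        # entries of env.evaluation_result_list look like
        # (dataset_name, metric_name, value, is_higher_better)
        _, _, value, higher_better = env.evaluation_result_list[0]
        if best["value"] is None:
            improved = True
        elif higher_better:
            improved = value > best["value"]
        else:
            improved = value < best["value"]
        if improved:
            best["value"] = value
            best["iteration"] = env.iteration + 1
            best["model_str"] = env.model.model_to_string()
    _callback.order = 30  # run after the built-in evaluation callbacks
    return _callback

X, y = make_regression(n_samples=1_000, n_features=10)
dtrain = lgb.Dataset(X, label=y)

best = {"value": None, "iteration": 0, "model_str": None}
bst = lgb.train(
    params={"boosting": "dart", "objective": "regression_l2",
            "metric": "l2", "num_iterations": 10, "num_leaves": 8},
    train_set=dtrain,
    valid_sets=[dtrain],
    valid_names=["train"],
    callbacks=[keep_best_model(best)],
)

# Restore the snapshot taken at the best iteration, before later dart
# iterations modified its trees.
best_bst = lgb.Booster(model_str=best["model_str"])
print(best["iteration"], best["value"])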