eli-osherovich opened 2 years ago
Thanks very much for using LightGBM and for reporting this issue!
Can you please try to provide a minimal, reproducible example that maintainers can run to reproduce the behavior you're seeing? Ideally using a dataset from sklearn.datasets or one created with pure numpy or pandas code.
The example I've created below shows lightgbm handling the case "early stopping was not triggered" correctly:
Did not meet early stopping. Best iteration is:
[10] sparkly-unicorn's increasing_metric: 0.2
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)
metric_value = 0.1
def _increasing_metric(preds, labeled_data):
    global metric_value
    metric_value += 0.01
    name = "increasing_metric"
    higher_better = True
    return name, 0.0 + metric_value, higher_better
evals_result = {}
bst = lgb.train(
    train_set=dtrain,
    params={
        "early_stopping_rounds": 2,
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_increasing_metric
)
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 10
[LightGBM] [Info] Start training from score -2.465802
[1] sparkly-unicorn's increasing_metric: 0.11
Training until validation scores don't improve for 2 rounds
[2] sparkly-unicorn's increasing_metric: 0.12
[3] sparkly-unicorn's increasing_metric: 0.13
[4] sparkly-unicorn's increasing_metric: 0.14
[5] sparkly-unicorn's increasing_metric: 0.15
[6] sparkly-unicorn's increasing_metric: 0.16
[7] sparkly-unicorn's increasing_metric: 0.17
[8] sparkly-unicorn's increasing_metric: 0.18
[9] sparkly-unicorn's increasing_metric: 0.19
[10] sparkly-unicorn's increasing_metric: 0.2
Did not meet early stopping. Best iteration is:
[10] sparkly-unicorn's increasing_metric: 0.2
I observed the same behavior with lightgbm 3.3.1 and on the current state of master (https://github.com/microsoft/LightGBM/commit/aa413f7ed93a2d29f5d9ffad21c6ee8de9874228).
You are absolutely right.
I run a lot of different models. After checking where exactly this behavior happens, I found out that it is due to dart boosting.
@jameslamb Slightly changing your example will demonstrate the issue:
best iteration = 0 (which is wrong)
best score = 0.2 (which is wrong)
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)
metric_value = 0.1
def _decreasing_metric(preds, labeled_data):
    global metric_value
    metric_value += 0.01
    name = "decreasing_metric"
    higher_better = False
    return name, 0.0 + metric_value, higher_better
evals_result = {}
bst = lgb.train(
    train_set=dtrain,
    params={
        "boosting": "dart",
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_decreasing_metric
)
print(f"Best iteration: {bst.best_iteration}")
print(f"Best score: {bst.best_score}")
The code will print:
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000537 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 10
[LightGBM] [Info] Start training from score -0.009100
[1] sparkly-unicorn's decreasing_metric: 0.11
[2] sparkly-unicorn's decreasing_metric: 0.12
[3] sparkly-unicorn's decreasing_metric: 0.13
[4] sparkly-unicorn's decreasing_metric: 0.14
[5] sparkly-unicorn's decreasing_metric: 0.15
[6] sparkly-unicorn's decreasing_metric: 0.16
[7] sparkly-unicorn's decreasing_metric: 0.17
[8] sparkly-unicorn's decreasing_metric: 0.18
[9] sparkly-unicorn's decreasing_metric: 0.19
[10] sparkly-unicorn's decreasing_metric: 0.2
Best iteration: 0
Best score: defaultdict(<class 'collections.OrderedDict'>, {'sparkly-unicorn': OrderedDict([('decreasing_metric', 0.20000000000000007)])})
@eli-osherovich You set higher_better = False, so
best iteration = 0
best score = 0.2
are correct results.
@StrikerRUS Good catch, but that does not change the fact that DART boosting returns wrong numbers. It seems to always return best iteration = 0 and, as the best score, the last iteration's score.
P.S. Check yourself -- are the results really correct? When did the system get to the 0.2 score? At iteration 0?
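To make the expectation concrete, here is a small pure-Python sketch (illustrative only, not LightGBM's actual implementation) of the bookkeeping early stopping is supposed to perform on the logged values above: with higher_better = False, the best score is the lowest one, reached at iteration 1.

```python
# Sketch of best-iteration bookkeeping (illustrative, not LightGBM code),
# applied to the metric values from the log above.
scores = [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]
higher_better = False  # as in _decreasing_metric

best_iter, best_score = 0, None
for i, s in enumerate(scores, start=1):
    improved = best_score is None or (s > best_score if higher_better else s < best_score)
    if improved:
        best_iter, best_score = i, s

print(best_iter, best_score)  # -> 1 0.11, not iteration 0 with score 0.2
```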
@StrikerRUS , @jameslamb
Here is an updated example:
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import random
X, y = make_regression(n_samples=10_000, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
dtrain = lgb.Dataset(data=X_train, label=y_train)
dtest = lgb.Dataset(data=X_test, label=y_test, reference=dtrain)
metric_value = 0.1
def _decreasing_metric(preds, labeled_data):
    global metric_value
    metric_value += 0.01
    name = "decreasing_metric"
    higher_better = False
    return name, 0.0 + random.random(), higher_better
evals_result = {}
bst = lgb.train(
    train_set=dtrain,
    params={
        "boosting": "dart",
        "objective": "regression_l2",
        "metric": "None",
        "num_iterations": 10,
        "num_leaves": 8,
        "verbose": 1
    },
    valid_sets=[dtrain],
    valid_names=["sparkly-unicorn"],
    evals_result=evals_result,
    feval=_decreasing_metric
)
print(f"Best iteration: {bst.best_iteration}")
print(f"Best score: {bst.best_score}")
Which produces:
[1] sparkly-unicorn's decreasing_metric: 0.196159
[2] sparkly-unicorn's decreasing_metric: 0.896386
[3] sparkly-unicorn's decreasing_metric: 0.51095
[4] sparkly-unicorn's decreasing_metric: 0.162775
[5] sparkly-unicorn's decreasing_metric: 0.936386
[6] sparkly-unicorn's decreasing_metric: 0.356681
[7] sparkly-unicorn's decreasing_metric: 0.139057
[8] sparkly-unicorn's decreasing_metric: 0.360889
[9] sparkly-unicorn's decreasing_metric: 0.723269
[10] sparkly-unicorn's decreasing_metric: 0.435345
Best iteration: 0
Best score: defaultdict(<class 'dict'>, {'sparkly-unicorn': {'decreasing_metric': 0.4353452856052513}})
P.S. I wonder if you really do not see a problem here, @StrikerRUS?
@eli-osherovich Thanks for your clarification!
Good catch, but does not change the fact that DART boosting returns wrong numbers
Early stopping doesn't work with DART mode. You should get a warning about it (probably, you globally turned them off). Please refer to #1893 and https://github.com/microsoft/LightGBM/blob/90a71b9403e7facf52e2973ccdd6403a8071c898/python-package/lightgbm/callback.py#L238
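For illustration, the guard linked above can be sketched roughly like this (a paraphrase under the assumption that the callback checks the boosting-type aliases for "dart"; not the verbatim lightgbm/callback.py code):

```python
import warnings

# Rough paraphrase (not verbatim) of the dart guard in lightgbm's
# early-stopping callback: when any boosting alias is "dart",
# early stopping is disabled and a warning is emitted.
def early_stopping_enabled(params):
    using_dart = any(
        params.get(alias) == "dart"
        for alias in ("boosting", "boosting_type", "boost")
    )
    if using_dart:
        warnings.warn("Early stopping is not available in dart mode")
    return not using_dart

print(early_stopping_enabled({"boosting": "dart"}))        # -> False
print(early_stopping_enabled({"objective": "regression"})) # -> True
```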
@StrikerRUS Honestly, I do not see the reason why. But even without early stopping those numbers are wrong -- both the best iteration and the best score. The best score can be understood: it is due to the fact that DART always returns the last version (does it?). But the best iteration is completely off.
P. S. Despite the title of the question, my example above does not use early stopping.
Honestly, I do not see the reason why.
The reason is when using dart, the previous trees will be updated. For example, in your case, although iteration 34 is best, these trees are changed in the later iterations, as dart will update the previous trees. https://github.com/microsoft/LightGBM/issues/1893#issuecomment-444803315
But even without early stopping those number are wrong. Both best iteration and best score.
Best iteration and best score are set only when early stopping is enabled.
this is due to the fact that DART always returns the last version
You shouldn't use DART with early stopping.
@StrikerRUS I understand that DART changes trees. But I do not understand how this prevents it from keeping the best model. As long as the system can save models, it should be able to keep the best one.
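One possible workaround in that spirit is a custom callback that serializes a snapshot whenever the metric improves, so the best model survives even though dart mutates earlier trees in later iterations. A hedged sketch (the KeepBestModel name is made up; it assumes the documented callback env fields iteration, model, and evaluation_result_list, and the Booster's model_to_string() serialization):

```python
from types import SimpleNamespace

class KeepBestModel:
    # Hypothetical callback: keep a serialized copy of the model at the
    # best-scoring iteration, independent of early stopping.
    def __init__(self, higher_better=False):
        self.higher_better = higher_better
        self.best_iteration = 0
        self.best_score = None
        self.best_model_str = None

    def __call__(self, env):
        # evaluation_result_list entries are tuples of
        # (dataset_name, metric_name, value, is_higher_better)
        _, _, value, _ = env.evaluation_result_list[0]
        improved = self.best_score is None or (
            value > self.best_score if self.higher_better else value < self.best_score
        )
        if improved:
            self.best_iteration = env.iteration + 1  # env.iteration is 0-based
            self.best_score = value
            self.best_model_str = env.model.model_to_string()

# Exercise the bookkeeping with a stand-in for the real callback env:
fake_model = SimpleNamespace(model_to_string=lambda: "snapshot")
cb = KeepBestModel(higher_better=False)
for it, score in enumerate([0.3, 0.1, 0.2]):
    cb(SimpleNamespace(iteration=it, model=fake_model,
                       evaluation_result_list=[("valid", "m", score, False)]))
print(cb.best_iteration, cb.best_score)  # -> 2 0.1
```

With a real Booster, the callback would be passed via callbacks=[cb] to lgb.train, and the snapshot restored afterwards with lgb.Booster(model_str=cb.best_model_str).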
Description
When the fitting is terminated by reaching the maximal allowed number of trees, the best score and best iteration are not set correctly. In the example below I use the early_stopping callback, which does not stop training since the maximal number of trees is reached before the improvement stops.

Reproducible example

Example output
Note that both the best iteration (model.best_iteration) and the best score (model.best_score['valid']) are incorrect.

Environment info
LightGBM version or commit hash: 3.3.1

Command(s) you used to install LightGBM

Additional Comments