microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/

Cannot add validation data if the validation data is loaded from a file. #2552

Closed. kenkoooo closed this 4 years ago

kenkoooo commented 4 years ago

Environment info

Operating System: Ubuntu 18.04

CPU/GPU model:

C++/Python/R version: Python 3.7

LightGBM version or commit hash: 2.3.0

Error message

[LightGBM] [Fatal] Cannot add validation data, since it has different bin mappers with training data
Traceback (most recent call last):
  File "bug.py", line 34, in <module>
    model = lgb.train(params, train_set=d_train, valid_sets=d_valid, early_stopping_rounds=100)
  File "/home/knakamura/.pyenv/versions/ashrae/lib/python3.7/site-packages/lightgbm/engine.py", line 232, in train
    booster.add_valid(valid_set, name_valid_set)
  File "/home/knakamura/.pyenv/versions/ashrae/lib/python3.7/site-packages/lightgbm/basic.py", line 1845, in add_valid
    data.construct().handle))
  File "/home/knakamura/.pyenv/versions/ashrae/lib/python3.7/site-packages/lightgbm/basic.py", line 47, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Cannot add validation data, since it has different bin mappers with training data

Reproducible examples

The following one works:

d_train = lgb.Dataset(data=X_train, label=y_train)
d_valid = lgb.Dataset(data=X_valid, label=y_valid, reference=d_train)

# d_train.save_binary("train.bin")
# d_valid.save_binary("valid.bin")
#
# d_train = lgb.Dataset("train.bin")
# d_valid = lgb.Dataset("valid.bin", reference=d_train)

model = lgb.train(params, train_set=d_train, valid_sets=d_valid, early_stopping_rounds=100)

But the following one doesn't work:

d_train = lgb.Dataset(data=X_train, label=y_train)
d_valid = lgb.Dataset(data=X_valid, label=y_valid, reference=d_train)

d_train.save_binary("train.bin")
d_valid.save_binary("valid.bin")

d_train = lgb.Dataset("train.bin")
d_valid = lgb.Dataset("valid.bin", reference=d_train)

model = lgb.train(params, train_set=d_train, valid_sets=d_valid, early_stopping_rounds=100)
kenkoooo commented 4 years ago

It works now. I don't know why it failed so many times before. :disappointed:

nonsignificantp commented 4 years ago

I'm currently having the same problem with a similar hardware/software configuration. Any clue why this is happening?

Update: Never mind, now I get it. This error is related to splitting categorical columns. You'll get this error when the train or validation set contains categorical values that do not appear in the other one.
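
If the cause is the one described in the update above, a quick way to check is to compare the distinct values of each categorical column across the two splits before building the Datasets. This is only a minimal sketch: `unshared_categories` and the toy column values are hypothetical names made up for illustration, not part of LightGBM, and the check uses plain Python sets.

```python
def unshared_categories(train_values, valid_values):
    """Return the categories that appear in only one of the two splits."""
    train_seen = set(train_values)
    valid_seen = set(valid_values)
    only_in_train = sorted(train_seen - valid_seen)
    only_in_valid = sorted(valid_seen - train_seen)
    return only_in_train, only_in_valid


# Hypothetical toy data standing in for one categorical feature column.
train_col = ["office", "school", "retail", "office"]
valid_col = ["office", "lodging", "school"]

only_train, only_valid = unshared_categories(train_col, valid_col)
print("only in train:", only_train)
print("only in valid:", only_valid)
```

If either list is non-empty for a column, that column is a candidate for the mismatch. One possible remedy (an assumption, not something confirmed in this thread) is to encode both splits against a single shared list of categories before creating the Datasets.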