microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Reloading dataset broken with init_model #6144

Open adfea9c0 opened 11 months ago

adfea9c0 commented 11 months ago

Description

Dataset has a save_binary function, and the docstring for the data argument in Dataset suggests that this is where you should pass the path to such a saved dataset. However, I cannot get this to work correctly in combination with an init_model.

My goal here is to save both the dataset binary and the model so I can continue training later without reconstructing either the model or the dataset.

Reproducible example

Here is the setup, similar to my other bug report.

import numpy as np
import lightgbm as lgb

np.random.seed(0)
X, y = np.random.normal(size=(10_000, 20)), np.random.normal(size=(10_000,))

params = {
    "verbose": -1,
    "seed": 1,
    "num_iterations": 10,
    "bagging_freq": 1,
    "bagging_fraction": 0.5
}
dataset_bin = "dataset.bin"
model_txt = "model.txt"

# Train 10 trees
ds = lgb.Dataset(X, label=y, params=params)
model = lgb.train(params, train_set=ds)
ds.save_binary(dataset_bin)
model.save_model(model_txt)
del ds
del model

Loading and training without init_model goes fine:

ds = lgb.Dataset(data=dataset_bin, params=params)
model = lgb.train(params, train_set=ds) #, init_model=model_txt)
model.num_trees()
>>> 10

But with init_model it fails -- the stack trace suggests that the init_model tries to read the dataset to create initial predictions, but it does not recognize that the data is a binary Dataset file:

ds = lgb.Dataset(data=dataset_bin, params=params)
model = lgb.train(params, train_set=ds, init_model=model_txt)
>>> [LightGBM] [Fatal] Unknown format of training data. Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.
LightGBMError                             Traceback (most recent call last)
/tmp/ipykernel_3260742/1081693646.py in <cell line: 27>()
     25 # Train 10 more trees
     26 ds = lgb.Dataset(data=dataset_bin, params=params)
---> 27 model = lgb.train(params, train_set=ds, init_model=model_txt)
     28 model.num_trees()

/dev/shm/<redacted>/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    298     # construct booster
    299     try:
--> 300         booster = Booster(params=params, train_set=train_set)
    301         if is_valid_contain_train:
    302             booster.set_train_data_name(train_data_name)

/dev/shm/<redacted>/basic.py in __init__(self, params, train_set, model_file, model_str)
   3569                 )
   3570             # construct booster object
-> 3571             train_set.construct()
   3572             # copy the parameters from train_set
   3573             params.update(train_set.get_params())

/dev/shm/<redacted>/basic.py in construct(self)
   2457             else:
   2458                 # create train
-> 2459                 self._lazy_init(
   2460                     data=self.data,
   2461                     label=self.label,

/dev/shm/<redacted>/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, feature_name, categorical_feature, params, position)
   2078                     "The init_score will be overridden by the prediction of init_model."
   2079                 )
-> 2080             self._set_init_score_by_predictor(
   2081                 predictor=predictor, data=data, used_indices=None
   2082             )

/dev/shm/<redacted>/basic.py in _set_init_score_by_predictor(self, predictor, data, used_indices)
   1914         num_data = self.num_data()
   1915         if predictor is not None:
-> 1916             init_score: Union[np.ndarray, scipy.sparse.spmatrix] = predictor.predict(
   1917                 data=data, raw_score=True, data_has_header=data_has_header
   1918             )

/dev/shm/<redacted>/basic.py in predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features)
   1046         if isinstance(data, (str, Path)):
   1047             with _TempFile() as f:
-> 1048                 _safe_call(
   1049                     _LIB.LGBM_BoosterPredictForFile(
   1050                         self._handle,

/dev/shm/<redacted>/basic.py in _safe_call(ret)
    235     """
    236     if ret != 0:
--> 237         raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
    238
    239

LightGBMError: Unknown format of training data. Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.

Environment info

LightGBM 4.0.0

jameslamb commented 4 months ago

Thanks for the excellent report! Sorry for the long delay in responding, this project is suffering from a lack of maintainer availability.

I was able to reproduce this on the most recent commit on master (https://github.com/microsoft/LightGBM/commit/5dfe7168d42898b66da3513eb8cab68ef2b23eeb), so it's still a problem.

I built the library like this (on an M2 Mac, with Python 3.11.7):

cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh install --precompile

And ran your example code. Saw exactly the same error you did.

I see the problem.

When you provide an init_model, LightGBM uses it to fill out initial scores to start boosting from.

https://github.com/microsoft/LightGBM/blob/5dfe7168d42898b66da3513eb8cab68ef2b23eeb/python-package/lightgbm/basic.py#L2042-L2046
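
As a rough sketch of what that amounts to (simplified, not the library's exact code path; this reuses the model_txt path and the raw X and ds from your example), the effect is roughly:

booster = lgb.Booster(model_file=model_txt)       # load the init_model
init_score = booster.predict(X, raw_score=True)   # raw-score predictions on the training data
ds.set_init_score(init_score)                     # boosting then continues from these scores

The important detail is that this prediction step needs the training data in a form the Booster can predict on.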

The Python-package code linked above has logic like "if data is a string or pathlib.Path, call LGBM_BoosterPredictForFile()".

https://github.com/microsoft/LightGBM/blob/5dfe7168d42898b66da3513eb8cab68ef2b23eeb/python-package/lightgbm/basic.py#L1150-L1163

LGBM_BoosterPredictForFile() only works with text files of raw data (CSV, TSV, or LibSVM).

https://github.com/microsoft/LightGBM/blob/5dfe7168d42898b66da3513eb8cab68ef2b23eeb/src/io/parser.cpp#L263-L266
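
With your setup, you can reproduce the same failure directly (a minimal sketch; it assumes the dataset.bin and model.txt files from your example already exist on disk):

booster = lgb.Booster(model_file="model.txt")
booster.predict("dataset.bin", raw_score=True)
# LightGBMError: Unknown format of training data. Only CSV, TSV, and LibSVM
# (zero-based) formatted text files are supported.

Because data is a path, predict() hands it to LGBM_BoosterPredictForFile(), which tries to parse the binary Dataset file as a text data file and fails.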

So this error comes from the fact that as of this writing, LightGBM's prediction routines (in Python, R, and C) do not support generating predictions on an already-constructed Dataset object.

#4546 is the main feature request tracking that work.

#5191 could also help in the Python package specifically, as an inefficient workaround.

In all those prior discussions about adding predict() support on the Dataset object, I'd never considered this specific case...thanks for bringing it to our attention, with a clear and reproducible example.

jameslamb commented 4 months ago

Until #4546 is resolved, the best workaround I can think of is to do something like the following:


import numpy as np
import lightgbm as lgb

np.random.seed(0)
X, y = np.random.normal(size=(10_000, 20)), np.random.normal(size=(10_000,))

params = {
    "verbose": -1,
    "seed": 1,
    "num_iterations": 10,
    "bagging_freq": 1,
    "bagging_fraction": 0.5
}
dataset_bin = "dataset.bin"
model_txt = "model.txt"

# save the raw training data
np.save("data.npy", X)
np.save("label.npy", y)

# train a model and save it
ds = lgb.Dataset(X, label=y, params=params)
model = lgb.train(params, train_set=ds)
model.save_model(model_txt)
model.num_trees()
# 10

# save the Dataset in binary format
ds.save_binary(dataset_bin)

# clear everything out of memory, to simulate stopping this
# process and starting a new one
del ds
del model
del X
del y

# load the Dataset and raw training data
X = np.load("data.npy")
y = np.load("label.npy")
ds = lgb.Dataset(data=dataset_bin, params=params)

# create a new Dataset, using the bin mappings from the original one
ds2 = lgb.Dataset(
    data=X,
    label=y,
    reference=ds
)

# continue training
model = lgb.train(params, train_set=ds2, init_model=model_txt)
model.num_trees()
# 20

That's inefficient relative to being able to just use a binary Dataset file and a model file together as in your original post. It comes with some undesirable characteristics, the main one being that you have to save the raw training data and load it again alongside the binary Dataset file.

BUT... this should at least be faster than reconstructing a new Dataset from the raw data. By loading the already-created one and passing it through reference=, you are able to avoid needing to re-do the (potentially expensive) process of finding all the bin boundaries. In this pattern, LightGBM will take the bin boundaries for features from ds and just directly convert the raw data into that binned representation in ds2.

jameslamb commented 4 months ago

Realized today that there was an earlier issue documenting exactly the same thing (but in a different way, and with fewer details provided).

I've closed that in favor of keeping the discussion here.

see https://github.com/microsoft/LightGBM/issues/4311#issuecomment-2073900693