adfea9c0 opened this issue 11 months ago
Thanks for the excellent report! Sorry for the long delay in responding, this project is struggling from a lack of maintainer availability.
I was able to reproduce this on the most recent commit on master
(https://github.com/microsoft/LightGBM/commit/5dfe7168d42898b66da3513eb8cab68ef2b23eeb), so it's still a problem.
Built the library like this (on an M2 Mac, with Python 3.11.7):

```shell
cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh install --precompile
```
And ran your example code. Saw exactly the same error you did.
I see the problem.

When you provide an `init_model`, LightGBM uses it to fill out initial scores to start boosting from. That code in the Python package has logic like "if `data` is a string or `pathlib.Path`, call `LGBM_BoosterPredictForFile()`". `LGBM_BoosterPredictForFile()` only works with text files of raw data (CSV, TSV, or LibSVM).
So this error comes from the fact that, as of this writing, LightGBM's prediction routines (in Python, R, and C) do not support generating predictions on an already-constructed `Dataset` object.

In all those prior discussions about adding `predict()` support on the `Dataset` object, I'd never considered this specific case... thanks for bringing it to our attention, with a clear and reproducible example.
Until #4546 is resolved, the best workaround I can think of is to do something like the following:

- save the `Dataset` in binary format, and the raw training data
- create a new `Dataset` from the raw training data, using the one you loaded from that binary file as a reference

Like this:
```python
import numpy as np
import lightgbm as lgb

np.random.seed(0)
X, y = np.random.normal(size=(10_000, 20)), np.random.normal(size=(10_000,))

params = {
    "verbose": -1,
    "seed": 1,
    "num_iterations": 10,
    "bagging_freq": 1,
    "bagging_fraction": 0.5
}

dataset_bin = "dataset.bin"
model_txt = "model.txt"

# save the raw training data
np.save("data.npy", X)
np.save("label.npy", y)

# train a model and save it
ds = lgb.Dataset(X, label=y, params=params)
model = lgb.train(params, train_set=ds)
model.save_model(model_txt)
model.num_trees()
# 10

# save the Dataset in binary format
ds.save_binary(dataset_bin)

# clear everything out of memory, to simulate stopping this
# process and starting a new one
del ds
del model
del X
del y

# load the Dataset and raw training data
X = np.load("data.npy")
y = np.load("label.npy")
ds = lgb.Dataset(data=dataset_bin, params=params)

# create a new Dataset, using the bin mappings from the original one
ds2 = lgb.Dataset(
    data=X,
    label=y,
    reference=ds
)

# continue training
model = lgb.train(params, train_set=ds2, init_model=model_txt)
model.num_trees()
# 20
```
That's inefficient relative to being able to just use a binary `Dataset` file and a model file together, as in your original post. It comes with some undesirable characteristics:

- the `Dataset` object and copies of it have to all live in memory together at the same time

BUT... this should at least be faster than reconstructing a new `Dataset` from the raw data. By loading the already-created one and passing it through `reference=`, you are able to avoid needing to re-do the (potentially expensive) process of finding all the bin boundaries. In this pattern, LightGBM will take the bin boundaries for features from `ds` and just directly convert the raw data into that binned representation in `ds2`.
Realized today that there was an earlier issue documenting exactly the same thing (but in a different way, and with fewer details provided). I've closed that in favor of keeping the discussion here.

see https://github.com/microsoft/LightGBM/issues/4311#issuecomment-2073900693
Description

`Dataset` has a `save_binary` function, and the docstring for the `data` argument in `Dataset` suggests that is where you should input the path to this dataset; however, I cannot get this to work correctly in combination with an `init_model`.

My goal here is to save both the dataset binary and the model so I can continue training later without reconstructing either the model or the dataset.
Reproducible example

Here is the setup, similar to my other bug report.

Loading and training without `init_model` goes fine:

But then with `init_model` it fails -- the stack trace suggests that the `init_model` tries to read the dataset to create initial predictions but doesn't seem to be able to understand that it is a binary file:

Environment info
LightGBM 4.0.0