Open iwanko opened 2 years ago
Thanks for using LightGBM!
Can you please share a minimal, reproducible example? For example, using one of the freely-available datasets from `sklearn.datasets` in scikit-learn.
@TremaMiguel 's example in https://github.com/microsoft/LightGBM/issues/4951#issue-1104736584 is a great model of a small, self-contained reproduction used to demonstrate an issue.
Thanks very much for updating the description with a reproducible example! Excellent write-up, we really appreciate it.
I can confirm that on the most recent published version of `lightgbm` (3.3.2) and on the latest commit on `master` (https://github.com/microsoft/LightGBM/commit/f85dfa2c402cc42e3ecf1a960d84a9ceeac908c7), the provided code raises the following error:

LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
> How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)
Note that "constructed" has a special meaning in LightGBM. It doesn't just mean "called `lgb.Dataset()`".

LightGBM does some preprocessing, like binning continuous features into histograms, dropping unsplittable features, encoding categorical features, and more. That preprocessing is what this project refers to as "constructing" a Dataset.
When you initially call `lgb.Dataset()` in the Python package, the returned Python object holds information like the raw data and the parameters to use in that preprocessing. When the `.construct()` method is called on that object, LightGBM passes the raw data and parameters to C++ code like `LGBM_DatasetCreateFromMat()`. That code initializes a LightGBM `Dataset` object in memory and returns a pointer to it, which is stored in `Dataset.handle` on the Python side.
Once that `Dataset` object has been constructed, LightGBM no longer needs your raw input data (e.g. the `numpy` array passed into `lgb.Dataset()`). So, by default, it removes its copy of that data.
So, back to your example... when you first run `lgb_train = lgb.Dataset(...)`, you've created a `Dataset` object on the Python side, but it hasn't been "constructed" yet. The first time you use that object for training, LightGBM will "construct" it.
So if you didn't call the `.construct()` method on the `Dataset` before training, then its first usage is also when it's constructed.
For now, to re-use the same Dataset for training continuation, I think you'll have to set `free_raw_data=False` when first calling `lgb.Dataset()`.
Looks like that is exactly what this project does in its tests for training continuation.
But I think in the future, LightGBM should support the pattern you've described above. I'm not exactly sure where to make changes, but it makes sense to me that you might want to perform continued training on the same Dataset like this.
Linking some relevant discussions: #2899, #2906
Description
If the training dataset was constructed with `free_raw_data=True`, it is possible to use it only once. Trying to continue training (using the `init_model` parameter) leads to an error:

LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after it's first usage.)
Reproducible example
Environment info
LightGBM version: 3.3.1
Python version: 3.9.7