microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? #4965

Open iwanko opened 2 years ago

iwanko commented 2 years ago

Description

If the training dataset was constructed with free_raw_data = True, it is possible to use it only once. Trying to continue training (using the init_model parameter) leads to an error:

LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Reproducible example

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

model = lgb.train(train_params, lgb_train, num_boost_round=10)
model = lgb.train(train_params, lgb_train, num_boost_round=10, init_model=model)

Environment info

LightGBM version: 3.3.1
Python version: 3.9.7

jameslamb commented 2 years ago

Thanks for using LightGBM!

Can you please share a minimal, reproducible example? For example, using one of the freely-available datasets from sklearn.datasets in scikit-learn.

@TremaMiguel 's example in https://github.com/microsoft/LightGBM/issues/4951#issue-1104736584 is a great example of a small, self-contained example used to demonstrate an issue.

jameslamb commented 2 years ago

Thanks very much for updating the description with a reproducible example! Excellent write-up, we really appreciate it.

I can confirm that on the most recent published version of lightgbm (3.3.2) and on the latest commit on master (https://github.com/microsoft/LightGBM/commit/f85dfa2c402cc42e3ecf1a960d84a9ceeac908c7), the provided code raises the following error

Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.


> How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Note that "constructed" has a special meaning in LightGBM. It doesn't mean "called lgb.Dataset()".

LightGBM does some preprocessing like binning continuous features into histograms, dropping unsplittable features, encoding categorical features, and more. That preprocessing is what this project refers to as "constructing" a Dataset.

When you initially call lgb.Dataset() in the Python package, the returned Python object holds information like the raw data and the parameters to use in that preprocessing. When the .construct() method is called on that object, LightGBM passes the raw data and parameters to C++ code like LGBM_DatasetCreateFromMat().

https://github.com/microsoft/LightGBM/blob/f85dfa2c402cc42e3ecf1a960d84a9ceeac908c7/src/c_api.cpp#L1071

That code initializes a LightGBM Dataset object in memory and returns a pointer to it, which is stored in Dataset.handle on the Python side.

Once that Dataset object has been constructed, LightGBM no longer needs your raw input data (e.g. the numpy array passed into lgb.Dataset()). So, by default, it removes its copy of that data.

https://github.com/microsoft/LightGBM/blob/f85dfa2c402cc42e3ecf1a960d84a9ceeac908c7/python-package/lightgbm/basic.py#L1805-L1806

So, back to your example...when you first run lgb_train = lgb.Dataset(...), you've created a Dataset object on the Python side, but it hasn't been "constructed" yet. The first time you use that object for training, LightGBM will "construct" it.

https://github.com/microsoft/LightGBM/blob/f85dfa2c402cc42e3ecf1a960d84a9ceeac908c7/python-package/lightgbm/basic.py#L2577

So if you didn't call the .construct() method on the Dataset before training, then its first use is also when it's constructed.

Example code showing this:

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

# confirm that the Dataset handle is None before training
assert lgb_train.handle is None

model = lgb.train(train_params, lgb_train, num_boost_round=10)

# now the Dataset holds a pointer to a constructed Dataset on the C++ side
print(lgb_train.handle)
# c_void_p(140426868894112)
```

So, what should be done about this?

For now, to re-use the same Dataset for training continuation, I think you'll have to set free_raw_data=False when first calling lgb.Dataset().

Looks like that is exactly what this project does in its tests for training continuation.

https://github.com/microsoft/LightGBM/blob/ce486e5b45a6f5e67743e14765ed139ff8d532e5/tests/python_package_test/test_engine.py#L900-L911

But I think in the future, LightGBM should support the pattern you've described above. I'm not exactly sure where to make changes, but it makes sense to me that you might want to perform continued training on the same Dataset like this.

Linking some relevant discussions: #2899, #2906