microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Python sklearn boosters store the training/validation set, increasing memory use #350

Closed chriscramer closed 7 years ago

chriscramer commented 7 years ago

When using sklearn (in general), it's common to have many different models. For example, to create a multi-output regressor you need sklearn's MultiOutputRegressor class, which wraps the creation of multiple regressors into a single object. I'm doing the same thing, but with a very large number of models (around 1500).

It seems that the LGBM classes include the booster, which stores the training and validation sets. If you are using a MultiOutputRegressor model, this means that each regressor keeps a complete copy of the input data, which adds up quickly.

On my machine:

Mac OS X Sierra, Intel i7 processor, 16 GB RAM; C++: gcc 6.2, Python: 2.7, scikit-learn: 0.18.1

This consumes all of the RAM fairly quickly and the process eventually dies with a SIGKILL.

Is there a reason that the booster needs to store a copy of the training/validation data after training?

To reproduce, try the following:

import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
import numpy as np
X = np.random.random((50000, 600))   # 50,000 samples, 600 features
y = np.random.random((50000, 192))   # 192 output columns -> 192 wrapped regressors
gbr = lgb.LGBMRegressor()
mor = MultiOutputRegressor(gbr)
mor.fit(X, y)

Note that the training data, X and y, is around 300 MB total, but once you start fitting the regressor, memory use just keeps increasing and eventually reaches around 60 GB.
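A rough back-of-the-envelope estimate (my own numbers, assuming float64 arrays and one retained copy of X per fitted regressor) shows how this adds up:

X_bytes = 50000 * 600 * 8            # ~240 MB for X as float64
y_bytes = 50000 * 192 * 8            # ~77 MB for y as float64
n_estimators = 192                   # MultiOutputRegressor fits one regressor per output column
worst_case = n_estimators * X_bytes  # ~46 GB if every regressor keeps its own copy of X,
                                     # the same order of magnitude as the ~60 GB observed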

guolinke commented 7 years ago

If you want to solve this, you should use the lgb.Dataset class and call training via lgb.train.

LightGBM doesn't use X and y directly. It first constructs an lgb.Dataset from X and y, and then trains on that. In your code this construction happens many times, because the data you pass in is X and y rather than an lgb.Dataset, so each booster ends up holding its own constructed lgb.Dataset.
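A minimal sketch of the lgb.Dataset / lgb.train workflow being described here (illustrative data and parameters; the label must be a single column):

import lightgbm as lgb
import numpy as np

X = np.random.random((50000, 600))
y = np.random.random(50000)          # one target column; lgb.Dataset expects a 1-D label

train_set = lgb.Dataset(X, label=y)  # feature binning happens once, here
booster = lgb.train({"objective": "regression"}, train_set, num_boost_round=100)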

chriscramer commented 7 years ago

I had already considered that, but the problem with that approach is that it breaks the multiclass/multioutput wrappers in sklearn. You would have to rewrite those sklearn classes to call lgb.train instead of calling the wrapped class's .fit method.

guolinke commented 7 years ago

@chriscramer you can also override the fit method so that it supports training from an lgb.Dataset.

chriscramer commented 7 years ago

The fit method in the sklearn classes? That sounds a bit awkward.

guolinke commented 7 years ago

I am not familiar with sklearn, but I think the source code of MultiOutputRegressor (or whichever class you want to change) is available. You can inherit from it and override its fit function based on the original code.

chriscramer commented 7 years ago

Sure, but I could also inherit from LGBMRegressor and have it delete the training and validation sets from the booster, right?

I guess I'm questioning the need for the booster to hang onto the training and validation sets after training is complete.

guolinke commented 7 years ago

Sure, you can delete the datasets after training finishes.
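A minimal sketch of the idea in this exchange (the class name is made up, and Booster.free_dataset() is assumed to be available; it exists in later LightGBM releases):

import lightgbm as lgb

class SlimLGBMRegressor(lgb.LGBMRegressor):
    """LGBMRegressor that drops the booster's datasets after fitting (hypothetical helper)."""
    def fit(self, X, y, **kwargs):
        super(SlimLGBMRegressor, self).fit(X, y, **kwargs)
        # Release the constructed training/validation Datasets held by the booster.
        self._Booster.free_dataset()
        return self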

chriscramer commented 7 years ago

Actually, looking at the code some more, I would have to go that route even if I did use lgb.train. The resulting booster from that method still contains the training and validation sets.

Basically, the model returned by lgb.train contains both the training and validation data used to create it. FWIW, I tend to think this is not the best approach: the model is the model, regardless of the training data used to create it. If I'm reading things right, the training/validation data are not saved when the model is serialized, so I don't know why they are kept at all after training.

guolinke commented 7 years ago

The reason is that the booster's training function is called one iteration at a time (for flexible usage, e.g. changing parameters or data during training), so it doesn't know when training is finished. Also, the booster returned by lgb.train can be used to continue training.

@wxchan I think we can add a parameter free_dataset_after_train to lgb.train, and set it to True in the sklearn interfaces since they don't interact with lgb.Dataset.
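A minimal sketch of the iteration-by-iteration training described above (illustrative data and parameters); Booster.update() runs a single boosting iteration on the Dataset the booster has retained, which is why it keeps the reference:

import lightgbm as lgb
import numpy as np

X = np.random.random((1000, 20))
y = np.random.random(1000)
train_set = lgb.Dataset(X, label=y)

booster = lgb.train({"objective": "regression"}, train_set, num_boost_round=10)
# Because the booster still references train_set, it can keep boosting later:
for _ in range(5):
    booster.update()   # one more iteration on the retained training data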

chriscramer commented 7 years ago

I think that adding the parameter makes a lot of sense and would certainly address the issues I'm seeing. Thanks for looking into this!

wxchan commented 7 years ago

@guolinke can we just remove all datasets at the end of sklearn fit?

guolinke commented 7 years ago

@wxchan you should set the datasets in the booster to None as well.

guolinke commented 7 years ago

@wxchan, it won't work if you just change the training data. However, you can achieve it by setting this booster as the "init predictor" of the datasets.

chriscramer commented 7 years ago

For what it's worth, my temporary solution to this is to subclass LGBMRegressor so that after fitting it uses cPickle to serialize and deserialize the _Booster, since I can't seem to get the _Booster to release the train_set otherwise.
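A minimal sketch of that workaround (the class name is mine; pickle stands in for cPickle on Python 2, and the booster is assumed to be picklable, which is what makes the reloaded copy drop its Dataset reference):

import pickle                     # cPickle on Python 2
import lightgbm as lgb

class PickleFreedLGBMRegressor(lgb.LGBMRegressor):
    """LGBMRegressor whose fit round-trips the booster through pickle (hypothetical helper)."""
    def fit(self, X, y, **kwargs):
        super(PickleFreedLGBMRegressor, self).fit(X, y, **kwargs)
        # Serialize and deserialize the booster; the reloaded copy no longer
        # references the constructed training Dataset.
        self._Booster = pickle.loads(pickle.dumps(self._Booster))
        return self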

wxchan commented 7 years ago

I think we can just free the datasets in the sklearn interface, and avoid adding more parameters to lgb.train.

guolinke commented 7 years ago

@wxchan Sorry, I missed this. Yeah, I think it can work, but remember to set all references to the Dataset to None.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.