If you want to solve this, you should use the lgb.Dataset class and call the training via lgb.train.
LightGBM doesn't use X and y directly. It first constructs an lgb.Dataset from X and y, and then uses that to train. With your code, this construction happens many times, since the provided data is X and y rather than an lgb.Dataset, so each booster ends up with its own constructed lgb.Dataset.
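For example, something along these lines (toy data and parameter values, just to illustrate the lgb.Dataset / lgb.train flow described above):

```python
import numpy as np
import lightgbm as lgb

# Toy data purely for illustration.
X = np.random.rand(500, 10)
y = np.random.rand(500)

# Construct the Dataset once, then train from it directly.
train_data = lgb.Dataset(X, label=y)
params = {"objective": "regression", "metric": "l2", "verbosity": -1}
booster = lgb.train(params, train_data, num_boost_round=50)
```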
I had already considered that, but the problem with that approach is that it breaks the multiclass/multioutput wrappers in sklearn. You would have to rewrite those sklearn classes to call lgb.train instead of calling the wrapped class's .fit method.
@chriscramer you can also override the fit method to let it support training from an lgb.Dataset.
The fit method in the sklearn classes? That sounds a bit awkward.
I am not familiar with sklearn, but I think the source code of MultiOutputRegressor (or whichever class you want to change) is available. You can inherit from it and override its fit function based on its original code.
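A rough skeleton of that idea (a hypothetical class; the real MultiOutputRegressor.fit does validation and parallelism that is omitted here):

```python
import numpy as np
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor


class DatasetMultiOutputRegressor(MultiOutputRegressor):
    """Hypothetical subclass whose fit trains each output via lgb.train."""

    def __init__(self, params, num_boost_round=100):
        self.params = params
        self.num_boost_round = num_boost_round

    def fit(self, X, y):
        # Build an lgb.Dataset per output column and train it with lgb.train,
        # rather than fitting a separate wrapped sklearn estimator per output.
        self.boosters_ = [
            lgb.train(self.params, lgb.Dataset(X, label=y[:, k]),
                      num_boost_round=self.num_boost_round)
            for k in range(y.shape[1])
        ]
        return self

    def predict(self, X):
        return np.column_stack([b.predict(X) for b in self.boosters_])
```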
sure, but I could also inherit from the LGBMRegressor, and have it delete the training and validation sets from the booster, right?
I guess I'm questioning the need for the booster to hang onto the training and validation sets after training is complete.
sure, you can delete the dataset after finishing the training.
actually, looking at the code some more, I would have to go that route even if I did use lgb.train. The resulting booster from that method still contains the training and validation sets.
basically, the model returned by lgb.train contains both the training and validation data used to create the model. FWIW, I tend to think that this is not the best approach. The model is the model, regardless of the training data used to create it. If I'm reading things right, the training/validation data are not saved when the model is serialized, so I don't know why they are kept at all after training.
The reason is that the training function in the booster is called one iteration at a time (for flexible usage, e.g. changing parameters or data during training). It doesn't know when training is finished. The booster returned by lgb.train can also be used to continue training.
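For context, the continued-training use case mentioned here looks roughly like this (toy data; init_model is the lgb.train argument for resuming from an existing model):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(500, 10)
y = np.random.rand(500)
train_data = lgb.Dataset(X, label=y)
params = {"objective": "regression", "verbosity": -1}

# First round of training.
booster = lgb.train(params, train_data, num_boost_round=20)

# Later, keep training from the booster returned above.
booster = lgb.train(params, train_data, num_boost_round=20, init_model=booster)
```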
@wxchan I think we can add a parameter free_dataset_after_train to lgb.train, and set it to True in the sklearn interfaces since they don't interact with lgb.Dataset.
I think that adding the parameter makes a lot of sense and would certainly address the issues I'm seeing. Thanks for looking into this!
@guolinke can we just remove all datasets at the end of sklearn fit?
@wxchan you should set the datasets in the booster to None as well.
@wxchan, it won't work if you just change the training data. However, you can achieve it by setting this booster as the "init predictor" of the datasets.
For what it's worth, my temporary solution to this is to subclass LGBMRegressor so that after fitting it uses cPickle to serialize and deserialize the _Booster, since I can't seem to get the _Booster to release the train_set otherwise.
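That workaround might look roughly like this (a sketch, not the exact code from this comment; cPickle matches the Python 2.7 setup in this issue, on Python 3 it would be pickle):

```python
import cPickle  # Python 2.7; use pickle on Python 3

from lightgbm import LGBMRegressor


class SlimLGBMRegressor(LGBMRegressor):
    """LGBMRegressor that drops the booster's dataset references after fit."""

    def fit(self, X, y, **kwargs):
        super(SlimLGBMRegressor, self).fit(X, y, **kwargs)
        # A pickle round-trip rebuilds the booster from its serialized model
        # string, so the in-memory training Dataset is no longer referenced.
        self._Booster = cPickle.loads(cPickle.dumps(self._Booster))
        return self
```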
I think we can just free the datasets in the sklearn interface and avoid adding more parameters to lgb.train.
@wxchan Sorry, I missed this. Yeah, it can work, but remember to set all references to the Dataset to None.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
When using sklearn (in general), it's common to have many different models. For example, to create a multi-output regressor, you need to use sklearn's MultiOutputRegressor class, which wraps the creation of multiple regressors into a single object. I'm doing the same thing, but with a very large number of models (around 1500).
It seems that the LGBM classes include the booster, which stores the training and validation sets. If you are using a MultiOutputRegressor model, this means that each regressor has a complete copy of the input data, which adds up.
On my machine:
- Mac OS X Sierra
- Intel i7 processors
- 16 GB RAM
- C++: gcc 6.2
- Python: 2.7
- sklearn: 18.1
This consumes all of the RAM fairly quickly and the process eventually dies with a SIGKILL.
Is there a reason that the booster needs to store a copy of the training/validation data after training?
To reproduce, try the following:
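A sketch of the setup described in this issue (array sizes are placeholders, scaled down from the original data):

```python
import numpy as np
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor

# Placeholder shapes; the real data was roughly 300 MB with ~1500 outputs.
X = np.random.rand(5000, 50)
y = np.random.rand(5000, 200)

model = MultiOutputRegressor(lgb.LGBMRegressor())
model.fit(X, y)  # memory grows as each wrapped regressor's booster keeps data
```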
Note that the training data, X and y, is around 300 MB total, but once you start fitting the regressor, the amount of memory just keeps increasing and would get to around 60 GB.