zansibal closed this issue 6 months ago.
Thanks for using LightGBM.
Can you share the code you used to estimate that memory usage? For example, is that the memory usage of just the `Dataset` after construction, or peak memory usage throughout training?
I ask because it's possible that for a sufficiently large model (in terms of `n_trees * num_leaves`), the memory usage of the model could be larger than that of the `Dataset`.
Hi, thanks for the quick response.
I am using `nvidia-smi` to monitor the VRAM usage.
I checked just now and, from what I can see, there are three steps in the memory allocation:

- After `dataset.construct()`, the VRAM usage goes to 315 MB.
- During `lgb.train()` initialization, the VRAM jumps to 3800 MB.

Some of the training params:
```python
model_params = {
    'n_estimators': 400,
    'learning_rate': 0.01,
    'min_data_in_leaf': 7300,
    'num_leaves': 1000,
    'max_depth': -1,
    'boosting': 'gbdt',
    'objective': 'mse',
    'device_type': 'cuda',
    'max_bin': 63,
}
```
The final model takes about 35 MB of disk space when saving it down.
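As a rough back-of-the-envelope check (my own arithmetic, not from the thread): with the params above, the model can hold up to `n_estimators * num_leaves` leaves, so a 35 MB file works out to under ~100 bytes per leaf:

```python
n_estimators, num_leaves = 400, 1000  # values from the params above

max_leaves = n_estimators * num_leaves    # up to 400,000 leaves in total
bytes_per_leaf = 35_000_000 / max_leaves  # observed ~35 MB model file

print(max_leaves, round(bytes_per_leaf))  # 400000 leaves, ~88 bytes each
```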
Great, thanks for that!
So to me, it looks like this statement is not true:

> Smaller max_bin should decrease the memory footprint used during training. In my tests, it does not.

It seems that you did observe a smaller memory footprint using a smaller `max_bin` (e.g. 200 MB less VRAM going from 255 to 63 bins), and that the size of the model is the dominant source of memory usage in your application, not the `Dataset`.
`num_leaves=1000` will generate very large trees, and with `n_estimators=400` you're asking LightGBM to generate up to 400 of them.
I recommend trying some combination of the following to reduce the size of the model:
- `num_leaves`
- `n_estimators`
- `min_gain_to_split`
- `min_data_in_leaf`
- `max_depth` (good values of this will depend on `num_leaves`, see the discussion in #6402)

You can also try quantized training, which is available in the CUDA version since #5933. See https://lightgbm.readthedocs.io/en/latest/Parameters.html#use_quantized_grad. With quantized training, the gradients and hessians are represented with smaller data types. That allows you to trade some precision in exchange for lower memory usage.
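A minimal sketch of what enabling quantized training might look like on top of the earlier params. `use_quantized_grad`, `num_grad_quant_bins`, and `quant_train_renew_leaf` are documented LightGBM parameters; the specific values here are illustrative choices, not recommendations:

```python
# Hedged sketch: quantized-training options added to the training params.
# The values below are illustrative, not tuned recommendations.
model_params = {
    'objective': 'mse',
    'device_type': 'cuda',
    'max_bin': 63,
    # Represent gradients/hessians with low-bit integers instead of floats.
    'use_quantized_grad': True,
    'num_grad_quant_bins': 4,        # fewer bins -> less memory, less precision
    'quant_train_renew_leaf': True,  # re-fit leaf values to recover accuracy
}
```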
Thanks for taking the time.
Although I am not sure it is the model taking this amount of space, I am starting to realize other necessary data structures are consuming memory as well (like you mention the gradients and hessians).
I will experiment with quantized training. Thanks for the tip.
> Although I am not sure it is the model taking this amount of space, I am starting to realize other necessary data structures are consuming memory as well (like you mention the gradients and hessians).
You are totally right! It was a bit imprecise for me to say "the model".
The training-time memory usage has these 4 main sources:

- the raw training data
- the `Dataset` (which includes things like `init_score` and `weight`)
- the `Booster` (really "the model")
- other data structures used during training (like the gradients and hessians)

You can avoid the memory usage for the raw data by constructing a `Dataset` directly from a file (either a CSV/TSV/LibSVM file or a LightGBM `Dataset` binary file).
You can reduce the memory usage of the `Dataset` by using a smaller `max_bin` or a higher `min_data_in_bin`, or by removing irrelevant features before construction. In the Python package, if you construct a `Dataset` in the same process where you perform training, you can avoid LightGBM storing a copy of the raw data by passing `free_raw_data=True`.
You can reduce the memory usage of the Booster
by some of the strategies I mentioned in https://github.com/microsoft/LightGBM/issues/6319#issuecomment-2064772141.
You can reduce the memory usage of the other data structures by trying quantized training. If you have a lot of rows and any are identical or very similar, you could also try collapsing those into a single row and using weighted training to capture the relative representation of those samples in the whole training dataset.
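A numpy-only sketch of the row-collapsing idea, on a tiny made-up matrix (the resulting `counts` would then be passed as `weight=` when building the `lgb.Dataset`):

```python
import numpy as np

# Toy matrix with duplicated rows: features and label stacked together,
# so only rows identical in both X and y get merged.
Xy = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 4.0, 1.0],
    [1.0, 2.0, 0.0],
])

unique_rows, counts = np.unique(Xy, axis=0, return_counts=True)
X_unique, y_unique = unique_rows[:, :-1], unique_rows[:, -1]

# counts preserves each row's representation in the full dataset,
# e.g. lgb.Dataset(X_unique, label=y_unique, weight=counts)
print(len(unique_rows), counts.sum())  # 2 unique rows, weights sum to 4
```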
We should get more of this information into the docs, sorry 😅
The other complication here in your case is which of these data structures are stored in host memory, in the GPU's memory, or both. That's an area of active development in LightGBM right now. If you're familiar with CUDA and want to look through the code here, we'd welcome contributions that identify ways to cut out any unnecessary copies being held in both places.
Summary

Smaller `max_bin` should decrease the memory footprint used during training. In my tests, it does not.

Motivation

A lower memory requirement makes it possible to train on larger datasets. This is especially important in `gpu` and `cuda` mode, where VRAM is scarce.

Description
It is recommended to test different `max_bin` settings for `gpu` and `cuda` to speed up the training, like `15`, `63`, and `255`. While testing different settings, there was no significant change in the memory usage of the GPU. This is weird, as each value in the training array should require fewer bits (4 bits for `15`, 6 bits for `63`, and 8 bits for `255`). I can appreciate that it is hard to do, given that all of these sizes are equal to, or less than, 1 byte. Is it possible?

References
Test results from my particular dataset (running `mse` regression): data shape (41_865_312, 88) and 14.0 GB (float32) size in numpy before constructing the LightGBM dataset. (Table of GPU memory usage per `max_bin` setting omitted.)
Finally, the GPU memory usage is more than half that of the numpy memory usage (that is using single precision floats). Shouldn't the memory usage be a quarter of that (like 3500 MB)?
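The quarter-of-numpy expectation can be checked with quick arithmetic (shapes taken from the post above; one byte per value is what binned storage needs at `max_bin` ≤ 255, and the exact GB figure depends on decimal GB vs. GiB):

```python
rows, cols = 41_865_312, 88  # data shape from the post

numpy_gb = rows * cols * 4 / 1e9   # float32: 4 bytes per value
binned_gb = rows * cols * 1 / 1e9  # 1 byte per binned value (max_bin <= 255)

print(round(numpy_gb, 1), round(binned_gb, 1))  # ~14.7 GB vs ~3.7 GB
```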
Btw, the recently added `cuda` support is a tremendous improvement over the old `gpu`.