zansibal closed this issue 6 months ago.
Thanks for using LightGBM.
Can you share the code you used to estimate that memory usage? For example, is that the memory usage of just the `Dataset` after construction, or peak memory usage throughout training?
I ask because it's possible that for a sufficiently large model (in terms of `n_trees * num_leaves`), the memory usage of the model could be larger than that of the `Dataset`.
Hi, thanks for the quick response.
I am using `nvidia-smi` to monitor the VRAM usage.
I checked just now and, from what I can see, there are three steps in the memory allocation:

- After `dataset.construct()`, the VRAM usage goes to 315 MB.
- During `lgb.train()` initialization, the VRAM jumps to 3800 MB.

Some of the training params:
```python
model_params = {
    'n_estimators': 400,
    'learning_rate': 0.01,
    'min_data_in_leaf': 7300,
    'num_leaves': 1000,
    'max_depth': -1,
    'boosting': 'gbdt',
    'objective': 'mse',
    'device_type': 'cuda',
    'max_bin': 63,
}
```
The final model takes about 35 MB of disk space when saving it down.
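As a rough back-of-the-envelope check (my own arithmetic, not from the thread): with the params above, the model can hold up to `n_estimators * num_leaves` leaves, so a 35 MB file works out to under ~100 bytes per leaf:

```python
n_estimators, num_leaves = 400, 1000  # values from the params above

max_leaves = n_estimators * num_leaves    # up to 400,000 leaves in total
bytes_per_leaf = 35_000_000 / max_leaves  # observed ~35 MB model file

print(max_leaves, round(bytes_per_leaf))  # 400000 leaves, ~88 bytes each
```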
Great, thanks for that!
So to me, it looks like this statement is not true:

> Smaller max_bin should decrease the memory footprint used during training. In my tests, it does not.

It seems that you did observe a smaller memory footprint using a smaller `max_bin` (e.g. 200 MB less VRAM going from 255 to 63 bins), and that the size of the model is the dominant source of memory usage in your application, not the `Dataset`.
`num_leaves=1000` will generate very large trees, and with `n_estimators=400` you're asking LightGBM to generate up to 400 of them.
I recommend trying some combination of the following to reduce the size of the model:
- `num_leaves`
- `n_estimators`
- `min_gain_to_split`
- `min_data_in_leaf`
- `max_depth` (good values of this will depend on `num_leaves`, see the discussion in #6402)

You can also try quantized training, which is available in the CUDA version since #5933. See https://lightgbm.readthedocs.io/en/latest/Parameters.html#use_quantized_grad. With quantized training, the gradients and hessians are represented with smaller data types. That allows you to trade some precision in exchange for lower memory usage.
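A minimal sketch of what enabling quantized training might look like on top of the earlier params. `use_quantized_grad`, `num_grad_quant_bins`, and `quant_train_renew_leaf` are documented LightGBM parameters; the specific values here are illustrative choices, not recommendations:

```python
# Hedged sketch: quantized-training options added to the training params.
# The values below are illustrative, not tuned recommendations.
model_params = {
    'objective': 'mse',
    'device_type': 'cuda',
    'max_bin': 63,
    # Represent gradients/hessians with low-bit integers instead of floats.
    'use_quantized_grad': True,
    'num_grad_quant_bins': 4,        # fewer bins -> less memory, less precision
    'quant_train_renew_leaf': True,  # re-fit leaf values to recover accuracy
}
```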
Thanks for taking the time.
Although I am not sure it is the model taking this amount of space, I am starting to realize other necessary data structures are consuming memory as well (like you mention the gradients and hessians).
I will experiment with quantized training. Thanks for the tip.
> Although I am not sure it is the model taking this amount of space, I am starting to realize other necessary data structures are consuming memory as well (like you mention the gradients and hessians).
You are totally right! It was a bit imprecise for me to say "the model".
The training-time memory usage has these 4 main sources:

- the raw training data
- the `Dataset` (which includes things like `init_score` and `weight`)
- the `Booster` (really "the model")
- other data structures used during training (like the gradients and hessians)

You can avoid the memory usage for the raw data by constructing a `Dataset` directly from a file (either a CSV/TSV/LibSVM file or a LightGBM `Dataset` binary file).
You can reduce the memory usage of the `Dataset` by using a smaller `max_bin` or a higher `min_data_in_bin`, or by removing irrelevant features before construction. In the Python package, if you construct a `Dataset` in the same process where you perform training, you can avoid LightGBM storing a copy of the raw data by passing `free_raw_data=True`.
You can reduce the memory usage of the Booster
by some of the strategies I mentioned in https://github.com/microsoft/LightGBM/issues/6319#issuecomment-2064772141.
You can reduce the memory usage of the other data structures by trying quantized training. If you have a lot of rows and any are identical or very similar, you could also try collapsing those into a single row and using weighted training to capture the relative representation of those samples in the whole training dataset.
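A numpy-only sketch of the row-collapsing idea, on a tiny made-up matrix (the resulting `counts` would then be passed as `weight=` when building the `lgb.Dataset`):

```python
import numpy as np

# Toy matrix with duplicated rows: features and label stacked together,
# so only rows identical in both X and y get merged.
Xy = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 4.0, 1.0],
    [1.0, 2.0, 0.0],
])

unique_rows, counts = np.unique(Xy, axis=0, return_counts=True)
X_unique, y_unique = unique_rows[:, :-1], unique_rows[:, -1]

# counts preserves each row's representation in the full dataset,
# e.g. lgb.Dataset(X_unique, label=y_unique, weight=counts)
print(len(unique_rows), counts.sum())  # 2 unique rows, weights sum to 4
```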
We should get more of this information into the docs, sorry 😅
The other complication here in your case is which of these data structures are stored in host memory, in the GPU's memory, or both. That's an area of active development in LightGBM right now. If you're familiar with CUDA and want to look through the code here, we'd welcome contributions that identify ways to cut out any unnecessary copies being held in both places.
Summary

Smaller `max_bin` should decrease the memory footprint used during training. In my tests, it does not.

Motivation

A lower memory requirement makes it possible to train on larger datasets. This is especially important in `gpu` and `cuda` mode, where VRAM is scarce.

Description
It is recommended to test different `max_bin` settings for `gpu` and `cuda` to speed up the training, like `15`, `63`, and `255`. While testing different settings, there was no significant change in the memory usage of the GPU. This is weird, as each value in the training array should require fewer bits (4 bits for `15`, 6 bits for `63`, and 8 bits for `255`). I can appreciate that it is hard to do, given that all of these sizes are equal to, or less than, 1 byte. Is it possible?

References
Test results from my particular dataset (running `mse` regression): data shape (41_865_312, 88) and 14.0 GB (float32) size in numpy before constructing the LightGBM dataset. (Table of GPU memory usage per `max_bin` setting omitted.)
Finally, the GPU memory usage is more than half that of the numpy memory usage (that is using single precision floats). Shouldn't the memory usage be a quarter of that (like 3500 MB)?
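The quarter-of-numpy expectation can be checked with quick arithmetic (shapes taken from the post above; one byte per value is what binned storage needs at `max_bin` ≤ 255, and the exact GB figure depends on decimal GB vs. GiB):

```python
rows, cols = 41_865_312, 88  # data shape from the post

numpy_gb = rows * cols * 4 / 1e9   # float32: 4 bytes per value
binned_gb = rows * cols * 1 / 1e9  # 1 byte per binned value (max_bin <= 255)

print(round(numpy_gb, 1), round(binned_gb, 1))  # ~14.7 GB vs ~3.7 GB
```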
Btw, the recently added `cuda` support is a tremendous improvement over the old `gpu`.