Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns

microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

https://lightgbm.readthedocs.io/en/latest/

MIT License


Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns #6358

Open jaguerrerod opened 8 months ago

jaguerrerod commented 8 months ago

Is there a way to generate an lgb.Dataset by reading a file column-wise? I don't think so, but I'm not sure why this functionality doesn't exist. Dataset creation is a bottleneck that prevents us from utilizing all the RAM. Once the bins for each variable are created, especially if there are few like in my case (around 25 numeric values per variable), the dataset size is drastically reduced. However, to achieve this, we need to load the data into RAM, typically from a disk file, and this intermediate step consumes several times the RAM required by the lgb.Dataset. In practice, if we have X RAM, we can only use X/2 or even less. Would it not be possible to read a CSV where each row contains the data for one column, perform binning, and sequentially free up the RAM? Is there any other alternative to fully utilize all the RAM?

jmoralez commented 8 months ago

Hey @jaguerrerod, thanks for using LightGBM. You can create single column datasets one at a time and add them to your full dataset. Here's an example:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_features=5)
# create dataset with single column and target
ds = lgb.Dataset(X[:, [0]], y, feature_name=['x0']).construct()
# add the rest of the columns, these can be read from a file one at a time
for j in range(1, X.shape[1]):
    # name each added column after its index so the names continue from x0 (x1, x2, ...)
    ds.add_features_from(lgb.Dataset(X[:, [j]], feature_name=[f'x{j}']).construct())
print(ds.num_feature())
# 5
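
For the CSV case raised in the question, the same pattern could be driven from a file. The following is only a minimal sketch under stated assumptions: the CSV has a header row, every column is numeric with no missing values, and the helper names (`read_header`, `read_column`, `build_dataset_by_columns`) are hypothetical, not part of LightGBM. It uses `np.loadtxt` with `usecols`, which re-parses the file once per column, so it trades read time for lower peak memory.

```python
import csv

import numpy as np


def read_header(path):
    """Return the column names from the first line of the CSV."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def read_column(path, index):
    """Load a single numeric column as an (n_rows, 1) float64 array."""
    col = np.loadtxt(path, delimiter=",", skiprows=1, usecols=index, dtype=np.float64)
    return col.reshape(-1, 1)


def build_dataset_by_columns(path, label_name, params=None):
    """Build a constructed lgb.Dataset, holding one raw column in memory at a time."""
    import lightgbm as lgb  # imported here so the CSV helpers work without it

    names = read_header(path)
    label = read_column(path, names.index(label_name)).ravel()
    features = [(i, n) for i, n in enumerate(names) if n != label_name]

    # The first feature column carries the label, as in the example above.
    first_index, first_name = features[0]
    ds = lgb.Dataset(
        read_column(path, first_index),
        label=label,
        feature_name=[first_name],
        params=params,
    ).construct()

    # Each remaining column is binned on its own and then merged in.
    for index, name in features[1:]:
        single = lgb.Dataset(
            read_column(path, index), feature_name=[name], params=params
        ).construct()
        ds.add_features_from(single)
    return ds
```

Calling `build_dataset_by_columns("train.csv", "target")` (file and column names are placeholders) would return a constructed Dataset. Files with missing values or non-numeric columns would need a different reader.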

Please let us know if you have further doubts.

jaguerrerod commented 8 months ago

Great, thank you! Does it work in R? If not, could I construct the dataset in Python, save it, and load it from R?

jmoralez commented 8 months ago

I think the R package doesn't have that feature (I looked for calls to LGBM_DatasetAddFeaturesFrom in R's Dataset and didn't find any), but you should be able to save it from Python and load it in R.

jaguerrerod commented 8 months ago

It would be great to have this in the R interface, but at least using Python to generate the dataset incrementally is a workaround. Thank you.

jameslamb commented 8 months ago

> would be great to have this in the R interface

Would you like to contribute that? We'd welcome the help.

jaguerrerod commented 8 months ago

Unfortunately, I lack knowledge of C++ for this task. I could offer a reward if someone is willing to do the work for me, but I am unaware of where to post such a request. If anyone interested contacts me, we can discuss it further, and of course, it would be for incorporation into the master and sharing it with the community.

jameslamb commented 8 months ago

No problem, thanks anyway for considering it.