jaguerrerod opened 8 months ago
Hey @jaguerrerod, thanks for using LightGBM. You can create single-column datasets one at a time and add them to your full dataset. Here's an example:
import lightgbm as lgb
from sklearn.datasets import make_regression
X, y = make_regression(n_features=5)
# create dataset with single column and target
ds = lgb.Dataset(X[:, [0]], y, feature_name=['x0']).construct()
# add the rest of the columns, these can be read from a file one at a time
for j in range(1, X.shape[1]):
    ds.add_features_from(lgb.Dataset(X[:, [j]], feature_name=[f'x{j}']).construct())
print(ds.num_feature())
# 5
Please let us know if you have further doubts.
Great, thank you! Does it work in R? If not, could I construct the dataset in Python, save it, and load it from R?
I think the R package doesn't have that feature (I looked for calls to LGBM_DatasetAddFeaturesFrom in R's Dataset and didn't find any), but you should be able to save it from Python and load it in R.
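On the Python side, saving the constructed Dataset in LightGBM's binary format is a single call; here's a minimal sketch (the filename is just a placeholder):

# save the incrementally built Dataset to LightGBM's binary format
ds.save_binary('full_dataset.bin')

If I remember correctly, the R package's lgb.Dataset() also accepts a path to a binary Dataset file, so loading it on the R side should just be a matter of pointing it at that file.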
It would be great to have this in the R interface, but at least using Python to generate the dataset incrementally is a workaround. Thank you.
would be great to have this in the R interface
Would you like to contribute that? We'd welcome the help.
Unfortunately, I lack the C++ knowledge for this task. I could offer a reward if someone is willing to do the work, but I don't know where to post such a request. If anyone interested contacts me, we can discuss it further; the work would, of course, be incorporated into master and shared with the community.
No problem, thanks anyway for considering it.
Is there a way to generate an lgb.Dataset by reading a file column-wise? I don't think so, but I'm not sure why this functionality doesn't exist. Dataset creation is a bottleneck that prevents us from using all the available RAM. Once the bins for each variable are created, the dataset size drops drastically, especially when there are few distinct values (around 25 numeric values per variable in my case). However, to get there we first need to load the data into RAM, typically from a file on disk, and this intermediate step consumes several times the RAM required by the lgb.Dataset itself. In practice, if we have X RAM, we can only use X/2 or even less. Would it not be possible to read a CSV where each row contains the data for one column, perform the binning, and free the RAM sequentially? Is there any other alternative to fully utilize all the RAM?
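Until the library supports this directly, one workaround along the lines of the example above is to read a single column at a time from disk and free it right after it has been binned, so peak RAM stays around one raw column plus the growing Dataset. A rough sketch, assuming a standard row-per-sample CSV at data.csv with a target column named y (both names are placeholders); note that pandas still scans the whole file for each usecols read, so this trades extra I/O for low memory:

import gc
import pandas as pd
import lightgbm as lgb

csv_path = 'data.csv'  # hypothetical path
# read only the header to get the column names
cols = pd.read_csv(csv_path, nrows=0).columns.tolist()
feature_cols = [c for c in cols if c != 'y']
# load the target once
y = pd.read_csv(csv_path, usecols=['y'])['y'].to_numpy()

ds = None
for name in feature_cols:
    # load a single feature column from disk
    col = pd.read_csv(csv_path, usecols=[name])[name].to_numpy().reshape(-1, 1)
    if ds is None:
        ds = lgb.Dataset(col, y, feature_name=[name]).construct()
    else:
        ds.add_features_from(lgb.Dataset(col, feature_name=[name]).construct())
    del col
    gc.collect()  # drop the raw column before reading the next one
print(ds.num_feature())

With the default free_raw_data=True, construct() drops the raw array, so only the binned representation accumulates; LightGBM may warn that it can't keep the raw data in sync after add_features_from, which is expected (and harmless) here.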