Open Plenitude-ai opened 6 days ago
Thanks for using LightGBM. Someone will help shortly.
I noticed you double-posted this here and to Stack Overflow at the same time (Stack Overflow link). Please do not do that. Maintainers here also monitor the [lightgbm] tag on Stack Overflow. I could have been spending time preparing an answer here while another maintainer was spending time answering your Stack Overflow post, which would have been a waste of maintainers' limited attention that could otherwise have been spent improving this project. Double-posting also makes it less likely that others with a similar question will find the relevant discussion and answer.
Hello James, thank you for your reply. Yes, of course, I didn't have that in mind; I've just deleted my Stack Overflow post. Thank you for your dedication to this amazing library!
Hey @Plenitude-ai, thanks for the thorough description. LightGBM's Dataset doesn't save the data as it is contained in the original array; it puts features into bins, so what is saved is the bin each feature value was put into. Is that what you would like to get from lightgbm.Dataset.get_data()? Once the dataset has been saved there's no way to get the original data back.
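For illustration, a rough sketch of that behaviour (X and y below are placeholder arrays, not real data):

```python
import numpy as np
import lightgbm as lgb

# Placeholder data, just for illustration.
X = np.random.rand(500, 4)
y = np.random.rand(500)

# Build the Dataset (which bins the features) and save it to a binary file.
ds = lgb.Dataset(X, label=y)
ds.construct()
ds.save_binary("train.bin")

# Reload from the binary file: only the binned representation is stored,
# so get_data() gives back the file path, not the original numpy array.
ds_from_bin = lgb.Dataset("train.bin")
ds_from_bin.construct()
print(ds_from_bin.get_data())  # -> "train.bin"
```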
100% agree with everything @jmoralez said.
I'll add that there is an open feature request (#5191) for being able to dump out LightGBM's binned representation as an array, which would allow you to at least partially inspect the training data.
You could subscribe to notifications there to be notified if that feature is formally added to the library. And you could try some of the workarounds mentioned there, like https://github.com/microsoft/LightGBM/issues/5191#issuecomment-1742263175.
But only do that if there are genuine constraints that leave your application with access only to a LightGBM Dataset and not to the underlying data. If you can store the raw training data alongside the LightGBM Dataset (e.g. in Parquet, pickle, or npy format), you'll find that much easier and more useful than any of the workarounds described in #5191.
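For example, something like this (a sketch only; the file names are made up):

```python
import numpy as np
import lightgbm as lgb

# Placeholder data; file names are made up for the example.
X = np.random.rand(500, 4)
y = np.random.rand(500)

# Keep the raw features and label in a standard format...
np.save("train_features.npy", X)
np.save("train_label.npy", y)

# ...and save the binned Dataset next to them for training.
lgb.Dataset(X, label=y).save_binary("train.bin")

# Later: inspect the raw data from the .npy files,
# and train directly from the binary file without re-binning.
X_raw = np.load("train_features.npy")
booster = lgb.train({"objective": "regression", "verbosity": -1}, lgb.Dataset("train.bin"))
```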
I understand. It seems my comprehension of both the fundamental implementation and the purpose/objective of this class was incomplete (it's more low-level than I thought); thanks for pointing me in the right direction. I'll stick with pkl representations then, but I found it useful to have everything (array, label, group) in the same object. Yes, the issue you referred me to seems interesting and I'll definitely subscribe! May I ask why we can still access the feature names & labels? How are they saved in the bin representation? Thanks again for your time :)
May I ask why we can still access the feature names & labels? How are they saved in the bin representation?
LightGBM needs the exact values of the label (after light preprocessing like handling infinite values and NaNs) to calculate the loss, so it's always recoverable as a dense array from the Dataset object.
No preprocessing is done on feature names, so those also are always recoverable in their original form.
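A small sketch of that (placeholder data; get_label() and get_feature_name() are, as far as I know, the relevant accessors in the Python API):

```python
import numpy as np
import lightgbm as lgb

# Placeholder data and feature names, just for illustration.
X = np.random.rand(100, 3)
y = np.random.rand(100)
lgb.Dataset(X, label=y, feature_name=["f_a", "f_b", "f_c"]).save_binary("train.bin")

# Reload from the binary file: label values and feature names come back as-is,
# while the features themselves only exist in binned form.
ds_from_bin = lgb.Dataset("train.bin").construct()
print(ds_from_bin.get_label())         # dense numpy array of label values
print(ds_from_bin.get_feature_name())  # ['f_a', 'f_b', 'f_c']
```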
These data structures are almost always smaller than the raw features (and often MUCH smaller).
If you're interested in the lower-level details, I encourage you to look at the source code for the Dataset:
That is interesting. I went to look at the source code, but I have to say it's a bit above my coding experience, as I don't know C/C++; it's a bit hard to understand how/where the numpy array is converted into bins.

I went back to reading the documentation further and now realize that I had mixed up "bin" and ".bin"/binary. I didn't know about the bin representation of the data for memory optimization. I also found this phrase in the FAQ very enlightening: "LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers, categorical features and feature names etc., the Dataset objects are constructed when constructing a Booster. If you set free_raw_data=True (default), the raw data (with Python data struct) will be freed."

I'm thinking it might be interesting to add this short explanation to the header of the lgb.Dataset documentation, which was my first point of reference. Do you think it could be helpful?
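As a small illustration of that FAQ passage (a sketch with placeholder data, not a definitive example):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 4)  # placeholder features
y = np.random.rand(200)     # placeholder label

# free_raw_data=False keeps the Python-side numpy array alongside the binned data,
# so get_data() still returns the original array after the Booster is constructed.
ds = lgb.Dataset(X, label=y, free_raw_data=False)
booster = lgb.train({"objective": "regression", "verbosity": -1}, ds)
print(ds.get_data().shape)  # (200, 4)

# With the default free_raw_data=True, the raw array is freed once the Booster
# is constructed, and get_data() raises an error instead.
```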
Summary
From a binary file exported using the lgb.Dataset.save_binary() method, it is possible to retrieve the feature names (a list of strings) and the labels (y, as a numpy array). It is not, however, possible to retrieve the data as a numpy array: both lgb.Dataset.data and lgb.Dataset.get_data() return a string with the name of the binary file. In my opinion, they should return the numpy array, just like the label. Note that the group also seems to be inaccessible. The data is effectively contained in the binary file, because we are able to load it and train a booster with it, so we should be able to get it back from the newly created lgb.Dataset object.
Motivation
This would allow us to properly investigate exported datasets, e.g. by computing some statistics (mean, standard deviation, etc.) to understand the dataset before deciding whether to use it again to train a booster.
Description
The .data attribute and/or the .get_data() method should return the proper numpy array, and the group and/or .get_group() should return the proper list of ints.
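In other words, the requested behaviour would be roughly the following (hypothetical usage, not what LightGBM currently does):

```python
import lightgbm as lgb

# Hypothetical usage describing the request, NOT current LightGBM behaviour.
ds = lgb.Dataset("exported.bin")
ds.construct()

X = ds.get_data()        # requested: the feature matrix as a numpy array
                         # currently: the string "exported.bin"
groups = ds.get_group()  # requested: the query group sizes as a list of ints
```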
References
I executed the following script on MQ2008, an open Microsoft learning-to-rank dataset, so you should be able to reproduce the same results. I also provide the output so that you can see which attributes and methods should have a modified behaviour.
Which outputs: